
Voice

Casola converts text into natural-sounding speech. Navigate to /voice in Studio to get started.

Voice generation page with audio player and settings

  1. Select a model — Choose a TTS model from the dropdown. Casola currently supports Qwen3 TTS for speech synthesis. See the Models reference for the latest options.
  2. Pick a voice — Each model offers a set of voices with different characteristics. The dropdown shows the voice name, description, and language. Click the preview button next to the voice selector to hear a short sample before committing.
  3. Enter your text — Type or paste the text you want spoken. Press Enter to generate, or Shift+Enter to add a new line.
  4. Generate — Click the button (or press Enter) and your audio will be ready in a few seconds.

Choose your output format before generating:

| Format | Best for |
|--------|----------|
| MP3 | General use, small file size |
| WAV | Lossless quality, audio editing |
| OPUS | Streaming, low bandwidth |
| FLAC | Lossless with compression |
| AAC | Apple ecosystem, podcasts |

Use the Speed slider to set the speech rate from 0.25x to 4.0x. Set it before generating: it changes the output file itself, not just preview playback.
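When you set the speed through the API's `speed` field rather than the slider, it may help to keep the value inside the same range. The sketch below is illustrative only: it assumes the API accepts the same 0.25 to 4.0 range as the Studio slider, which is not confirmed here, and `clamp_speed` and `speech_payload` are hypothetical helper names.

```python
MIN_SPEED = 0.25  # lower bound of the Studio slider
MAX_SPEED = 4.0   # upper bound of the Studio slider


def clamp_speed(speed):
    """Pull a requested speed into the slider's range (assumed to match the API)."""
    return max(MIN_SPEED, min(MAX_SPEED, float(speed)))


def speech_payload(text, speed=1.0, model="Qwen3-TTS", voice="alloy",
                   response_format="mp3"):
    """Build a request body with the speed clamped to the assumed valid range."""
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": clamp_speed(speed),
    }
```

Clamping silently changes out-of-range values; raising an error instead may be preferable if callers should know their input was adjusted.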

How you write your text affects how it sounds:

  • Ellipsis (…) — Creates a longer pause
  • Commas — Add short pauses for natural pacing
  • CAPS — Emphasizes a word
  • Question marks — Produce rising intonation
  • Hyphens — E-nun-ci-ate individual syllables
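As a rough illustration of the cues above, here is one sample input per cue. The sentences are made up for this example, and how strongly each cue is rendered may vary by voice.

```python
# Illustrative inputs only: one made-up sentence per punctuation cue.
EXAMPLES = {
    "ellipsis": "Wait… I think I left the oven on.",
    "commas": "Slowly, carefully, she opened the door.",
    "caps": "I said I wanted the RED one.",
    "question": "You finished the whole report already?",
    "hyphens": "The word is pronounced fe-no-me-nal.",
}

for cue, text in EXAMPLES.items():
    print(f"{cue:>9}: {text}")
```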

Each generated audio clip appears in your history with the voice name, format, speed, and timestamp. You can:

  • Play the audio directly with the inline player (Space to play/pause, arrow keys to skip)
  • Download the file in your chosen format
  • Retry a failed generation
  • Delete individual entries or clear all history

All generated audio is automatically saved to your Library for later access.

To generate speech through the API, send a POST request to the OpenAI-compatible speech endpoint:
curl https://api.casola.ai/openai/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "Hello! Welcome to Casola.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output speech.mp3

The response is the binary audio file. Use --output to save it directly.

For longer text, use async mode to avoid timeouts:

# Submit
curl https://api.casola.ai/openai/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-TTS",
    "input": "A longer passage of text to convert to speech...",
    "voice": "alloy",
    "response_format": "wav",
    "async": true
  }'

Response (202):

{
  "id": "job_abc123",
  "object": "audio.speech.async",
  "status": "pending"
}

Poll for the result using GET /fal/requests/{id}.
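A polling loop for that endpoint might look like the sketch below. Several details are assumptions rather than documented behavior: that the polling path lives on `https://api.casola.ai`, that the poll response carries a `status` field like the submit response, and that `"pending"` is the only in-progress value. The names `fetch_status` and `wait_for_job` are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "https://api.casola.ai"  # assumption: same host as the speech endpoint


def fetch_status(job_id, token):
    """GET /fal/requests/{id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/fal/requests/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch, interval=2.0, timeout=300.0,
                 in_progress=("pending",)):
    """Call fetch(job_id) until status leaves in_progress or the timeout passes.

    in_progress lists the statuses treated as "not done yet"; only "pending"
    appears in the documented submit response, so extend it if needed.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch(job_id)
        if body.get("status") not in in_progress:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still in progress after {timeout}s")
        time.sleep(interval)
```

A call could look like `wait_for_job("job_abc123", lambda j: fetch_status(j, "YOUR_API_TOKEN"))`. Passing the fetch function in keeps the loop testable without network access.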

To list the available models and their voices:
curl https://api.casola.ai/api/voice/models \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Response:

{
  "voice_models": [
    {
      "model_id": "Qwen3-TTS",
      "label": "Qwen3 TTS",
      "voices": [
        {
          "id": "alloy",
          "name": "Alloy",
          "description": "Neutral and balanced",
          "language": "en"
        }
      ]
    }
  ]
}
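
To pick a voice programmatically, the response above can be filtered by language. This is a minimal sketch that assumes the response always has the shape shown; `fetch_voice_models` and `voices_by_language` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_voice_models(token, url="https://api.casola.ai/api/voice/models"):
    """Fetch the model list and return the decoded JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def voices_by_language(payload, language):
    """Return (model_id, voice_id) pairs whose voice language matches."""
    return [
        (model["model_id"], voice["id"])
        for model in payload.get("voice_models", [])
        for voice in model.get("voices", [])
        if voice.get("language") == language
    ]
```

For the sample response, `voices_by_language(payload, "en")` yields the single pair for Qwen3-TTS and alloy.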
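
The same request using the OpenAI Python SDK:

from openai import OpenAI

# Point the OpenAI SDK at Casola's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.casola.ai/openai/v1",
    api_key="YOUR_API_TOKEN",
)

# Stream the audio to disk as it arrives; calling stream_to_file on a
# non-streaming response is deprecated in current SDK versions.
with client.audio.speech.with_streaming_response.create(
    model="Qwen3-TTS",
    input="Hello! Welcome to Casola.",
    voice="alloy",
    response_format="mp3",
) as response:
    response.stream_to_file("speech.mp3")
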
  • Preview voices before generating long passages — each voice has a distinct tone and cadence.
  • Use punctuation deliberately to control pacing; a well-placed comma or ellipsis makes speech sound more natural.
  • For long-form content (articles, scripts), break text into shorter paragraphs and generate each separately for more consistent quality.
  • WAV or FLAC are best when you plan to edit the audio afterward; MP3 is fine for direct use.
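
The advice about long-form content can be sketched as a small splitter that generates one file per chunk. This is illustrative only: `max_chars=600` is an arbitrary value rather than a documented limit, `split_for_tts` and `synthesize_chunks` are hypothetical names, and `client` is assumed to be an OpenAI SDK client configured for Casola.

```python
import re


def split_for_tts(text, max_chars=600):
    """Split on blank lines, then pack whole sentences into chunks.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?…])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def synthesize_chunks(client, text, prefix="part"):
    """Generate one WAV file per chunk using an OpenAI SDK client."""
    for i, chunk in enumerate(split_for_tts(text), start=1):
        with client.audio.speech.with_streaming_response.create(
            model="Qwen3-TTS",
            voice="alloy",
            input=chunk,
            response_format="wav",
        ) as response:
            response.stream_to_file(f"{prefix}_{i:03d}.wav")
```

WAV is used here so the parts can be joined in an audio editor without re-encoding.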