Transcription

Casola transcribes spoken audio into text with timestamps and segment data. Navigate to /transcribe in Studio to get started.

Three input methods are available:

Upload a file — Drag and drop an audio file onto the upload area, or click to browse. Supports all standard audio formats (MP3, WAV, M4A, OGG, etc.) up to 25 MB.

Record audio — Click the record button to capture audio directly from your microphone. A playback preview appears when you stop recording so you can verify before submitting.

Audio URL — Paste a direct HTTPS link to an audio file hosted online.
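
The constraints above (a 25 MB upload limit, common audio extensions, and HTTPS-only links) can be checked on the client before submitting. The following is a minimal sketch, not Casola's own validation: the helper names are hypothetical, the extension set covers only the formats named above, and it assumes "25 MB" means 25 × 1024 × 1024 bytes.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumption: "25 MB" is read as 25 MiB; the documentation does not specify.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
# Only the formats named in the text; the service may accept more.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def check_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems with a local file; an empty list means none found."""
    problems = []
    if Path(filename).suffix.lower() not in KNOWN_EXTENSIONS:
        problems.append("extension not in the documented examples")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 25 MB limit")
    return problems


def is_direct_https_url(url: str) -> bool:
    """True when the URL uses the https scheme and names a host."""
    parts = urlparse(url)
    return parts.scheme == "https" and bool(parts.netloc)
```

A file that fails the extension check may still be accepted by the service, so treat that result as a warning rather than a hard stop.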

Model — Casola currently supports Whisper Large v3 for speech-to-text. The model selector shows availability status. See the Models reference for the latest options.

Language — Choose the spoken language or leave it on Auto-detect (the default). Specifying the language can improve accuracy for non-English audio. Supported languages include English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, and Chinese.
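
When calling the API, the language is passed as a short code ("en" in the request examples on this page). The sketch below maps the language names listed above to their ISO 639-1 codes. It assumes the API expects these two-letter codes for every language and that leaving the field out keeps auto-detection on; the helper name is hypothetical.

```python
from typing import Optional

# ISO 639-1 codes for the languages named in the text.
LANGUAGE_CODES = {
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "italian": "it",
    "portuguese": "pt",
    "russian": "ru",
    "japanese": "ja",
    "korean": "ko",
    "chinese": "zh",
}


def language_field(choice: Optional[str]) -> dict:
    """Return {"language": code} for a named language, or {} for auto-detect.

    Assumption: omitting "language" from the request leaves auto-detection on.
    """
    if choice is None or choice.strip().lower() in ("", "auto", "auto-detect"):
        return {}
    return {"language": LANGUAGE_CODES[choice.strip().lower()]}
```

The returned dictionary can be merged into a request payload, for example `payload.update(language_field("Spanish"))`.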

Response format — Select how the transcription is structured:

| Format | Description | Best for |
| --- | --- | --- |
| JSON | Full response with segments, timestamps, and metadata | Programmatic use |
| Text | Plain text transcription | Quick reading, copy-paste |
| SRT | SubRip subtitle format with timestamps | Video subtitles |
| VTT | WebVTT subtitle format with timestamps | Web video players |

Completed transcriptions display the full text along with metadata: word count, character count, detected language, and audio duration.
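
Similar metadata can be derived from an API response. This is a rough sketch only: how Studio counts words is not documented, so the whitespace split here is an assumption, and `summarize` is a hypothetical helper. The field names follow the verbose JSON response shown on this page.

```python
def summarize(response: dict) -> dict:
    """Derive display metadata from a transcription response dictionary."""
    text = response.get("text", "")
    return {
        # Assumption: a "word" is any whitespace-separated token.
        "word_count": len(text.split()),
        "character_count": len(text),
        "language": response.get("language"),
        "duration_seconds": response.get("duration"),
    }


sample = {
    "language": "en",
    "duration": 125.4,
    "text": "Welcome to today's meeting. Let's start with the agenda...",
}
summary = summarize(sample)
```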

When the model returns segments (available with JSON format), each segment is shown with its start and end timestamps — useful for aligning text to specific moments in the audio.
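
As an illustration of aligning text to a moment in the audio, the sketch below looks up the segment that covers a given playback time. `segment_at` is a hypothetical helper, and the segment dictionaries use the start, end, and text fields from the example response on this page.

```python
from typing import Optional


def segment_at(segments: list, t: float) -> Optional[dict]:
    """Return the first segment whose time range contains t (seconds), else None."""
    for seg in segments:
        if seg["start"] <= t <= seg["end"]:
            return seg
    return None


segments = [
    {"start": 0.0, "end": 3.2, "text": "Welcome to today's meeting."},
    {"start": 3.5, "end": 6.1, "text": "Let's start with the agenda..."},
]
```

Times that fall in a pause between segments (3.3 seconds in this sample) return `None`.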

You can toggle between a formatted text view and a raw JSON view when using the JSON format.

  • Copy — Click the copy button to place the full transcription on your clipboard
  • Download — Save the transcription as a file matching your chosen format (.json, .txt, .srt, or .vtt)
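
When saving API output yourself, the same format-to-extension pairing can be applied. This is a small sketch with a hypothetical helper name. The "json", "verbose_json", and "srt" values appear in the API examples on this page, while "text" and "vtt" are assumed to follow the same naming.

```python
from pathlib import Path

# response_format value -> file extension, mirroring the download options above.
# Assumption: "text" and "vtt" are valid response_format values in the API.
EXTENSIONS = {
    "json": ".json",
    "verbose_json": ".json",
    "text": ".txt",
    "srt": ".srt",
    "vtt": ".vtt",
}


def transcript_filename(audio_name: str, response_format: str) -> str:
    """Swap the audio file's extension for the one matching the response format."""
    return Path(audio_name).with_suffix(EXTENSIONS[response_format]).name
```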

All transcription results are automatically saved to your Library.

curl https://api.casola.ai/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-large-v3",
    "audio_url": "https://example.com/meeting-recording.mp3",
    "language": "en",
    "response_format": "verbose_json"
  }'

Response:

{
  "task": "transcribe",
  "language": "en",
  "duration": 125.4,
  "text": "Welcome to today's meeting. Let's start with the agenda...",
  "segments": [
    {
      "start": 0.0,
      "end": 3.2,
      "text": "Welcome to today's meeting."
    },
    {
      "start": 3.5,
      "end": 6.1,
      "text": "Let's start with the agenda..."
    }
  ]
}
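
The same URL-based request can be built in Python with only the standard library. The sketch below mirrors the curl call (endpoint, headers, and body fields are taken from it), and `build_request` is a hypothetical helper. The network call is left commented out so the snippet has no side effects.

```python
import json
import urllib.request

API_URL = "https://api.casola.ai/openai/v1/audio/transcriptions"


def build_request(token, audio_url, response_format="verbose_json", language=None):
    """Build (but do not send) a POST request for a URL-based transcription."""
    payload = {
        "model": "whisper-large-v3",
        "audio_url": audio_url,
        "response_format": response_format,
    }
    if language:
        # Only include the field when a language was chosen.
        payload["language"] = language
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it (requires a valid token and network access):
# req = build_request("YOUR_API_TOKEN", "https://example.com/meeting-recording.mp3", language="en")
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```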

Use multipart/form-data to upload an audio file directly:

curl https://api.casola.ai/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F model="whisper-large-v3" \
  -F file=@recording.mp3 \
  -F response_format="json"

Response:

{
  "text": "Welcome to today's meeting. Let's start with the agenda..."
}

Set response_format to srt to receive subtitles instead of JSON:

curl https://api.casola.ai/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-large-v3",
    "audio_url": "https://example.com/video-audio.mp3",
    "response_format": "srt"
  }'

Response (plain text):

1
00:00:00,000 --> 00:00:03,200
Welcome to today's meeting.

2
00:00:03,500 --> 00:00:06,100
Let's start with the agenda...
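
If you already have a verbose JSON response, the same subtitle text can be produced locally from its segments. The following is a minimal sketch with hypothetical helper names, not Casola's own converter. It follows the SubRip layout shown above: a cue number, a timestamp line in HH:MM:SS,mmm form, the text, and a blank line between cues.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segments (dicts with start, end, text) as SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Joining with a newline leaves one blank line between cues.
    return "\n".join(cues)


srt_text = segments_to_srt([
    {"start": 0.0, "end": 3.2, "text": "Welcome to today's meeting."},
    {"start": 3.5, "end": 6.1, "text": "Let's start with the agenda..."},
])
```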

Using the OpenAI Python SDK:

from openai import OpenAI

# Point the OpenAI client at Casola's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.casola.ai/openai/v1",
    api_key="YOUR_API_TOKEN",
)

# From file
with open("recording.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",
    )

# Print the full text, then each segment with its start time
print(transcript.text)
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

Transcription speed depends on the length of the audio. Most files under 5 minutes complete within 10–30 seconds. Longer recordings may take proportionally more time.

  • For best accuracy, use audio with clear speech and minimal background noise.
  • Specify the language explicitly when transcribing non-English audio — auto-detection works well but a language hint improves results.
  • Use SRT or VTT format if you need subtitles for a video project.
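  • The JSON format includes the richest data (segments with timestamps) and is the best choice when you need to process the output programmatically. In the API examples above, segments are returned when response_format is verbose_json, while json returns only the text.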