Comprehensive Guide to Advanced Audio Transcription and Alignment with WhisperX
This guide delves into a sophisticated approach to audio transcription using WhisperX, focusing on detailed processes such as transcription, precise alignment, and generating word-level timestamps. We will cover environment setup, audio loading and preprocessing, and executing the entire workflow, from transcription to alignment and in-depth analysis, while optimizing for memory usage and enabling batch processing. Additionally, we demonstrate how to visualize outputs, export results in various formats, and extract key terms to enhance understanding of the audio content.
Setting Up the Environment and Configuration
First, we install WhisperX along with essential Python libraries like pandas, matplotlib, and seaborn. The system automatically detects if a CUDA-enabled GPU is available to leverage faster computation, selecting the appropriate data precision (float16 for GPU, int8 for CPU). We configure parameters such as batch size, model variant, and language preferences to tailor the transcription process.
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn
import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')
CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,  # None lets WhisperX auto-detect the language
}
print(f"🚀 Running on device: {CONFIG['device']}")
print(f"📊 Compute precision: {CONFIG['compute_type']}")
print(f"🎯 Model selected: {CONFIG['model_size']}")
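The device/precision rule above can be exercised on its own, without a GPU or even torch installed. The helper below is a hypothetical stand-in that mirrors the same logic: float16 halves memory and speeds inference on CUDA devices, while int8 quantization keeps CPU inference tractable.

```python
def pick_precision(has_cuda: bool) -> tuple:
    """Mirror the guide's selection rule: float16 on GPU, int8 on CPU."""
    device = "cuda" if has_cuda else "cpu"
    compute_type = "float16" if has_cuda else "int8"
    return device, compute_type

print(pick_precision(True))   # ('cuda', 'float16')
print(pick_precision(False))  # ('cpu', 'int8')
```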
Downloading and Preparing Audio for Transcription
We provide a utility to download a sample audio file for testing purposes. After loading the audio, we display key metadata such as filename, duration, and sample rate, and play the audio inline for quick verification.
def download_sample_audio():
    """Fetch a sample audio clip for demonstration."""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("✅ Sample audio successfully downloaded.")
    return "sample.mp3"

def load_and_inspect_audio(audio_path):
    """Load audio data and present basic information."""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000  # WhisperX loads audio at a 16 kHz sample rate
    print(f"📁 Audio file: {Path(audio_path).name}")
    print(f"⏱ Duration: {duration:.2f} seconds")
    print(f"🎵 Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration
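The duration arithmetic above is just samples divided by the sample rate. A minimal sketch, using a plain Python list of silence as a stand-in for the loaded waveform (`audio_duration` is a hypothetical helper, not part of WhisperX):

```python
SAMPLE_RATE = 16000  # WhisperX resamples every input to 16 kHz mono

def audio_duration(samples, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration in seconds of a 1-D waveform (list or array of samples)."""
    return len(samples) / sample_rate

# Three seconds of silence as a toy stand-in for a real waveform.
three_sec = [0.0] * (3 * SAMPLE_RATE)
print(f"{audio_duration(three_sec):.2f} s")  # 3.00 s
```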
Executing Batched Transcription with WhisperX
Using the configured model, we transcribe the audio in batches to optimize performance. The transcription output includes segmented text and language detection. After transcription, we clear memory caches to maintain efficiency.
def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Perform batched transcription on audio input."""
    print("\n🎤 STEP 1: Starting transcription...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_params = {"batch_size": CONFIG["batch_size"]}
    if language:
        transcribe_params["language"] = language
    result = model.transcribe(audio, **transcribe_params)
    total_segments = len(result["segments"])
    # Free model memory before the alignment step
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print("✅ Transcription completed!")
    print(f"   Detected language: {result['language']}")
    print(f"   Number of segments: {total_segments}")
    print(f"   Total characters transcribed: {sum(len(seg['text']) for seg in result['segments'])}")
    return result
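The transcription result is a plain dict with a `language` code and a list of `segments`, each carrying `start`, `end`, and `text`. A toy stand-in (all values made up) showing the tallies the function computes:

```python
# Toy stand-in for a WhisperX transcription result; the real dict has the
# same "segments" / "language" shape, with segment times in seconds.
result = {
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.1, "text": " Hello there."},
        {"start": 2.4, "end": 4.0, "text": " General Kenobi."},
    ],
}
total_segments = len(result["segments"])
total_chars = sum(len(seg["text"]) for seg in result["segments"])
print(total_segments, total_chars)  # 2 29
```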
Refining Transcription with Word-Level Alignment
To enhance timestamp accuracy, we align the transcription at the word level. This step loads a dedicated alignment model and applies it to the audio and transcription segments. The process reports the number of words successfully aligned and handles exceptions gracefully, falling back to segment-level timestamps if alignment fails.
def align_transcription(segments, audio, language_code):
    """Refine transcription by aligning words with precise timestamps."""
    print("\n🎯 STEP 2: Performing word-level alignment...")
    try:
        align_model, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        aligned_result = whisperx.align(
            segments,
            align_model,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        total_aligned_words = sum(len(seg.get("words", [])) for seg in aligned_result["segments"])
        del align_model
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print("✅ Alignment successful!")
        print(f"   Words aligned: {total_aligned_words}")
        return aligned_result
    except Exception as e:
        print(f"⚠️ Alignment error: {e}")
        print("   Proceeding with segment-level timestamps only.")
        return {"segments": segments, "word_segments": []}
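After alignment, each segment additionally carries a `words` list, where every word has its own start/end timestamps plus an alignment confidence score. A sketch with illustrative values (not real model output):

```python
# Shape of an aligned segment; every value here is made up for illustration.
aligned = {"segments": [{
    "start": 0.0, "end": 1.2, "text": " Hello world.",
    "words": [
        {"word": "Hello",  "start": 0.05, "end": 0.40, "score": 0.98},
        {"word": "world.", "start": 0.55, "end": 1.10, "score": 0.95},
    ],
}]}
for seg in aligned["segments"]:
    for w in seg["words"]:
        print(f"{w['start']:.2f}-{w['end']:.2f}  {w['word']}  ({w['score']:.2f})")
```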
In-Depth Analysis of Transcription Output
We generate comprehensive statistics to better understand the transcription’s characteristics. This includes total audio duration, segment count, word and character totals, speaking rate (words per minute), average pauses between segments, and average word duration. These metrics provide insights into the pacing and structure of the spoken content.
def analyze_transcription(result):
    """Compute and display detailed transcription statistics."""
    print("\n📊 TRANSCRIPTION ANALYSIS")
    print("=" * 70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total audio length: {total_duration:.2f} seconds")
    print(f"Number of segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        wpm = (total_words / total_duration) * 60
        print(f"Words per minute: {wpm:.1f}")
    pauses = [
        segments[i + 1]["start"] - segments[i]["end"]
        for i in range(len(segments) - 1)
        if segments[i + 1]["start"] > segments[i]["end"]
    ]
    if pauses:
        avg_pause = sum(pauses) / len(pauses)
        max_pause = max(pauses)
        print(f"Average pause between segments: {avg_pause:.2f} seconds")
        print(f"Longest pause: {max_pause:.2f} seconds")
    word_durations = [
        word["end"] - word["start"]
        for seg in segments if "words" in seg
        for word in seg["words"]
    ]
    if word_durations:
        avg_word_duration = sum(word_durations) / len(word_durations)
        print(f"Average word duration: {avg_word_duration:.3f} seconds")
    print("=" * 70)
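The words-per-minute and pause metrics can be checked on a toy segment list (values invented for the example):

```python
# Two toy segments with a 0.5 s gap between them.
segments = [
    {"start": 0.0, "end": 2.0, "words": [{}, {}, {}, {}]},   # 4 words
    {"start": 2.5, "end": 4.0, "words": [{}, {}]},           # 2 words
]
total_duration = max(seg["end"] for seg in segments)         # 4.0 s
total_words = sum(len(seg["words"]) for seg in segments)     # 6
wpm = total_words / total_duration * 60                      # 90.0
pauses = [
    segments[i + 1]["start"] - segments[i]["end"]
    for i in range(len(segments) - 1)
    if segments[i + 1]["start"] > segments[i]["end"]
]
print(wpm, pauses)  # 90.0 [0.5]
```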
Presenting and Exporting Transcription Results
We format the transcription data into clear, tabular displays, optionally showing word-level details. The results can be exported in multiple widely-used formats including JSON, SRT, VTT, TXT, and CSV, each preserving timestamps and text for various use cases such as subtitle generation or further analysis.
def display_results(result, show_words=False, max_rows=50):
    """Render transcription data in a structured table format."""
    data = []
    for seg in result["segments"]:
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": f"{seg['start']:.2f}s",
                "End": f"{seg['end']:.2f}s",
                "Duration": f"{seg['end'] - seg['start']:.2f}s",
                "Text": seg["text"].strip()
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Displaying first {max_rows} rows out of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df
def export_results(result, output_dir="output", filename="transcript"):
    """Save transcription outputs in multiple file formats."""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in result["segments"]:
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    csv_data = [{"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()} for seg in result["segments"]]
    pd.DataFrame(csv_data).to_csv(csv_path, index=False)
    print(f"\n💾 Exported files to '{output_dir}/':")
    print(f"   ✓ {filename}.json (structured data)")
    print(f"   ✓ {filename}.srt (subtitle format)")
    print(f"   ✓ {filename}.vtt (web video subtitles)")
    print(f"   ✓ {filename}.txt (plain transcript)")
    print(f"   ✓ {filename}.csv (timestamps and text)")
def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format (HH:MM:SS,mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format (HH:MM:SS.mmm)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
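To sanity-check the timestamp arithmetic, here is the same floor-division and modulo logic as the SRT formatter above, reproduced as a standalone function with worked inputs:

```python
def fmt_srt(seconds: float) -> str:
    """Seconds -> HH:MM:SS,mmm, mirroring the SRT helper's arithmetic."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# 3661.5 s = 1 h + 1 min + 1.5 s
print(fmt_srt(3661.5))  # 01:01:01,500
print(fmt_srt(59.25))   # 00:00:59,250
```

Note that SRT separates milliseconds with a comma while WebVTT uses a period; that is the only difference between the two formatters.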
Batch Processing Multiple Audio Files
For handling large datasets, we implement batch processing that sequentially transcribes and aligns multiple audio files. Each file’s results are exported individually, and errors are logged without interrupting the entire batch.
def batch_process_files(audio_files, output_dir="batch_output"):
    """Transcribe and align multiple audio files in sequence."""
    print(f"\n📦 Starting batch processing for {len(audio_files)} files...")
    results = {}
    for idx, audio_path in enumerate(audio_files, 1):
        print(f"\n[{idx}/{len(audio_files)}] Processing file: {Path(audio_path).name}")
        try:
            result, _ = process_audio_file(audio_path, show_output=False)
            results[audio_path] = result
            filename = Path(audio_path).stem
            export_results(result, output_dir, filename)
        except Exception as e:
            print(f"❌ Failed to process {audio_path}: {e}")
            results[audio_path] = None
    print(f"\n✅ Batch processing finished. Total files processed: {len(results)}")
    return results
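The batch helper expects a list of paths. One way to build that list is to scan a directory for supported extensions; `collect_audio_files` below is a hypothetical helper (the extension set is an assumption), demonstrated here against a throwaway directory of empty dummy files:

```python
from pathlib import Path
import tempfile

def collect_audio_files(root: str) -> list:
    """Gather audio files under a directory, sorted by path."""
    exts = {".mp3", ".wav", ".m4a", ".flac"}
    return sorted(str(p) for p in Path(root).rglob("*") if p.suffix.lower() in exts)

# Demo: two audio files and one text file in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("a.mp3", "b.wav", "notes.txt"):
        Path(d, name).touch()
    print(collect_audio_files(d))  # the .mp3 and .wav paths; notes.txt is skipped
```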
Extracting Key Terms from Transcriptions
To identify the most frequent and meaningful words in the transcript, we extract keywords by filtering out common stop words and counting occurrences. This helps in summarizing the main topics or themes present in the audio.
def extract_keywords(result, top_n=10):
    """Identify the most frequent significant words in the transcript."""
    from collections import Counter
    import re
    full_text = " ".join(seg["text"] for seg in result["segments"])
    # Extract word tokens (letters, digits, underscores)
    words = re.findall(r'\b\w+\b', full_text.lower())
    stop_words = {
        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
        'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
        'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'
    }
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
    word_counts = Counter(filtered_words).most_common(top_n)
    print(f"\n🔑 Top {top_n} Keywords:")
    for word, count in word_counts:
        print(f"   {word}: {count}")
    return word_counts
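The tokenize-filter-count pattern works on any string, so it can be tried without a WhisperX result. A self-contained mini version with a smaller stop-word set (the sentence and helper name are made up for the example):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "to", "of", "is", "in", "that"}

def top_keywords(text: str, n: int = 3) -> list:
    """Tokenize, drop stop words and short tokens, count the rest."""
    words = re.findall(r"\b\w+\b", text.lower())
    kept = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return Counter(kept).most_common(n)

sample = "The model aligns words, and the model scores each word."
print(top_keywords(sample))  # [('model', 2), ('aligns', 1), ('words', 1)]
```

`Counter.most_common` breaks ties by first-insertion order, so "aligns" precedes "words" here.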
Complete WhisperX Transcription Pipeline
This function orchestrates the entire process: loading audio, transcribing, aligning, analyzing, displaying, and exporting results. It supports toggling output display and analysis features, making it adaptable for different use cases.
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Run the full WhisperX transcription and alignment workflow."""
    if show_output:
        print("=" * 70)
        print("🎵 WhisperX Advanced Transcription Pipeline")
        print("=" * 70)
    audio, duration = load_and_inspect_audio(audio_path)
    transcription_result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
    aligned_result = align_transcription(
        transcription_result["segments"],
        audio,
        transcription_result["language"]
    )
    if analyze and show_output:
        analyze_transcription(aligned_result)
        extract_keywords(aligned_result)
    if show_output:
        print("\n" + "=" * 70)
        print("📋 TRANSCRIPTION OUTPUT")
        print("=" * 70)
        df = display_results(aligned_result, show_words=False)
        export_results(aligned_result)
    else:
        df = None
    return aligned_result, df
Getting Started
To begin, uncomment and run any of the following examples:
- Process the sample audio file:
  audio_path = download_sample_audio()
  result, df = process_audio_file(audio_path)
- Display word-level transcription details:
  word_df = display_results(result, show_words=True)
- Transcribe your own audio file:
  audio_path = "your_audio_file.wav"
  result, df = process_audio_file(audio_path)
- Batch process multiple audio files:
  audio_files = ["file1.mp3", "file2.wav", "file3.m4a"]
  results = batch_process_files(audio_files)
- Use a larger WhisperX model for improved accuracy:
  CONFIG["model_size"] = "large-v2"
  result, df = process_audio_file("audio.mp3")
✨ Setup is complete! Customize and extend this pipeline to suit your transcription and audio analysis projects.
Explore this powerful workflow to transform raw audio into rich, timestamped transcripts with insightful analytics. Whether for research, content creation, or accessibility, this end-to-end solution offers flexibility and precision for modern audio processing needs.

