Building a Speech Enhancement and Automatic Speech Recognition (ASR) Pipeline in Python Using SpeechBrain

September 11, 2025

Building an Advanced Speech Enhancement and Recognition Pipeline with SpeechBrain

This guide walks through a speech-processing workflow built on SpeechBrain. We synthesize clean speech samples with gTTS, add noise to mimic real-world acoustic conditions, and then apply SpeechBrain’s MetricGAN+ model to improve audio clarity. After enhancement, we run automatic speech recognition (ASR) with a CRDNN model that uses language model rescoring and compare word error rates (WER) before and after denoising. The step-by-step approach shows how SpeechBrain supports a complete speech enhancement and recognition pipeline with little code.

Setting Up the Environment and Dependencies

First, we prepare our environment by installing essential libraries such as SpeechBrain, gTTS, and audio processing tools. We configure paths and parameters, and detect whether a GPU is available to accelerate processing. This setup ensures a smooth foundation for building our speech pipeline.

!pip install -q -U speechbrain gTTS jiwer pydub librosa soundfile torchaudio
!apt-get -qq install -y ffmpeg >/dev/null

import os
import time
import warnings
import torch
import torchaudio
import numpy as np
import librosa
import soundfile as sf
from gtts import gTTS
from pydub import AudioSegment
from jiwer import wer
from pathlib import Path
from dataclasses import dataclass
from typing import List, Tuple
from IPython.display import Audio, display
from speechbrain.pretrained import EncoderDecoderASR, SpectralMaskEnhancement

warnings.filterwarnings("ignore")

root = Path("speechbraindemo")
root.mkdir(exist_ok=True)
samplerate = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"

Utility Functions for Speech Synthesis, Noise Injection, and Playback

We define helper functions to streamline our workflow. The ttstowav function converts text to speech using gTTS and saves it as a WAV file. The addnoise function adds Gaussian noise at a specified signal-to-noise ratio (SNR) to simulate challenging listening environments. Additional utilities allow audio playback within notebooks and text normalization for accurate evaluation. A Sample dataclass organizes each utterance’s clean, noisy, and enhanced audio file paths.

def ttstowav(text: str, outputwav: str, lang="en"):
    tempmp3 = outputwav.replace(".wav", ".mp3")
    gTTS(text=text, lang=lang).save(tempmp3)
    # Decode the MP3, downmix to mono, and resample to the pipeline sample rate
    audio = AudioSegment.from_file(tempmp3, format="mp3").set_channels(1).set_frame_rate(samplerate)
    audio.export(outputwav, format="wav")
    os.remove(tempmp3)

def addnoise(inputwav: str, snrdb: float, outputwav: str):
    y, _ = librosa.load(inputwav, sr=samplerate, mono=True)
    rmssignal = np.sqrt(np.mean(y**2) + 1e-12)
    noise = np.random.normal(0, 1, len(y))
    noise = noise / (np.sqrt(np.mean(noise**2) + 1e-12))  # normalize noise to unit RMS
    targetnoiserms = rmssignal / (10 ** (snrdb / 20))  # noise RMS that yields the requested SNR in dB
    noisysignal = np.clip(y + noise * targetnoiserms, -1.0, 1.0)
    sf.write(outputwav, noisysignal, samplerate)

def playaudio(title: str, filepath: str):
    print(f"▶️ {title}: {filepath}")
    display(Audio(filepath, rate=samplerate))

def normalizetext(text: str) -> str:
    return " ".join("".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text).split())

@dataclass
class Sample:
    text: str
    cleanwav: str
    noisywav: str
    enhancedwav: str
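
The SNR scaling used by addnoise can be checked in isolation. The following is a minimal sketch, assuming a synthetic 440 Hz tone in place of gTTS audio; the names mix_at_snr and measured_snr_db are hypothetical helpers introduced for illustration, and clipping is left out so the measured SNR reflects only the scaling step.

```python
import numpy as np

def mix_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the signal-to-noise ratio is snr_db."""
    signal_rms = np.sqrt(np.mean(signal ** 2) + 1e-12)
    noise = rng.normal(0.0, 1.0, len(signal))
    noise = noise / np.sqrt(np.mean(noise ** 2) + 1e-12)  # unit-RMS noise
    noise = noise * signal_rms / (10 ** (snr_db / 20))    # dB to amplitude ratio
    return signal + noise, noise

def measured_snr_db(signal, noise):
    """SNR in dB computed from the power of the signal and of the added noise."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

# Assumption: a one-second synthetic tone stands in for a real utterance.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
tone = 0.1 * np.sin(2 * np.pi * 440 * t)

for target in (0.0, 5.0, 20.0):
    _, added_noise = mix_at_snr(tone, target, rng)
    print(f"target {target:5.1f} dB, measured {measured_snr_db(tone, added_noise):.2f} dB")
```

Because the noise is normalized to unit RMS before scaling, the measured value should track the target closely. In addnoise itself, the final np.clip can lower the effective SNR slightly when the mixture exceeds the [-1, 1] range.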

Generating Speech Samples and Introducing Noise

We create a set of example sentences, synthesize their clean speech versions, and generate noisy counterparts by adding noise at varying SNR levels. These samples are stored in Sample objects for easy management throughout the pipeline.

sentences = [
    "Machine learning is revolutionizing technology.",
    "Open-source frameworks accelerate innovation worldwide.",
    "SpeechBrain simplifies building speech applications in Python."
]

samples: List[Sample] = []

print("🗣️ Generating speech samples with gTTS...")
for idx, sentence in enumerate(sentences, 1):
    cleanpath = str(root / f"clean{idx}.wav")
    noisypath = str(root / f"noisy{idx}.wav")
    enhancedpath = str(root / f"enhanced{idx}.wav")
    ttstowav(sentence, cleanpath)
    noiselevel = 5.0 if idx % 2 == 1 else 0.0  # SNR in dB: 5 dB for odd-numbered samples, 0 dB (noisier) for even ones
    addnoise(cleanpath, snrdb=noiselevel, outputwav=noisypath)
    samples.append(Sample(text=sentence, cleanwav=cleanpath, noisywav=noisypath, enhancedwav=enhancedpath))

playaudio("Clean Sample #1", samples[0].cleanwav)
playaudio("Noisy Sample #1", samples[0].noisywav)

Loading Pretrained Models for Enhancement and Recognition

We load SpeechBrain’s pretrained MetricGAN+ model for speech enhancement and a CRDNN-based ASR model with language model rescoring. These models provide the core functionality to denoise audio and transcribe speech with improved accuracy.

print("⬇️ Downloading pretrained models...")
asrmodel = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    run_opts={"device": device},
    savedir=str(root / "pretrainedasr"),
)

enhancementmodel = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    run_opts={"device": device},
    savedir=str(root / "pretrainedenhancement"),
)

Enhancing Audio and Evaluating Recognition Performance

We define functions to enhance noisy audio files, transcribe speech, and calculate WER by comparing transcriptions to reference text. Processing each sample, we measure the impact of enhancement on recognition accuracy and record inference times.

def enhanceaudio(inputwav: str, outputwav: str):
    enhancedsignal = enhancementmodel.enhance_file(inputwav)
    if enhancedsignal.dim() == 1:
        enhancedsignal = enhancedsignal.unsqueeze(0)
    torchaudio.save(outputwav, enhancedsignal.cpu(), samplerate)

def transcribeaudio(wavpath: str) -> str:
    transcription = asrmodel.transcribe_file(wavpath)
    return normalizetext(transcription)

def evaluatetranscription(reference: str, wavpath: str) -> Tuple[str, float]:
    hypothesis = transcribeaudio(wavpath)
    errorrate = wer(normalizetext(reference), hypothesis)
    return hypothesis, errorrate

print("🔬 Comparing ASR results on noisy vs enhanced audio...")
results = []
starttime = time.time()

for sample in samples:
    enhanceaudio(sample.noisywav, sample.enhancedwav)
    hypnoisy, wernoisy = evaluatetranscription(sample.text, sample.noisywav)
    hypenhanced, werenhanced = evaluatetranscription(sample.text, sample.enhancedwav)
    results.append((sample.text, hypnoisy, wernoisy, hypenhanced, werenhanced))

endtime = time.time()
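
To make the WER figure concrete, here is a minimal sketch of what the metric measures: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names normalize and word_error_rate are hypothetical and written for illustration. The pipeline itself relies on jiwer, whose handling of edge cases such as an empty reference may differ from this sketch.

```python
def normalize(text: str) -> str:
    """Lowercase and replace punctuation with spaces, mirroring normalizetext."""
    kept = (ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join("".join(kept).split())

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    # Assumption: an empty reference is scored against a length of 1 to avoid dividing by zero
    return prev[-1] / max(len(ref), 1)

reference = normalize("Machine learning is revolutionizing technology.")
# Hypothetical ASR output, not a result from the models above
hypothesis = normalize("machine learning is revolutionising technology")
print(word_error_rate(reference, hypothesis))  # one substitution over five words: 0.2
```

Normalizing both strings first matters: without it, casing and the trailing period would count as word errors even when the recognizer got every word right.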

Presenting Results and Insights

We display detailed transcription results for each utterance, including WER before and after enhancement, and report total inference time. We also time a single pass over all clean and noisy files; note that this loop calls the recognizer once per file rather than decoding one padded batch. Listening to the enhanced audio samples gives a qualitative sense of what the enhancement model changes.

def formatfloat(value):
    return f"{value:.3f}" if isinstance(value, float) else value

print(f"⏱️ Total inference time: {endtime - starttime:.2f} seconds on {device.upper()}")
print("\n# --- Recognition Results (Noisy → Enhanced) ---")

for idx, (ref, noisyhyp, noisywer, enhhyp, enhwer) in enumerate(results, 1):
    print(f"\nUtterance {idx}")
    print("Reference:      ", ref)
    print("Noisy ASR:     ", noisyhyp)
    print("WER (Noisy):   ", formatfloat(noisywer))
    print("Enhanced ASR:  ", enhhyp)
    print("WER (Enhanced):", formatfloat(enhwer))

print("\n🔄 Batch decoding clean and noisy samples:")
batchfiles = [s.cleanwav for s in samples] + [s.noisywav for s in samples]
batchstart = time.time()
batchtranscriptions = [transcribeaudio(f) for f in batchfiles]
batchend = time.time()

for filepath, transcription in zip(batchfiles, batchtranscriptions):
    print(f"{os.path.basename(filepath)} → {transcription[:80]}{'...' if len(transcription) > 80 else ''}")

print(f"⏳ Batch decoding time: {batchend - batchstart:.2f} seconds")

playaudio("Enhanced Sample #1 (MetricGAN+)", samples[0].enhancedwav)

avgwernoisy = sum(r[2] for r in results) / len(results)
avgwerenhanced = sum(r[4] for r in results) / len(results)

print("\n📊 Summary:")
print(f"Average WER (Noisy):    {avgwernoisy:.3f}")
print(f"Average WER (Enhanced): {avgwerenhanced:.3f}")
print("Tip: Experiment with different noise levels, longer sentences, or switch to GPU for faster processing.")
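
The timing loop in this section transcribes files one at a time. Decoding several utterances in a single forward pass generally requires padding them to a common length and telling the model how much of each row is real audio. The sketch below shows only that padding step with NumPy; pad_batch is a hypothetical helper, and the toy arrays stand in for loaded waveforms.

```python
import numpy as np

def pad_batch(waveforms):
    """Zero-pad 1-D waveforms to a common length.

    Returns (batch, rel_lens): batch has shape [n, max_len], and rel_lens holds
    each waveform's length as a fraction of the longest one.
    """
    max_len = max(len(w) for w in waveforms)
    batch = np.zeros((len(waveforms), max_len), dtype=np.float32)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w
    rel_lens = np.array([len(w) / max_len for w in waveforms], dtype=np.float32)
    return batch, rel_lens

# Assumption: toy arrays replace audio loaded at 16 kHz
demo = [np.ones(4, dtype=np.float32), np.ones(2, dtype=np.float32)]
batch, rel_lens = pad_batch(demo)
print(batch.shape, rel_lens)
```

To our understanding, SpeechBrain's EncoderDecoderASR exposes a transcribe_batch(wavs, wav_lens) method that accepts a padded tensor and relative lengths in this layout. Treat that call as an assumption and check it against the installed version before converting these arrays with torch.from_numpy and passing them in.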

Conclusion: Harnessing SpeechBrain for Robust Speech Processing

This tutorial shows how speech enhancement and recognition can be combined in a single pipeline using SpeechBrain. By synthesizing audio, simulating noisy environments, applying MetricGAN+ for denoising, and transcribing with a pretrained ASR model, we can measure how much enhancement changes transcription accuracy under adverse conditions. The open-source framework offers a flexible foundation for expanding to larger datasets, experimenting with alternative enhancement techniques, or customizing ASR models for specific applications.
