Nvidia launches open-source transcription AI model Parakeet TDT 0.6B v2 on Hugging Face


Credit: VentureBeat created with Midjourney



Nvidia is now the world’s leading AI company. In recent years it has become one of the world’s most valuable companies, as the stock market has recognized the enormous demand for its graphics processing units (GPUs), the powerful chips used not only to render graphics in games but also to train AI large language models and diffusion models.

But Nvidia is about more than hardware. As the generative AI age wears on, the Santa Clara-based company has been steadily releasing its own AI models, most of them open source and free to download, modify, and use commercially. The latest among them is Parakeet TDT 0.6B v2, an automatic speech recognition (ASR) model that, in the words of Hugging Face’s Vaibhav Srivastav, can “transcribe 60 minutes of audio in one second [mind blown emoji].”

Nvidia’s Parakeet family of models was first unveiled in January 2024 and updated again in April of that year. This new release, version 2, is even more powerful: on the Hugging Face Open ASR Leaderboard it posts a word error rate (the share of spoken words the model transcribes incorrectly) of just 6.05 out of 100.

That puts it close behind proprietary transcription models such as OpenAI’s GPT-4o-transcribe (with an error rate of 2.46% on English) and ElevenLabs Scribe (3.3%).
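For context, word error rate is typically computed as the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. Below is a minimal sketch of that calculation using the open-source jiwer package and a made-up sentence pair; neither is part of Nvidia’s tooling, they simply illustrate the metric:

```python
# Illustrative WER calculation: (substitutions + deletions + insertions)
# divided by the number of words in the reference transcript.
from jiwer import wer

reference = "transcribe sixty minutes of audio in one second"
hypothesis = "transcribe sixty minutes of audio in one minute"

# One substitution ("second" -> "minute") over 8 reference words = 0.125,
# i.e. a WER of 12.5%. Parakeet TDT 0.6B v2's reported 6.05% corresponds to
# roughly 6 errors per 100 spoken words across the leaderboard test sets.
print(wer(reference, hypothesis))
```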

It does all this while being freely available under a commercially permissive license: the Creative Commons CC-BY-4.0 license allows commercial enterprises and independent developers alike to build speech recognition and transcription into their paid applications.

Performance and benchmark standing

The model packs 600 million parameters and combines a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder. Running on Nvidia GPU-accelerated hardware, it can transcribe an hour of audio in roughly one second.

That performance corresponds to an RTFx (inverse real-time factor) of 3386.02 at a batch size of 128, which places it at the top of current ASR benchmarks maintained by Hugging Face.
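To put that figure in perspective, and assuming RTFx is reported as the ratio of audio duration to processing time, an RTFx of roughly 3,386 implies an hour of audio is processed in about one second:

```python
# Back-of-the-envelope check on the "hour of audio in about a second" claim,
# assuming RTFx = audio duration / processing time.
audio_seconds = 60 * 60   # one hour of audio
rtfx = 3386.02            # reported inverse real-time factor (batch size 128)

processing_seconds = audio_seconds / rtfx
print(f"{processing_seconds:.2f} s")  # ~1.06 s
```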

Use cases and availability

Parakeet TDT 0.6B v2 was released globally on May 1, 2025. It is designed for developers, researchers, and industry teams building applications such as transcription services, voice assistants, and conversational AI platforms.

The model includes punctuation, capitalization, and word-level timestamping out of the box, making it a complete transcription package for a wide range of speech-to-text needs.

Developers can deploy the model using Nvidia’s NeMo toolkit. The setup process works with Python and PyTorch, and the model can be used as-is or fine-tuned for a particular domain.
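As a rough sketch of what that setup looks like, assuming the nemo_toolkit[asr] package is installed, a local WAV file named meeting_recording.wav, and the Hugging Face model id nvidia/parakeet-tdt-0.6b-v2 (the exact API may vary between NeMo releases):

```python
# Minimal sketch: load the released checkpoint via NeMo and transcribe a file.
# Assumes: pip install -U "nemo_toolkit[asr]" and a local 16 kHz WAV file.
import nemo.collections.asr as nemo_asr

# Pull the checkpoint from Hugging Face (model id assumed from the release).
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Transcribe one or more audio files; the model returns punctuated,
# capitalized text and can also emit word-level timestamps.
outputs = asr_model.transcribe(["meeting_recording.wav"])
print(outputs[0])
```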

The open-source CC-BY-4.0 license allows commercial use, making the model attractive to startups and enterprises alike.

Training data and model creation

Parakeet TDT 0.6B v2 was trained on a large and diverse corpus known as the Granary dataset, comprising approximately 120,000 hours of English audio: about 10,000 hours of high-quality human-transcribed data and 110,000 hours of pseudo-labeled speech.

Sources include well-known datasets such as LibriSpeech, Mozilla Common Voice, YouTube-Commons, and Librilight.

Nvidia intends to make the Granary dataset public after its presentation at Interspeech 2025.

Evaluation and robustness

The model was evaluated against multiple English-language ASR benchmarks, including AMI, Earnings22, GigaSpeech, and SPGISpeech, and showed strong generalization. It remains robust across a variety of noise conditions and performs well with telephony-style audio formats, with only modest degradation at lower signal-to-noise ratios.

Hardware compatibility and efficiency

Parakeet TDT 0.6B v2 has been optimized for Nvidia GPU environments, supporting hardware such as A100, H100, and V100 GPUs.

While high-end GPUs maximize performance, the model can also be loaded on systems with as little as 2 GB of RAM, enabling a wider range of deployment scenarios.

Ethics and responsible use

Nvidia states that the model was developed without the use of personal data and adheres to the company’s responsible AI framework.

Although no specific measures were taken to mitigate demographic bias, the model met Nvidia’s internal quality standards and ships with detailed documentation covering its training process, dataset origins, and privacy compliance.

The release attracted attention from the machine learning and open-source communities, particularly after it was highlighted publicly on social media. Commentators praised the model’s ability to outperform commercial ASR alternatives while remaining open source and commercially usable. Developers interested in the model can try it on Hugging Face or through Nvidia’s NeMo toolkit, where installation instructions, demo scripts, and integration guidance are readily available for experimentation and deployment.
