YourBench: Beyond generic benchmarks

Every AI model release comes with charts touting how it outperformed its competitors on this benchmark test or that evaluation metric.

However, these benchmarks test general capabilities. It is difficult to determine how well models and LLM-based agents understand the specific needs of the organizations that want to use them.

Model repository Hugging Face has launched YourBench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced YourBench on X. The tool allows "custom benchmarking, and synthetic data generation based on ANY of your documents." It is an important step toward improving how model evaluations are done.

He added that Hugging Face knows “that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.”


Creating custom evaluations

YourBench requires organizations to pre-process their documents before it can work. This involves three steps:

  • Document ingestion to "normalize" file formats.
  • Semantic chunking to break down documents to meet context window limitations and focus the model's attention.
  • Document summarization.
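The pre-processing steps above can be sketched roughly as follows. This is an illustrative sketch, not YourBench's actual implementation: the function names are hypothetical, whitespace normalization stands in for real file-format conversion, fixed-size word windows stand in for semantic chunking, and a trivial first-sentence extractor stands in for LLM-based summarization.

```python
# Hypothetical sketch of a document pre-processing pipeline
# (ingest -> chunk -> summarize); not the YourBench API.

def ingest(raw: str) -> str:
    """Normalize a document. A real pipeline would also convert
    PDF/HTML/DOCX into plain text; here we only collapse whitespace."""
    return " ".join(raw.split())

def chunk(text: str, max_words: int = 200) -> list[str]:
    """Naive stand-in for semantic chunking: split into fixed-size word
    windows so each piece fits a model's context window. Real semantic
    chunking would split on topic boundaries instead."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize(chunks: list[str]) -> str:
    """Placeholder summary: first sentence of each chunk. YourBench
    would call an LLM for this step."""
    return " ".join(c.split(". ")[0] for c in chunks)

doc = ingest("YourBench builds benchmarks   from your own documents. It chunks them first.")
print(chunk(doc, max_words=6))
```

The chunk size would be tuned to the target model's context window; the point is only that each downstream step operates on normalized, bounded pieces of text.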

The next step is the question-and-answer generation process, which creates questions from the information in the documents. Users then bring in their LLMs of choice and see which one answers the questions best.
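That final comparison step can be sketched as a simple scoring loop. This is a hedged illustration, not YourBench's method: `ask_model` is a stub standing in for a real LLM API call, and word overlap stands in for a real grading metric such as an LLM judge or exact match.

```python
# Hypothetical sketch of comparing candidate models on generated
# question/answer pairs. `ask_model` is a stub; word overlap is a
# toy stand-in for a real evaluation metric.

def ask_model(model: str, question: str) -> str:
    # Stub: a real harness would query the model's API here.
    canned = {"model-a": "Paris is the capital of France",
              "model-b": "I am not sure"}
    return canned[model]

def overlap(answer: str, reference: str) -> float:
    """Fraction of reference words present in the model's answer."""
    ref, got = set(reference.lower().split()), set(answer.lower().split())
    return len(ref & got) / max(len(ref), 1)

qa_pairs = [("What is the capital of France?",
             "The capital of France is Paris")]

scores = {}
for model in ("model-a", "model-b"):
    scores[model] = sum(overlap(ask_model(model, q), a)
                        for q, a in qa_pairs) / len(qa_pairs)

best = max(scores, key=scores.get)
print(best, scores)
```

Because the questions are derived from the organization's own documents, the leaderboard this loop produces reflects performance on the tasks that actually matter to that organization, rather than on a generic benchmark.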

Hugging Face tested YourBench with DeepSeek models, including the reasoning model R1, as well as Alibaba's Qwen models, including the reasoning model Qwen QwQ.

Shashidhar said that Hugging Face also offers cost analyses of the models, and found that Qwen and Gemini Flash "produce tremendous values for very very low prices."

Compute limitations

Creating custom LLM benchmarks from an organization's documents comes at a price: YourBench is computationally intensive. Shashidhar said on X that the company is "adding capacity" as fast as it can.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat contacted Hugging Face to inquire about YourBench's compute usage.

Benchmarking does not work perfectly

Benchmarks and other evaluation methods give users a good idea of how models perform, but they do not always accurately reflect how those models will function on a day-to-day basis.

Some have even expressed skepticism that benchmark tests capture models' limitations, arguing they can lead to false conclusions about models' safety and performance. One study also warned that benchmarking agents can be "misleading." This has prompted the development of alternative methods for testing model performance and reliability.

Google DeepMind launched FACTS Grounding, a test that measures a model's ability to generate factually accurate responses grounded in documents. Researchers from Tsinghua and Yale Universities developed self-invoking code benchmarks to help enterprises decide which coding LLMs work best for them.

