CoSyn is an open-source tool for making GPT-4V level vision AI available to everyone
Researchers at the University of California, Berkeley, the University of Pennsylvania, and the Allen Institute for Artificial Intelligence have developed a groundbreaking tool that allows open-source AI models to match or exceed the visual understanding abilities of proprietary models such as GPT-4V and Gemini 1.5 Flash, a development that could reshape the competitive landscape between open and closed AI.

The tool, called CoSyn (Code Guided Synthesis), addresses a critical bottleneck in AI development: the lack of high-quality training data to teach machines to understand complex visual information such as scientific charts, medical diagrams, and financial documents. Rather than scraping images from the web, a practice fraught with ethical and copyright concerns, CoSyn uses existing language models' coding capabilities to create synthetic training data.

In an exclusive interview with VentureBeat, Yue Yang, a recent Penn Engineering Ph.D. graduate and co-first author of the work, explained that models cannot be trained without richly annotated data such as documents and charts, and that such data has been missing. "Those images are actually more difficult to annotate than natural photos, such as a picture of a dog, cat, or house," she said.

The breakthrough comes at a time when enterprises are increasingly seeking AI systems that can understand and reason about complex visual information. The work was done during Yang's internship with the PRIOR team at the Allen Institute for AI and was supported by the Office of the Director of National Intelligence, the Intelligence Advanced Research Projects Activity (IARPA), and the Defense Advanced Research Projects Agency (DARPA).

How synthetic data generation can solve AI’s biggest challenge in training

Training AI to understand text-rich images is a long-standing problem. Scientific figures, charts, and documents require extensive annotations that are both time-consuming and expensive to produce. The traditional approach harvests images and their alt-text descriptions from the internet, but this method often yields superficial training data and raises legal problems.


CoSyn adopts a fundamentally different approach, built on the observation that most text-rich images are created by code in the first place: Python scripts draw charts, LaTeX renders mathematical equations, HTML defines web interfaces. The team's insight was that this process can be reversed, using language models to generate the code and then executing it to produce realistic synthetic images.

"One intuition is that these images are actually charts and documents, and we generate them with code," Yang said. "We use Python to create charts, and LaTeX or Word to create documents. How about we do it the other way around and generate the code? Text-only language models have proven to be very good at writing code."
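The generate-then-execute loop can be sketched in a few lines. This is a minimal illustration of the idea, not CoSyn's implementation: the `fake_llm_generate_code` function stands in for a real language-model call, and the "image" is an SVG chart built with pure Python so the sketch stays dependency-free.

```python
# Sketch of code-guided synthesis: a language model writes rendering
# code, and executing that code yields a synthetic text-rich image.

def fake_llm_generate_code(topic: str) -> str:
    """Stand-in for prompting an LLM to write chart-rendering code.
    A real system would send `topic` to a language model here."""
    return '''
bars = {"Q1": 40, "Q2": 55, "Q3": 70}
parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="220" height="130">']
for i, (label, value) in enumerate(bars.items()):
    x = 20 + i * 60
    parts.append(f'<rect x="{x}" y="{110 - value}" width="40" height="{value}"/>')
    parts.append(f'<text x="{x}" y="125">{label}</text>')
parts.append("</svg>")
svg = "\\n".join(parts)
'''

def render(code: str) -> str:
    """Execute the generated rendering script and return the SVG markup."""
    namespace = {}
    exec(code, namespace)  # run the LLM-written code
    return namespace["svg"]

svg_image = render(fake_llm_generate_code("quarterly sales bar chart"))
print(svg_image.splitlines()[0])  # the opening <svg> element
```

A side benefit of this design is that ground-truth annotations come for free: the generating code already knows every bar's label and value, so question-answer pairs about the image can be derived without any human labeling.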


Chris Callison-Burch, a computer science professor at Penn who co-advised the study, described the approach simply: "This is like asking a student who is great at writing to teach someone else how to draw, just by describing the drawing. We're essentially transferring the strengths of open-source AI from text to vision."

CoSyn models outperform GPT-4V and Gemini on key benchmarks

The results are striking. CoSyn models, trained on a synthetic dataset of 400,000 images and 2.7 million instruction pairs, achieved the best performance among open-source models and outperformed proprietary systems across seven benchmarks measuring text-rich image understanding.

Their 7-billion-parameter model scored an average of 80.9% on the benchmark suite, 3.9 percentage points better than the previous best open-source model (Llama 3.2 11B). More remarkably, even their "zero-shot" model, trained without any examples from the evaluation datasets, outperformed most open and closed models, demonstrating that capabilities learned from synthetic data transfer to unseen tasks.

CoSyn models outperformed GPT-4V and Gemini 1.5 Flash across seven text-rich benchmarks. (Credit: github.io/cosyn)

One particularly compelling demonstration involved a benchmark the researchers created called NutritionQA, a set of 100 questions about photos of nutrition labels. Trained on only 7,000 synthetically generated nutrition labels, their model outperformed models trained on millions of real images. As the researchers noted in their paper, despite being trained on millions of images, existing open-source VLMs are not data-efficient and perform poorly on this novel task compared with GPT-4V.

Yang stressed the point: "Those big companies have a lot of resources to collect data and run many experiments. But with open-source models, we can give people access to the model weights, the data we trained on, even the code and training scripts, all things that developers can build on."

Real companies are already using vision AI for quality control and automation

The technology has already found applications across industries. Callison-Burch gave the example of a company founded by one of his teaching assistants that uses vision-language models for quality assessment in cable installation: "The workers installing cables on site take photos of the process as they go, and the system automatically validates that each step has been followed correctly."

Specialized visual understanding could transform enterprise workflows ranging from automated document processing in financial services to quality control in manufacturing. By using synthetic data to train models for specific visual tasks, companies can create AI systems tailored to their needs.

The research suggests that enterprise decision makers should rethink their AI data strategy. “I believe synthetic data is an excellent way to eliminate the need for human annotation. It is cheaper, it generates large-scale data automatically, and can also avoid some copyright problems,” Yang noted.

The persona-driven mechanism that ensures diversity in AI training data

One of CoSyn's key innovations is a "persona-driven mechanism" that prevents the repetitive outputs common in AI-generated content. Each time CoSyn creates a synthetic example, it pairs the request with a randomly sampled persona. "Every time we generate one piece of synthetic data, we attach a randomly sampled persona," Yang explained. "This diversifies the styles and content of the generated examples. For example, if I give it the persona of a PhD student, it will generate something more academic or scientific." The system spans 11 rendering tools, ranging from Python's Matplotlib for charts to LaTeX for mathematical expressions, supported by 20 generation pipelines.
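Mechanically, the persona trick amounts to sampling a persona and prepending it to every generation request. Here is a minimal sketch; the persona list and prompt template are illustrative assumptions, not CoSyn's actual prompts, which draw from a far larger persona pool.

```python
import random

# Illustrative persona pool (assumption); the real system samples
# from a much larger set to maximize stylistic diversity.
PERSONAS = [
    "a PhD student preparing a physics paper",
    "a financial analyst building a quarterly report",
    "a nurse documenting patient vitals",
    "a teacher making a classroom handout",
]

def build_generation_prompt(task: str, rng: random.Random) -> str:
    """Pair a data-generation request with a randomly sampled persona,
    so repeated runs yield varied styles instead of one template."""
    persona = rng.choice(PERSONAS)
    return f"You are {persona}. Write code that renders {task}."

rng = random.Random(0)  # seeded for reproducibility
prompts = {build_generation_prompt("a bar chart", rng) for _ in range(20)}
print(len(prompts))  # several distinct prompt styles, not one repeated
```

Without the persona, every call would produce the identical prompt and the generator would tend to collapse onto a few stereotypical chart styles; the random persona keeps the synthetic distribution broad.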

Why this breakthrough could level the playing field between Big Tech and open source

The implications for the AI industry as a whole are significant. Major technology companies such as OpenAI and Google have invested billions of dollars in proprietary vision-language capabilities, creating systems whose training methods and data sources are trade secrets.

CoSyn offers an alternative path, allowing open-source efforts to compete without comparable resources.

"Open-source models are still behind closed-source models, but with the efforts and resources of the open-source community, we have more energy, from everyone. I think we are finally catching up," Yang said.

Openness here means more than releasing a model. The complete CoSyn codebase, the 400,000-image dataset, and all training scripts have been made publicly available, allowing researchers and companies around the world to build on the work. "On the academic side, a lot of research is built on openness," Yang said. "We need access to all the data, the code, everything, to make new discoveries and support the claims made in our papers."

Transparency is a growing concern with proprietary AI systems. "If you rely only on APIs from, say, OpenAI, that may not be a reliable way to prove your scientific discoveries, because the model behind them could change at any time. You never know what's going on behind the scenes," Yang pointed out.

Beyond static image understanding

CoSyn also pioneers capabilities crucial for the next generation of AI agents: systems that can navigate digital interfaces autonomously and perform complex tasks. The researchers generated synthetic "pointing data" that teaches models where to click on screenshots, a prerequisite for web automation.

Using 65,000 synthetic screenshots with click annotations, their model achieved the best performance on ScreenSpot, a click-prediction benchmark, outperforming systems trained on 1.3 million real screenshots. "Even though we only use tens of thousands of synthetic screenshots, we can outperform previous models trained on millions of real ones," Yang said.
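Conceptually, pointing data is cheap to synthesize because the renderer that draws the screenshot already knows every element's exact position. The sketch below shows the idea; the field names and instruction template are illustrative assumptions, not the paper's actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """A UI element whose exact position is known to the renderer."""
    label: str  # visible text, e.g. on a button
    x: int      # top-left corner of the bounding box
    y: int
    w: int      # box width
    h: int      # box height

def click_annotation(elem: Element) -> dict:
    """Emit an (instruction, click point) training pair, aiming at
    the center of the element's known bounding box."""
    return {
        "instruction": f'Click the "{elem.label}" button',
        "click": (elem.x + elem.w // 2, elem.y + elem.h // 2),
    }

# Because the screenshot is rendered from code, these coordinates are
# exact ground truth; no human labeling pass is needed.
submit = Element("Submit", x=120, y=300, w=80, h=24)
annotation = click_annotation(submit)
print(annotation["click"])  # (160, 312)
```

Collecting the same supervision from real screenshots would require humans to draw boxes by hand, which is precisely the annotation cost the synthetic pipeline avoids.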

This capability is essential as the industry moves toward AI agents that can perform autonomous knowledge work. Callison-Burch said there are two main models for building agents: one relies on specialized APIs, while the other relies on agents that "literally use web-browsing abilities the same way you and I do: clicking the mouse and thinking about where to click."

How synthetic data can help solve the growing copyright crisis in AI training

Synthetic data also offers a path around mounting legal challenges over AI training data. With litigation ongoing about whether copyrighted material can be used to train AI, synthetic data generation provides a viable alternative.

Callison-Burch, who testified before Congress in 2023 on AI and copyright, sees synthetic data as complementing rather than substituting for real-world data: "I don't think synthetic data eliminates the need for large amounts of diverse training data; that's still an essential element of training AI systems. But it allows you to extend their abilities in really remarkable ways. The underlying thing we're relying on here is a language model that can write code, something it learned from its original training data. We're now using that for a completely different application: creating new training data unlike any of the data it was trained on."

Despite its promise, synthetic data generation has important limitations. "One limitation is that it may inherit biases from the model that generates the synthetic data," Yang acknowledged. The approach can also struggle with diversity: "If you prompt a model to generate data across different runs, it may produce similar data." And some domains remain hard to cover: natural images such as chest X-rays and other medical scans are difficult to generate synthetically, Yang said, though she noted ongoing efforts to extend the approach to medical imaging.

Yang anticipates that synthetic data generation will become standard practice: "In two or three years, synthetic data will be a very important part of teaching models different capabilities." She also noted that the best results may come from combining real-world and synthetic data: "Real-world data reflects real-world distributions, while synthetic data can be large-scale and more controllable." "I heard that some teams at companies such as Meta and Amazon are using our data to train their models," Yang revealed in the interview.

The cost advantages for startups and smaller companies could be significant. "For some startups, it is cheaper to host an open model on their own servers than to keep calling APIs, which are less controllable," Yang said.

The research team's decision to make everything open source reflects a broader philosophy about AI development. As Yang prepares to join the Allen Institute full-time after completing her Ph.D., that commitment to open science remains central to the institute's mission. "At the moment, these vision-language models are very fragile; they just need the right data to achieve the desired capabilities," she said. "If you find the correct data, you can improve a model's capabilities, and that will benefit society."

The vision of AI that does more than describe

As the research moves from academic labs to real-world applications, the implications go beyond better benchmark scores. Yang and her collaborators are already focusing on applications that could change how people with disabilities interact with technology, from AI that understands sign language for the hearing impaired to systems that describe complex medical images for people with visual impairments.

Yang described one such direction: "I have an idea to let the model understand sign language, for people with hearing disabilities. If you can find the right data, you can improve the model's capability, and it will be beneficial to society." She also pointed to related work at the Allen Institute on creating simulated training data for robots.

The work is more than a technical accomplishment; it shows that open-source AI can compete with well-funded efforts from major technology companies through innovative approaches to fundamental problems. Reflecting on her choice of the Allen Institute over a higher-paying offer from Meta, Yang said: "I think it's still the very early stage of these multimodal models, and there are not many open resources or knowledge to share with the community."
