
How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell


How much information can LLMs actually memorize? Meta, Google, Nvidia, and Cornell have helped us to find out.


Most people interested in generative AI already know that large language models (LLMs), such as those behind ChatGPT, Anthropic’s Claude and Google’s Gemini, are trained on massive datasets: trillions of words scraped from websites, codebases, books and, increasingly, other media such as images, audio and video. Why so much data?

From this data, LLMs develop a generalized statistical understanding of language and its patterns. That understanding is encoded in billions of parameters, or “settings,” spread across a network of artificial neurons (mathematical functions that transform input data into output signals).

Exposed to all of this training data, LLMs learn to detect patterns and reflect them in their neurons’ parameters. The word “apple,” for example, appears frequently near terms related to food, fruit, trees and even computers. The model picks up that apples are edible; can be red, green or yellow (or occasionally other colors when rotten or rare); and are spelled “a-p-p-l-e” in English. This statistical knowledge shapes how the model responds to a user’s input, guiding the output toward the associations it “learned” from the training data.
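As a rough illustration of the kind of statistical association described above (a toy sketch with a made-up corpus, far simpler than what a transformer actually learns), the snippet below counts which words co-occur with “apple” in a handful of sentences:

```python
from collections import Counter

# Tiny, made-up corpus; a real LLM is trained on trillions of tokens.
corpus = [
    "the apple fell from the tree",
    "she ate a red apple for lunch",
    "apple released a new computer",
    "green apple pie is a classic dessert",
]

# Count words that appear in the same sentence as "apple".
neighbors = Counter()
for sentence in corpus:
    words = sentence.split()
    if "apple" in words:
        neighbors.update(w for w in words if w != "apple")

# Frequent neighbors (tree, red, computer, green, ...) hint at the kinds of
# associations an LLM's parameters end up encoding statistically.
print(neighbors.most_common(5))
```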

However, an open question among AI researchers is: how much of this training data does an LLM use to build generalized concepts, and how much does it memorize, storing content in a form identical or nearly identical to the original?

The answer matters not only for understanding how LLMs work and when they go wrong, but also for model providers defending themselves in copyright infringement suits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data verbatim, courts may be more inclined to side with plaintiffs who claim the models illegally copied protected material. If instead the models generate outputs based on generalized patterns rather than exact replication, developers may be able to keep scraping and training on copyrighted data under existing legal defenses such as fair use.

We now have an answer to the question of how much LLMs memorize versus generalize. A study released this week by researchers at Meta, Google DeepMind, Nvidia and Cornell University found that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

To understand what 3.6 bits means in practice:

  • A single bit is the smallest unit of digital data, representing either a 0 or a 1. One byte is made up of eight bits.
  • 3.6 bits can distinguish approximately 12.13 different values (2^3.6 ≈ 12.13). That is roughly the amount of information needed to choose one of 12 options, such as a month of the year or the outcome of rolling a 12-sided die. It is not enough to store one English letter drawn from all 26 (which requires about 4.7 bits), but it is enough to encode a character from a set of the 10 most common letters (which needs about 3.32 bits).
  • In bytes, 3.6 bits is 0.45 bytes, less than half the size of an ASCII character (which uses 8 bits, or 1 byte). The short calculation after this list reproduces these numbers.
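These figures follow from basic information theory; here is a minimal sketch (plain Python, unrelated to the paper’s code) that reproduces them:

```python
import math

bits = 3.6

# Number of distinct values 3.6 bits can represent: 2 ** 3.6
print(2 ** bits)          # ~12.13

# Bits needed to pick one of the 26 English letters: log2(26)
print(math.log2(26))      # ~4.70

# Bits needed to pick one of the 10 most common letters: log2(10)
print(math.log2(10))      # ~3.32

# 3.6 bits expressed in bytes (8 bits per byte)
print(bits / 8)           # 0.45
```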

The figure held steady across reasonable architectural variations: different depths, widths, model sizes and precision levels produced similar results, with full-precision (float32) models reaching slightly higher values (up to 3.83 bits per parameter).

More training data does not lead to more memorization. In fact, any single data point becomes less likely to be memorized.

One key finding is that models do not memorize more when trained on more data. Instead, a model's fixed capacity is distributed across the dataset, meaning each individual data point receives a smaller share of it.

Lead author Jack Morris explained on the social network X that “training models on more data will make them memorize less per sample.”

This finding may help ease concerns about large models memorizing sensitive or copyrighted content.

If memorization is limited and diluted across many examples, the likelihood of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.
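To make the dilution concrete, here is a back-of-the-envelope sketch that simply divides a model’s total capacity (parameter count times the paper’s ~3.6 bits per parameter) evenly across training samples; this is a simplification for illustration, not the paper’s actual analysis:

```python
def approx_bits_memorized_per_sample(num_parameters: int,
                                     num_training_samples: int,
                                     bits_per_param: float = 3.6) -> float:
    """Rough upper bound: total memorization capacity spread evenly over the dataset."""
    total_capacity_bits = num_parameters * bits_per_param
    return total_capacity_bits / num_training_samples

# The same 1.5B-parameter model, trained on ever-larger datasets:
for n_samples in (1_000_000, 100_000_000, 10_000_000_000):
    per_sample = approx_bits_memorized_per_sample(1_500_000_000, n_samples)
    print(f"{n_samples:>14,d} samples -> ~{per_sample:,.2f} bits per sample")
```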

How the researchers measured memorization

To quantify exactly how much language models remember, the researchers used a powerful but unconventional approach: they trained transformer models on datasets of uniformly random bitstrings. Each bitstring was sampled independently, ensuring there were no patterns, structures or redundancies shared across examples.

Because each sample is unique and shares no features with any other, any ability the model shows in identifying or reconstructing these strings during evaluation directly reflects how much information it retained during training.

This setup was designed to eliminate any possibility of generalization. Uniform random data contains none of the structure found in natural language, which is full of grammatical patterns, semantic overlaps and repeated ideas. Each example is essentially noise, with no statistical relationship to any other. In such a scenario, the model’s performance on the data must come purely from memorization, since there is no distributional pattern to generalize from.

According to the authors, this method is perhaps one of the few principled ways to decouple memorization from learning in practice: when LLMs are trained on real language and produce outputs that match their training data, it is hard to tell whether they memorized the input or simply inferred its underlying structure from the patterns they observed. The approach let the researchers map the relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed consistent results.
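A minimal sketch of that experimental idea, with hypothetical helper names (this is not the authors’ code): generate structureless random bitstrings, then count as “memorized” any bits by which the trained model can encode a string more compactly than its raw length, since with pure noise there is no structure for that saving to come from.

```python
import secrets

def random_bitstring(length_bits: int = 64) -> str:
    """Sample a uniformly random bitstring; there is no structure to generalize from."""
    return format(secrets.randbits(length_bits), f"0{length_bits}b")

# A synthetic dataset in the spirit of the study: every example is pure noise.
dataset = [random_bitstring() for _ in range(10_000)]

def memorized_bits(raw_length_bits: float, model_code_length_bits: float) -> float:
    """
    Hypothetical scoring helper: if a trained model assigns a string a code
    length (-log2 of its probability) shorter than the string's raw length,
    the difference can only reflect memorization.
    """
    return max(0.0, raw_length_bits - model_code_length_bits)

# e.g. a 64-bit string the model can encode in 40 bits -> ~24 bits memorized
print(memorized_bits(64, 40.0))
```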

They also applied their methodology to models trained on real-world datasets. Models trained on text showed a balance between memorization and generalization.

As datasets grew, models shifted toward learning generalizable patterns. This transition was marked by a phenomenon called “double descent,” in which performance temporarily drops before improving once generalization kicks in.

The study also examined the effect of model precision, comparing training in bfloat16 against float32. The researchers observed a modest increase in capacity, from 3.51 to 3.83 bits per parameter, when switching to 32-bit precision. That gain is far smaller than what doubling the available bits would suggest, implying diminishing returns from higher precision.
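A quick, purely arithmetic comparison (not from the paper) of the measured gain against the naive expectation from doubling each parameter’s storage:

```python
bf16_capacity = 3.51   # bits per parameter measured with bfloat16 training
fp32_capacity = 3.83   # bits per parameter measured with float32 training

measured_gain = fp32_capacity / bf16_capacity - 1
print(f"measured gain: {measured_gain:.1%}")    # ~9%

# Doubling each parameter's storage from 16 to 32 bits would naively suggest
# roughly doubled capacity (a ~100% gain), far more than the ~9% observed.
naive_gain = 32 / 16 - 1
print(f"naive expectation: {naive_gain:.0%}")   # 100%
```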

Unique data is more likely to be memorized.

The paper also proposes a scaling law that relates a model’s capacity and dataset size to the effectiveness of membership inference attacks.

Such attacks attempt to determine whether a particular data point was part of a model’s training set. The research shows that these attacks become less reliable as dataset size increases, supporting the argument that large-scale training reduces privacy risk.
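The qualitative intuition can be sketched with an entirely illustrative formula (not the paper’s fitted scaling law): as the dataset grows relative to a model’s fixed capacity, the bits available per example shrink, and a membership inference attack’s advantage over random guessing should fall toward zero.

```python
def illustrative_mi_advantage(capacity_bits: float, dataset_size: int) -> float:
    """
    Purely illustrative, not the paper's scaling law: the fewer capacity bits
    available per training example, the closer a membership inference attack
    should sit to chance (advantage -> 0).
    """
    bits_per_example = capacity_bits / dataset_size
    return bits_per_example / (1.0 + bits_per_example)

capacity = 1_500_000_000 * 3.6  # ~5.4 billion bits for a 1.5B-parameter model
for n in (1_000_000, 1_000_000_000, 100_000_000_000):
    print(f"{n:>15,d} examples -> advantage ~{illustrative_mi_advantage(capacity, n):.3f}")
```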

While the paper focuses primarily on average-case behavior, some researchers have pointed out that certain types of data, such as highly unique or stylized writing, may still be more susceptible to memorization.

While acknowledging this limitation, the authors emphasize that their method was designed to characterize trends in general rather than edge cases.

Toward greater human understanding of LLM “understanding.”

The study introduces a principled, quantifiable definition of memorization, giving researchers and developers new tools for evaluating the behavior of language models. This is helpful not only for model transparency, but also for compliance, privacy and ethical standards in AI development. The findings suggest that training large-scale models on more data, not less, may be the safer option.

To give a sense of the total model memory:

  • A 500K-parameter model can memorize approximately 1.8 million bits, or about 225 KB, of data.
  • A 1.5-billion-parameter model can store approximately 5.4 billion bits, or about 675 megabytes, of raw information.
  • While that is small compared with typical media file sizes (a 3.6MB uncompressed image is about 30 million bits), it is significant when distributed across discrete textual patterns. The short calculation after this list reproduces the arithmetic.
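These figures are simply the parameter count multiplied by the ~3.6 bits-per-parameter estimate; the sketch below reproduces the arithmetic:

```python
bits_per_param = 3.6

for params in (500_000, 1_500_000_000):
    total_bits = params * bits_per_param
    total_megabytes = total_bits / 8 / 1e6   # decimal megabytes
    print(f"{params:>13,d} parameters -> {total_bits:,.0f} bits "
          f"(~{total_megabytes:,.3f} MB)")

# For scale: a 3.6MB uncompressed image is 3.6e6 * 8 = 28.8 million bits,
# roughly the 30 million cited above.
print(f"{3.6e6 * 8:,.0f} bits")
```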

Although I am not a lawyer or legal expert, I would expect this research to be cited in the many ongoing lawsuits between AI providers and data creators or rights holders.
