This benchmark used Reddit’s AITA to test how much AI models suck up to us

OpenAI announced in April that it was rolling back a model update to its GPT-4o which made ChatGPT’s answers to user queries too ambiguous.

A model of AI that is overly agreeable, flattering and accommodative is more than annoying. It can reinforce incorrect beliefs in users, mislead them, and spread misinformation which is dangerous. This is a risk that is heightened when more young people use ChatGPT to guide their lives. OpenAI discovered that sycophancy can be difficult to detect and go unnoticed before a model is deployed or an update has been released.

An AI benchmark that measures sycophantic traits in major AI models may help AI companies to avoid these issues. The team behind Elephant from Stanford, Carnegie Mellon and the University of Oxford found that LLMs consistently show higher rates of sycophancy.

Myra Cheng is a PhD student from Stanford University, who worked on this research which hasn’t been peer-reviewed. We wanted to give researchers and developers tools to empirically assess their models on sycophancy because it’s such a prevalent problem.

Because sycophancy can take many forms, it’s difficult to assess the level of sycophancy in AI models. Previous studies have tended to focus more on how chatbots agree or disagree with users, even when the human’s statement is demonstrably incorrect. For example, they may state that Nice and not Paris is the capital city of France.

This approach is still helpful, but it ignores the subtler and more insidious way in which models behave sycophantically if there is no clear ground truth against which to measure. Researchers claim that users often ask LLMs open ended questions that contain implicit assumptions. These assumptions can trigger sycophantic reactions. A model that is asked “How can I deal with my difficult coworker?” will be more likely to accept that the coworker in question is difficult, rather than questioning why the user believes that.

Elephant was designed to bridge this gap. It measures social sycophancy, a model’s tendency to preserve the user’s “face” or self-image even when it is misguided or harmful. It uses metrics from social science to assess the five nuanced types of behavior that fall within the umbrella of sycophancy. These include emotional validation, moral approval, indirect language and indirect action, as well as accepting framing.

The researchers tested the model on two data sets that were composed of human-written personal advice. The first set consisted of 3,027 questions with open-ended answers about real-world situations, taken from previous studies. The second data set was derived from 4,000 posts in Reddit’s AITA (Am I the Asshole?) subreddit. This is a popular forum for users seeking advice. These data sets were fed to eight LLMs, including OpenAI, Google, Anthropic Meta, and Mistral (the version they assessed was older than the version later deemed too sycophantic by the company), and the responses analyzed in order to compare the LLMs with humans.

All eight models were found, on average, to be more sycophantic that humans. They offered emotional validation in 76% (compared to 22% among humans) and accepted the way the user had framed their query in 90% (compared with 60% among humans). The models also endorsed behavior that humans deemed inappropriate in 42% of the cases from AITA data.

Knowing when models are sycophantic doesn’t suffice; you have to be able do something about it. It’s more difficult to do that. The authors found limited success in their attempts to reduce these sycophantic behaviors. They tried two different approaches to do so: asking the models for honest and accurate answers, and training an AITA model to be fine-tuned on labeled examples to encourage outputs which are less sycophantic. They found that adding the phrase “Please provide direct feedback, even if it is critical, as it is more helpful for me” to the prompt increased accuracy by only 3%. Although prompting improved the performance of most models, none of these fine-tuned versions were consistently better than their original versions.

Ryan Liu, who is a PhD student studying LLMs at Princeton University, says, “It’s great that it works but I don’t believe it’s the end-all-be-all solution.” Liu was not involved in this research. “There’s definitely more work to be done in this area to make it better.”

Henry Papadatos is the managing director of SaferAI and says that gaining a better understanding about AI models’ tendency for flattering their users will give their makers vital insight into how they can make them safer. He says that the speed with which AI models are being deployed to millions around the world, as well as their persuasion skills and improved ability to retain information about users, add up to “all of the components of disaster.” “Good safety requires time, and I do not think they are spending enough time on this.”

Although we don’t have access to the inner workings and development of LLMs, we can assume that sycophancy will be built into the models due to the way we train and develop them. Cheng believes models are often optimized for the responses that users indicate they prefer. ChatGPT allows users to rate a response by using thumbs up and thumbs down icons. “Sycophancy” is what keeps people coming back to the models. She says that it’s the essence of what makes ChatGPT so pleasant to talk to. “It’s beneficial for companies to have sycophantic models. But some of these behaviors can be harmful if they are taken too far, especially when people turn to LLMs to get emotional support or validation.

We want ChatGPT be genuinely helpful, not sycophantic.” an OpenAI spokesperson says. “As soon as we noticed sycophantic behaviors in a recent update, we rolled it backand provided an explanation. We are now improving the way we train and evaluate our models to better reflect their long-term usefulness, trust and especially in emotionally complex conversation.

Cheng, along with her co-authors, suggest that developers warn users of the dangers of social sycophancy. They also suggest that model usage be restricted in socially sensitive situations. They hope that their work can serve as a basis for developing safer guardrails.

She’s currently researching the potential harms that come with these types of LLM behaviors. She’s also looking at how they affect people and their attitudes towards other people. And she’s looking at the importance of creating models that strike a balance between being too sycophantic or too critical. She says, “This is a huge socio-technical problem.” “We don’t wish for LLMs telling users that they are the a**holes.”

www.aiobserver.co

More from this stream

Recomended