Researchers and users of LLMs have long been aware that AI models have a troubling tendency to tell people what they want to hear, even if that means being less accurate. But many reports of this phenomenon amount to mere anecdotes that don’t provide much visibility into how common this sycophantic behavior is across frontier LLMs. Two recent research papers have come at this problem a bit more rigorously, though, taking different tacks in attempting to quantify exactly how likely an LLM is to listen when a user provides factually incorrect or socially inappropriate information in a prompt.

Solve this flawed theorem for me

In one pre-print study published this month, researchers from Sofia University and ETH Zurich looked at how LLMs respond when false statements are presented as the basis for difficult mathematical proofs and problems. The BrokenMath benchmark that the researchers constructed starts with “a diverse set of challenging theorems from advanced mathematics competitions held in 2025.” Those problems are then “perturbed” into versions that are “demonstrably false but plausible” by an LLM that’s checked with expert review.

The researchers presented these “perturbed” theorems to a variety of LLMs to see how often they sycophantically try to hallucinate a proof for the false theorem. Responses that disproved the altered theorem were deemed non-sycophantic, as were those that merely reconstructed the original theorem without solving it or that identified the original statement as false.

While the researchers found that “sycophancy is widespread” across the 10 evaluated models, the exact extent of the problem varied heavily depending on the model tested. At the top end, GPT-5 generated a sycophantic response just 29 percent of the time, compared to a 70.2 percent sycophancy rate for DeepSeek. But a simple prompt modification that explicitly instructs each model to validate the correctness of a problem before attempting a solution reduced the gap significantly; DeepSeek’s sycophancy rate dropped to just 36.1 percent after this small change, while the tested GPT models improved much less.
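To make the mitigation concrete: it amounts to prepending a validation instruction to the prompt before the (possibly false) theorem. Below is a minimal sketch of that idea in Python, assuming a hypothetical `query_model` helper in place of a real LLM API; the instruction wording and the `looks_sycophantic` heuristic are illustrative, not the paper’s actual prompt or judging pipeline.

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call (vendor SDK, local model, etc.)."""
    raise NotImplementedError("Wire this up to the model you want to evaluate.")


def ask_with_validation(theorem_statement: str) -> str:
    """Prepend an instruction to check the statement's correctness before proving it."""
    prompt = (
        "Before attempting a proof, first check whether the following statement "
        "is actually true. If it is false, say so and explain why instead of "
        "producing a proof.\n\n"
        f"Statement: {theorem_statement}"
    )
    return query_model(prompt)


def looks_sycophantic(response: str) -> bool:
    """Crude keyword heuristic: treat a response as sycophantic if it never
    disputes the statement. The actual benchmark relies on expert review and
    model-based judging rather than string matching."""
    disputes = any(
        phrase in response.lower()
        for phrase in ("false", "counterexample", "does not hold", "cannot be true")
    )
    return not disputes
```

The point of the sketch is only to show where the extra instruction sits relative to the problem statement; evaluating responses at the benchmark’s level of rigor requires the kind of expert and automated review described in the study.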