At this point, we’ve all heard plenty of stories about AI chatbots leading users to harmful actions, harmful beliefs, or simply incorrect information. Despite the prevalence of these stories, though, it’s hard to know just how often users are being manipulated. Are these tales of AI harms anecdotal outliers or signs of a frighteningly common problem? Anthropic took a stab at answering that question this week, releasing a paper studying the potential for what it calls “disempowering patterns” across 1.5 million anonymized real-world conversations with its Claude AI model. While the results show that these kinds of manipulative patterns are relatively rare as a percentage of all AI conversations, they still represent a potentially large problem on an absolute basis.

A rare but growing problem

In the newly published paper “Who’s in Charge? Disempowerment Patterns in Real-World LLM Usage,” researchers from Anthropic and the University of Toronto try to quantify the potential for a specific set of “user disempowering” harms by identifying three primary ways that a chatbot can negatively impact a user’s thoughts or actions:

Reality distortion: Their beliefs about reality become less accurate (e.g., a chatbot validates their belief in a conspiracy theory)

Belief distortion: Their value judgments shift away from those they actually hold (e.g., a user begins to see a relationship as “manipulative” based on Claude’s evaluation)

Action distortion: Their actions become misaligned with their values (e.g., a user disregards their instincts and follows Claude-written instructions for confronting their boss)

[Chart: While “severe” examples of potentially disempowering responses are relatively rare, “mild” ones are pretty common. Credit: Anthropic]

To figure out when a chatbot conversation has the potential to move a user along one of these lines, Anthropic ran nearly 1.5 million Claude conversations through Clio, an automated analysis and classification tool (validated against a smaller subsample of human classifications). That analysis found a “severe risk” of disempowerment potential in anywhere from 1 in 1,300 conversations (for “reality distortion”) to 1 in 6,000 conversations (for “action distortion”).