Leading AI Models Show Persistent Hallucinations Despite Accuracy Gains

Recent tests by the European Broadcasting Union found that artificial intelligence assistants misrepresented news content in 45 percent of evaluated cases across languages and regions, highlighting persistent concerns about accuracy as AI adoption grows.

The EBU results underscore the importance of evaluating AI performance across a wider range of systems. To track this, Artificial Analysis maintains continuously updated data on leading models; a snapshot captured by Digital Information World on 1 December 2025 reflects current trends in accuracy and hallucination rates. These results show what users encounter in real-world deployments rather than in theoretical benchmarks.

The analysis measures two core metrics. Accuracy reflects the proportion of correct answers, while hallucination rate captures how often a model provides an incorrect response when it should refuse or indicate uncertainty. Together, these metrics provide a clearer picture of reliability across different AI systems.
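To make the two metrics concrete, here is a minimal sketch of how they could be computed from graded responses. This is an illustration under stated assumptions, not the Artificial Analysis methodology: it assumes each response is labeled "correct", "incorrect", or "abstain" (the model refused or signaled uncertainty), and that hallucination rate is the share of wrong answers among questions the model did not get right.

```python
def score(responses):
    """Return (accuracy, hallucination_rate) as fractions.

    accuracy: share of all prompts answered correctly.
    hallucination_rate: among prompts not answered correctly, the share
    where the model gave a wrong answer instead of abstaining.
    (Illustrative definitions, not the benchmark's exact formulas.)
    """
    total = len(responses)
    correct = responses.count("correct")
    incorrect = responses.count("incorrect")
    abstain = responses.count("abstain")

    accuracy = correct / total
    # A model that refuses when unsure lowers this rate even if its
    # raw accuracy is unchanged.
    not_correct = incorrect + abstain
    hallucination_rate = incorrect / not_correct if not_correct else 0.0
    return accuracy, hallucination_rate

# Example: 10 prompts — 4 correct, 3 wrong answers, 3 refusals.
acc, hall = score(["correct"] * 4 + ["incorrect"] * 3 + ["abstain"] * 3)
# acc = 0.4, hall = 0.5
```

Under these definitions, a model can score low on accuracy yet also low on hallucinations simply by abstaining often, which is why the article treats the two numbers as complementary rather than interchangeable.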

Hallucination rates vary widely. Claude 4.5 Haiku reports the lowest rate at 26 percent, followed by Claude 4.5 Sonnet at 48 percent and GPT-5.1 (High) at 51 percent. Claude Opus 4.5 reaches 58 percent.

Other models perform worse. Grok 4 records 64 percent, Kimi K2 0905 69 percent, and Grok 4.1 Fast 72 percent. Kimi K2 Thinking reaches 74 percent, and Llama Nemotron Super 49B v1.5 76 percent.

DeepSeek models are among the least reliable. V3.2 Ex records 81 percent, R1 0528 83 percent, and EXAONE 4.0 32B 86 percent. Llama 4 Maverick posts 87.58 percent, while multiple Gemini variants exceed 87 percent. GLM-4.6 and gpt-oss-20B (High) top the chart above 93 percent.

Accuracy remains limited. Gemini 3 Preview (High) leads at 54 percent, followed by Claude Opus 4.5 at 43 percent and Grok 4 at 40 percent. Gemini 2.5 Pro reaches 37 percent, GPT-5.1 (High) 35 percent, and Claude 4.5 Sonnet 31 percent.

Most other models fall into the twenties or teens, showing that gains in accuracy do not automatically translate into fewer confident errors.

These findings illustrate that AI assistants continue to face significant challenges in providing consistent and reliable responses, emphasizing the need for ongoing monitoring and careful deployment in real-world settings.
