Google AI Overviews Errs Rarely but Massively: 57 Million Incorrect Answers Per Hour
An analysis published in recent days by the New York Times puts one of Google's most important strategic assets under scrutiny: AI Overviews, the Gemini-generated answer boxes that appear at the top of search results. The headline number looks reassuring, but a closer reading reveals a problem of scale: AI Overviews are correct about 90-91% of the time, yet with roughly five trillion searches processed by Google each year, even a 10% error rate translates into more than 57 million incorrect answers every hour.
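The arithmetic behind that headline figure is easy to check. A minimal sketch in Python, using only the two inputs reported in the article (five trillion annual searches, a ~10% error rate) and assuming, as the article's multiplication implicitly does, that every search surfaces an AI Overview:

```python
# Back-of-the-envelope check of the article's headline figure.
# Inputs from the article: ~5 trillion searches/year, ~10% error rate.
# Assumption (implicit in the article's multiplication): every search
# surfaces an AI Overview.

HOURS_PER_YEAR = 365 * 24  # 8,760

def incorrect_answers_per_hour(searches_per_year: float, error_rate: float) -> float:
    """Incorrect answers generated per hour at a given error rate."""
    return searches_per_year * error_rate / HOURS_PER_YEAR

print(f"{incorrect_answers_per_hour(5e12, 0.10):,.0f}")  # ≈ 57,077,626
```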
The benchmark used in the study is SimpleQA, applied by the AI research startup Oumi in its analysis, which comprises over 4,000 questions with verifiable factual answers. The results show steady improvement: with Gemini 2 (October 2025), AI Overviews answered 85% of questions accurately; with Gemini 3 (February 2026), the rate rose to 91%. This is real progress, but not enough to contain the absolute volume of errors generated by the search engine's global scale.
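To make concrete how a benchmark of this kind produces an accuracy percentage, here is a deliberately simplified sketch: the question data below is invented, and a normalized string match stands in for the more careful grading a real SimpleQA harness would use.

```python
# Deliberately simplified scoring loop for a SimpleQA-style benchmark.
# Real harnesses grade answers more carefully (typically with an LLM
# grader); a normalized exact match stands in here. Data is invented.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def accuracy(model_answers: list[str], gold_answers: list[str]) -> float:
    """Fraction of model answers matching the single verifiable gold answer."""
    matches = sum(
        normalize(m) == normalize(g)
        for m, g in zip(model_answers, gold_answers)
    )
    return matches / len(gold_answers)

gold = ["May 11, 1986", "Little River", "Paris"]
model = ["1987", "Little River", "Paris"]  # one wrong answer out of three
print(f"{accuracy(model, gold):.0%}")  # 67%
```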
Google vs Oumi: Two Measures, Two Realities
The most significant rift emerging from the investigation concerns not the numbers themselves but who produces them and by what methods. For Oumi, SimpleQA is an industry-recognized standard for evaluating the factual accuracy of generative AI models: over 4,000 questions with unique, verifiable answers, designed to measure how reliably a system states concrete facts. Applied to AI Overviews running on Gemini 3, it yields an error rate of 9-10%.
Google contests this view on two fronts. The first is methodological: according to the company, SimpleQA contains errors of its own and, crucially, does not reflect the real distribution of user queries, since people searching on Google rarely pose the kind of clear-cut, verifiable questions the benchmark contains. The second front is more uncomfortable. Internal Google data, surfaced by the NYT's reporting rather than voluntarily disclosed by the company, indicates that Google's own assessment, based on a narrower dataset called SimpleQA Verified with more stringently validated answers, puts the error rate at 28%. That is roughly three times Oumi's figure, and it was produced with a tool Google itself considers more reliable than the external benchmark. The resulting contradiction raises more than a few eyebrows: Mountain View argues that SimpleQA overestimates errors while its own internal data paints an even worse picture.
Google has reiterated that AI Overviews are accompanied by links to sources and an explicit warning at the bottom of the box: "AI responses may contain errors." The official position is that this disclaimer is sufficient to inform users. Pratik Verma of Okahu, quoted in the NYT investigation, noted that Google's technology "is comparable to other leading AI systems": a statement that sounds like a defense but in fact only underscores that all cutting-edge language models hallucinate at significant rates, and that none were designed to answer five trillion searches per year as a primary publisher of information.
The Phenomenon of "Ungrounded" Responses
Beyond the benchmark dispute, there is a third data point that neither party contests, and it is probably the most insidious: so-called "ungrounded" responses. These are answers that may be technically correct but whose cited sources do not actually support the claim being made. In October 2025, 37% of correct responses were "ungrounded"; by February 2026, that share had risen to 56%, even as the model became more accurate overall. Put plainly, more than one in two answers that Google presents as correct cannot be verified by clicking through to the cited sources. The citation system, which should let users trace information back to its origin, is in most cases decorative.
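Combining the two published series shows what this means for the share of answers that are both correct and verifiably sourced. A short sketch, assuming (as the study's framing suggests) that the ungrounded percentages apply to correct answers:

```python
# Share of all answers that are both correct AND supported by their
# citations, from the two published rates. Assumes the "ungrounded"
# percentage applies to correct answers, as the study's framing suggests.

def correct_and_grounded(accuracy: float, ungrounded_among_correct: float) -> float:
    return accuracy * (1 - ungrounded_among_correct)

snapshots = [
    ("Oct 2025 (Gemini 2)", 0.85, 0.37),
    ("Feb 2026 (Gemini 3)", 0.91, 0.56),
]
for label, acc, ungrounded in snapshots:
    print(f"{label}: {correct_and_grounded(acc, ungrounded):.0%} correct and grounded")
# Oct 2025 (Gemini 2): 54% correct and grounded
# Feb 2026 (Gemini 3): 40% correct and grounded
```

On these figures, the share of answers a user could actually verify fell from about 54% to about 40% between the two snapshots, even as raw accuracy rose from 85% to 91%.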
Among the 5,380 sources analyzed in the Oumi study, Facebook and Reddit ranked as the second and fourth most cited domains in AI Overviews, respectively. In incorrect responses, Facebook was cited in 7% of cases, compared to 5% of correct answers. The model does not reliably distinguish between an academic source and a post in a Facebook group, nor between an official page and a Reddit thread.
Documented Errors and the Case of Medical Queries
The NYT investigation documented specific, verifiable errors. AI Overviews gave 1987 as the year the Bob Marley Museum opened (the correct date is May 11, 1986), offered information about Hulk Hogan's alleged death without noting that it contradicted news articles visible just below the AI box, and misidentified the river flowing west of Goldsboro, NC, naming the Neuse River instead of the Little River.
The most critical terrain, however, remains medical queries. A January 2026 investigation by the Guardian, cited in the NYT article, had already documented AI Overviews giving dangerous health advice in 44% of the medical searches analyzed, including incorrect guidance for cancer patients and misleading interpretations of liver function tests. Google responded by removing AI Overviews from a subset of health queries, without disclosing which queries were excluded or the criteria used to select them.
A Problem of Scale That No Methodology Resolves
Whether one accepts Oumi's 10% or Google's 28%, the underlying problem is unchanged: either figure, multiplied by the scale of the world's most used search engine, produces a volume of misinformation without precedent in media history (at 28%, the same arithmetic behind the headline figure yields roughly 160 million incorrect answers per hour). When Google introduced AI Overviews in 2024, it transformed itself from an aggregator of links into a direct publisher of content. That transition shifted responsibility for the accuracy of answers onto Google itself, a responsibility that the numbers emerging from the NYT investigation, whichever benchmark one chooses to believe, suggest is not yet managed adequately for a system of this scale.