Benchmarks are the latest frontier of AI hype

We’re sorry to report that ~vibes~ might be the only way to judge AI models.

Driving the news: Anthropic’s Claude AI model knocked GPT-4 from the top of Chatbot Arena, a crowdsourced chatbot ranking site. It’s the first time an OpenAI model hasn’t been number one on the leaderboard, feeding into debates about how to evaluate the “best” AI model.

  • Visitors to Chatbot Arena interact with unlabelled chatbots and decide which one answers best. More than 477,000 votes have created a leaderboard of 75 AI models.
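For the curious, here is a rough sketch of how head-to-head votes can be turned into a ranking. It uses a simple Elo-style update, which is an assumption made for illustration and not necessarily how Chatbot Arena computes its scores. The model names, the votes, and the `k` and `base` values are all made up.

```python
from collections import defaultdict


def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Rank models from pairwise votes with a simple Elo-style update.

    votes: iterable of (winner, loser) model-name pairs (hypothetical data).
    k and base are illustrative defaults, not values from any real leaderboard.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected probability) moves the ratings more.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Made-up votes: each pair is (model the visitor preferred, model that lost).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

for name, rating in elo_leaderboard(votes):
    print(f"{name}: {rating:.0f}")
```

The takeaway is that each score only means something relative to the other models on the board: it summarizes who tends to win matchups, not how well a model does on any particular task.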

Catch-up: Announcements about new AI models inevitably come with charts filled with percentages and scores on tests with weird and inscrutable names like “AI2 Reasoning Challenge” and “HellaSwag.” Companies parade these benchmarks to objectively prove their AI is better than the competition.

Yes, but: What these benchmarks measure is pretty opaque to anyone who isn’t a researcher or developer, and there’s growing skepticism about how well they judge an AI’s abilities.

  • Benchmarks don’t always reflect real-world tasks. A chatbot could ace a test of PhD-level science questions, but that doesn’t mean it can write emails to your boss.
  • Tech companies can cherry-pick the results they share, leaving consumers without a full picture.
  • HellaSwag is meant to measure an AI’s commonsense reasoning, but its questions were found to include typos and obvious writing errors.

Why it matters: If you need help deciding which AI to use (and potentially spend a lot of money on), you seem to have two choices: rely on subjective evaluations like Chatbot Arena, which are driven by the experiences of strangers, or rely on a tech company’s own PR hype.

  • But as TechCrunch laid out, even though subjective reviews have their problems, they can be more transparent and truthful than a press release.

Bottom line: Don’t stress about understanding AI benchmarks, which are more useful for researchers than users. Frustrating as it is, developer Simon Willison has pointed out that evaluating a chatbot based on vibes is probably closer to what you want to know.