Efficiency Index rankings (higher = better value):
- DeepSeek V3.2-Exp (Sep 2025) — 69.26
- Kimi K2 Thinking (Nov 2025) — 66.19
- Gemini 2.5 Flash (May 2025) — 58.73
- Qwen 3 Max (Jul 2025) — 55.56
- Grok 4.1 (Nov 2025) — 22.18
- GPT-5 (Aug 2025) — 21.25
- o3 (Apr 2025) — 20.39
- Gemini 2.5 Pro (Mar 2025) — 19.98
- Gemini 3 Pro (Nov 2025) — 19.82
- Claude 3.5 Sonnet (Aug 2025) — 10.17
- GPT-5 Pro (Aug 2025) — 1.96
AI Model Efficiency Index 2.1 — Methodology Summary
Goal: Rank AI models by real-world value (performance per dollar) using harder, less-contaminated benchmarks.
Benchmarks Used (8 metrics):
- 20% SWE-bench – real-world coding tasks (repo-level bug fixes)
- 15% MMLU-Pro – harder general knowledge (resists saturation)
- 15% Humanity’s Last Exam – extremely difficult academic reasoning
- 15% GPQA Diamond – PhD-level science questions
- 10% ARC-AGI – abstract reasoning and problem-solving
- 15% Chatbot Arena Elo – human preference (crowdsourced rankings)
- 10% RULER – long-context robustness (32k–128k tokens)
- 10% EQBench – emotional intelligence and creative quality
Why This Mix?
- Reduces gaming and contamination (avoids relying on easy, memorized benchmarks like vanilla MMLU).
- Captures multiple capability dimensions: coding, reasoning, long-context, human preference, and creativity.
- Harder benchmarks are less saturated, making score differences meaningful.
Calculation:
- Normalize all 8 benchmark scores to a 0–100 scale.
- Compute a weighted composite score for each model.
- Divide the composite score by a blended API cost (input and output prices blended at a 3:1 input:output token ratio).
- Rank by efficiency index (higher = better value); a code sketch of these steps follows this list.
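The sketch below illustrates the calculation steps in Python. It assumes benchmark scores have already been normalized to 0–100 and that prices are quoted per token unit in USD; the weight values are copied from the benchmark list above (and renormalized to sum to 1, since the listed percentages total 110%), while all field names, model scores, and prices are placeholders rather than the index's actual data or schema.

```python
# Illustrative sketch of the Efficiency Index calculation (not the official script).
# Assumes benchmark scores are already normalized to a 0-100 scale; all names
# and numbers below are placeholders.

WEIGHTS = {
    "swe_bench": 0.20,      # real-world coding (repo-level bug fixes)
    "mmlu_pro": 0.15,       # harder general knowledge
    "hle": 0.15,            # Humanity's Last Exam
    "gpqa_diamond": 0.15,   # PhD-level science questions
    "arc_agi": 0.10,        # abstract reasoning
    "arena_elo": 0.15,      # human preference (normalized Elo)
    "ruler": 0.10,          # long-context robustness
    "eqbench": 0.10,        # emotional intelligence / creative quality
}
# The listed weights sum to 1.10, so renormalize to sum to 1 (an assumption).
_total = sum(WEIGHTS.values())
WEIGHTS = {name: w / _total for name, w in WEIGHTS.items()}


def blended_cost(input_price: float, output_price: float) -> float:
    """Blend input and output API prices at the stated 3:1 input:output ratio."""
    return (3 * input_price + 1 * output_price) / 4


def efficiency_index(scores: dict[str, float],
                     input_price: float,
                     output_price: float) -> float:
    """Weighted composite score (0-100) divided by blended API cost."""
    composite = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return composite / blended_cost(input_price, output_price)


# Placeholder example (made-up numbers, not taken from the table above):
example_scores = {
    "swe_bench": 70.0, "mmlu_pro": 80.0, "hle": 20.0, "gpqa_diamond": 75.0,
    "arc_agi": 30.0, "arena_elo": 85.0, "ruler": 90.0, "eqbench": 78.0,
}
print(round(efficiency_index(example_scores, input_price=0.28, output_price=0.42), 2))
```

Because the index divides by cost rather than scaling it, the absolute values depend on the price units chosen; only the relative ranking between models is meaningful.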
Coverage:
- Includes only models with complete or near-complete data across all 8 metrics.
- Excludes enterprise/niche models (Cohere, AI21, Baichuan) due to incomplete benchmark coverage or opaque pricing.
- All models are 2025 releases with public pricing and APIs.

