1. DeepSeek V3.2-Exp (Sep 2025) — 69.26
  2. Kimi K2 Thinking (Nov 2025) — 66.19
  3. Gemini 2.5 Flash (May 2025) — 58.73
  4. Qwen 3 Max (Jul 2025) — 55.56
  5. Grok 4.1 (Nov 2025) — 22.18
  6. GPT-5 (Aug 2025) — 21.25
  7. o3 (Apr 2025) — 20.39
  8. Gemini 2.5 Pro (Mar 2025) — 19.98
  9. Gemini 3 Pro (Nov 2025) — 19.82
  10. Claude 3.5 Sonnet (Aug 2025) — 10.17
  11. GPT-5 Pro (Aug 2025) — 1.96
  • jaykrown@lemmy.world (OP, M)
    2 months ago

    AI Model Efficiency Index 2.1 — Methodology Summary

    Goal: Rank AI models by real-world value (performance per dollar) using harder, less-contaminated benchmarks.

    Benchmarks Used (8 metrics; the weights as listed total 110%, not 100%):

    • 20% SWE-bench – real-world coding tasks (repo-level bug fixes)
    • 15% MMLU-Pro – harder general knowledge (resists saturation)
    • 15% Humanity’s Last Exam – extremely difficult academic reasoning
    • 15% GPQA Diamond – PhD-level science questions
    • 10% ARC-AGI – abstract reasoning and problem-solving
    • 15% Chatbot Arena Elo – human preference (crowdsourced rankings)
    • 10% RULER – long-context robustness (32k–128k tokens)
    • 10% EQBench – emotional intelligence and creative quality
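
The weight table above can be written down and sanity-checked in a few lines. This is a minimal sketch, assuming the percentages are used exactly as listed; the dictionary keys and the `renormalize` helper are illustrative names, not part of the index itself. Rescaling the weights so they sum to 1 keeps the composite on the same 0–100 scale as the inputs, whatever the listed total is.

```python
# Weights (percent) exactly as listed in the methodology summary.
WEIGHTS_PCT = {
    "SWE-bench": 20,
    "MMLU-Pro": 15,
    "Humanity's Last Exam": 15,
    "GPQA Diamond": 15,
    "ARC-AGI": 10,
    "Chatbot Arena Elo": 15,
    "RULER": 10,
    "EQBench": 10,
}


def renormalize(weights):
    """Rescale weights so they sum to 1.0 (hypothetical helper)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


total_pct = sum(WEIGHTS_PCT.values())  # total of the listed percentages
weights = renormalize(WEIGHTS_PCT)     # fractions that sum to 1.0
```

With the rescaled weights, a model scoring 100 on every benchmark gets a composite of 100 rather than 110.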

    Why This Mix?

    • Reduces gaming and contamination (avoids relying on easy, memorized benchmarks like vanilla MMLU).
    • Captures multiple capability dimensions: coding, reasoning, long-context, human preference, and creativity.
    • Harder benchmarks are less saturated, making score differences meaningful.

    Calculation:

    1. Normalize all 8 benchmark scores to a 0–100 scale.
    2. Compute the weighted composite score for each model.
    3. Divide the composite score by the blended API cost (input and output token prices averaged at a 3:1 input:output token ratio).
    4. Rank by efficiency index (higher = better value).
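
The four steps above can be sketched as follows. This is a rough sketch under stated assumptions, not the actual implementation behind the index. Min-max normalization across the compared models is only one possible reading of "normalize to 0–100". The blended cost is assumed to be a 3:1 weighted average of input and output prices. All function names, model names, scores, weights, and prices in the example are made up for illustration.

```python
def min_max_normalize(raw):
    """Map raw scores {model: value} onto 0-100 (assumed normalization)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All models tie on this benchmark; give everyone the same score.
        return {model: 100.0 for model in raw}
    return {model: 100.0 * (v - lo) / (hi - lo) for model, v in raw.items()}


def blended_cost(input_price, output_price, ratio=3.0):
    """Average token price at a ratio:1 input:output mix (assumed formula)."""
    return (ratio * input_price + output_price) / (ratio + 1.0)


def efficiency_ranking(scores, weights, prices):
    """Rank models by composite score divided by blended cost.

    scores:  {benchmark: {model: raw_score}}
    weights: {benchmark: weight}, rescaled internally to sum to 1
    prices:  {model: (input_price, output_price)}
    """
    total_w = sum(weights.values())
    normalized = {b: min_max_normalize(s) for b, s in scores.items()}
    index = {}
    for model, (p_in, p_out) in prices.items():
        composite = sum(weights[b] * normalized[b][model] for b in weights) / total_w
        index[model] = composite / blended_cost(p_in, p_out)
    # Higher index = better value, so sort descending.
    return sorted(index.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical inputs: two benchmarks, three models.
scores = {
    "bench_a": {"model_x": 40.0, "model_y": 80.0, "model_z": 60.0},
    "bench_b": {"model_x": 1300.0, "model_y": 1200.0, "model_z": 1250.0},
}
weights = {"bench_a": 60, "bench_b": 40}
prices = {"model_x": (1.0, 5.0), "model_y": (4.0, 20.0), "model_z": (2.0, 6.0)}

ranking = efficiency_ranking(scores, weights, prices)
```

In this toy data, `model_y` has the highest composite score but ranks last, because its blended cost is four times that of `model_x`. The same effect explains how a cheap model can sit far above a stronger but pricier one in a ranking of this kind.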

    Coverage:

    • Includes only models with complete or near-complete data across all 8 metrics.
    • Excludes enterprise/niche models (Cohere, AI21, Baichuan) due to incomplete benchmark coverage or opaque pricing.
    • All models are 2025 releases with public pricing and APIs.