Superhuman: Advanced math-reasoning benchmarks & research agents

Superhuman gathers DeepMind's benchmarks, datasets, and agent examples for advanced math reasoning, enabling evaluation on short-answer, proof, and grading tasks. Parts of it depend on closed models and the repo shows limited activity, so users should watch for reproducibility and maintenance risks.

GitHub: google-deepmind/superhuman | Updated: 2026-02-14 | Branch: main | Stars: 371 | Forks: 26

Mathematical reasoning | Benchmark datasets | Research agents | Evaluation & auto-grading

💡 Deep Analysis

Why use large language models with an iterative generate–verify–revise architecture? What are the technical advantages?

Core Analysis

Motivation: Using an LLM with a generate→verify→revise loop balances LLMs’ strong generative ability with the need for proof correctness—LLMs produce candidate proofs quickly while verification steps find and correct errors.

Technical Features

Generative Strength: Large models (e.g., Gemini) are effective at producing structured proof drafts from natural language.
Verification Loop: Introducing verification (self-checking, heuristic checks, or external verifiers) helps catch semantic/logic flaws and reduce hallucinations.
Modularity: Separating benchmarks, agent, and evaluation allows swapping verification components (e.g., symbolic solvers or formal provers).
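
The loop described above can be sketched as follows. This is a minimal illustration, not code from the repository: `generate_verify_revise`, `Verdict`, and the toy generator and verifier are hypothetical names, and the toy generator simply stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def generate_verify_revise(
    problem: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    verifiers: List[Callable[[str, str], Verdict]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Return (accepted draft or None, feedback history)."""
    draft: Optional[str] = None
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_rounds):
        # Generate a first draft, or revise the previous one using feedback.
        draft = generate(problem, draft, feedback)
        verdicts = [check(problem, draft) for check in verifiers]
        failures = [v.feedback for v in verdicts if not v.ok]
        if not failures:
            return draft, history
        feedback = "; ".join(failures)
        history.append(feedback)
    return None, history


def toy_generate(problem, previous, feedback):
    # Hypothetical stand-in for an LLM: wrong first draft, correct revision.
    return "x = 3" if feedback is None else "x = 2"


def toy_verify(problem, draft):
    # Independent check: substitute the claimed value back into 2x = 4.
    x = int(draft.split("=")[1])
    if 2 * x == 4:
        return Verdict(True)
    return Verdict(False, f"substituting x = {x} gives 2x = {2 * x}, expected 4")


proof, log = generate_verify_revise("Solve 2x = 4", toy_generate, [toy_verify])
print(proof, log)
```

Because the verifiers are passed in as a list, a heuristic check can later be replaced by a symbolic solver or a formal prover without touching the loop itself.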

Usage Recommendations

Build Verification First: When reproducing Aletheia, implement robust verification (unit checks, independent proof validation, adversarial tests).
Layered Integration: Treat LLM outputs as drafts and combine them with symbolic tools or theorem libraries for greater formal correctness.

Caveats

Dependence on large models: Matching the reported generation quality is difficult without access to comparably capable LLMs.
Verification is not fully formal: Current checks are often heuristic or model-based and do not replace formal proof assistants.

Important Notice: For machine-verifiable proofs, integrate a formal verifier into the iterative loop or serialize outputs into verifiable proof scripts.

Summary: The architecture pragmatically leverages LLM generation and verification to improve robustness—suitable for advancing competition-level mathematical reasoning research.

What practical challenges do researchers face reproducing Aletheia’s workflow, and how can they lower the barrier?

Core Analysis

Main Challenges: Researchers reproducing Aletheia typically face three issues: limited access to models and compute, the complexity of implementing verification, and disagreement between automated evaluators and human judgment.

Technical Analysis

Model Availability: Aletheia is Gemini-powered. If Gemini is inaccessible, using alternative cloud APIs or strong open-source LLMs is necessary but will introduce behavioral differences.
Verification Overhead: A robust verify step may need heuristics, symbolic tools, or formal interfaces—these are engineering-heavy and compute-intensive.
Evaluation Consistency: IMO-GradingBench’s 1,000 human gradings help calibrate automated graders, but human subjectivity remains—human-in-the-loop checks are essential.

Practical Recommendations

Reproduce Examples First: Use prompts/outputs in the README as a baseline; replicate a few examples to ensure process and metrics match.
Alternative Model Strategy: If Gemini is unavailable, test both a strong closed-source API and a high-quality open-source model and document differences.
Layered Verification: Start with lightweight heuristics (step completeness, equation checks) and progressively integrate symbolic or formal tools.

Caveats

Log prompt, seed, temperature, and model versions for diagnostics.
Maintain human review, especially for proofs, even when automated verification steps are in place.
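
One way to make the logging advice concrete is a small run record serialized as one JSON line per generation. This is a sketch under assumptions: the `RunRecord` fields and the model name are hypothetical and not a format used by the repository.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    problem_id: str
    model: str          # exact model/version string reported by the provider
    prompt: str
    seed: int
    temperature: float
    output: str

    def to_json_line(self) -> str:
        payload = asdict(self)
        # A short prompt hash makes silent prompt drift easy to spot in diffs.
        digest = hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()
        payload["prompt_sha256"] = digest[:12]
        return json.dumps(payload, sort_keys=True)


# Hypothetical values, for illustration only.
record = RunRecord(
    problem_id="example-001",
    model="hypothetical-model-v1",
    prompt="Prove that the sum of two even integers is even.",
    seed=0,
    temperature=0.2,
    output="draft proof text",
)
line = record.to_json_line()
restored = json.loads(line)
print(line)
```

Appending one such line per run to a log file keeps every result traceable to the prompt, seed, temperature, and model version that produced it.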

Important Notice: Model access and compute are the main practical barriers—validate methodology on a small scale before scaling.

Summary: Reuse public prompts/outputs, adopt alternative models, implement layered verification, and retain human review to reproduce Aletheia’s approach within constrained resources.

How can the project’s generate–verify–revise workflow be combined with formal proof tools or symbolic computation to improve proof verifiability?

Core Analysis

Goal: Combine Aletheia’s generative strengths with formal/symbolic tools to convert natural-language or semi-structured proofs into machine-verifiable proofs.

Technical Approach (Steps)

Extraction & Structuring: Parse LLM outputs into propositions, lemmas, and key derivation steps, producing a structured intermediate representation (e.g., JSON or tagged steps).
Symbolic Verification Layer: Invoke a computer algebra system (e.g., SymPy, SageMath) to validate algebraic or geometric computations and equality transformations in substeps that can be checked mechanically.
Formal Translation: Map verified steps into Lean/Coq/Isabelle assertions or proof scripts; use automation tactics or manual edits to complete proofs.
Closed-loop Feedback: Feed verifier counterexamples or failure reasons back into the LLM prompt to trigger revision steps.
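
The extraction, checking, and feedback steps above can be illustrated with a lightweight stand-in for the symbolic layer. The sketch below is an assumption-laden example, not the project's method: the `Step` representation is hypothetical, the steps are written by hand rather than parsed from LLM output, and exact rational spot checks from the standard library replace a real CAS. Passing such a check is only a sanity test, not a proof of the identity.

```python
import random
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List, Optional


@dataclass
class Step:
    label: str
    claim: str  # human-readable statement of the claimed equality
    lhs: Callable[[Fraction, Fraction], Fraction]
    rhs: Callable[[Fraction, Fraction], Fraction]


def spot_check(step: Step, trials: int = 50, seed: int = 0) -> Optional[str]:
    """Return None if lhs == rhs at every sampled point, else a counterexample message."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Exact rational arithmetic avoids floating-point false alarms.
        a = Fraction(rng.randint(-20, 20), rng.randint(1, 9))
        b = Fraction(rng.randint(-20, 20), rng.randint(1, 9))
        if step.lhs(a, b) != step.rhs(a, b):
            return f"{step.label}: '{step.claim}' fails at a={a}, b={b}"
    return None


# Hand-written intermediate representation of two claimed steps.
steps: List[Step] = [
    Step("S1", "(a+b)^2 = a^2 + 2ab + b^2",
         lambda a, b: (a + b) ** 2,
         lambda a, b: a * a + 2 * a * b + b * b),
    Step("S2", "(a+b)^2 = a^2 + b^2",
         lambda a, b: (a + b) ** 2,
         lambda a, b: a * a + b * b),
]

failures = [msg for msg in (spot_check(s) for s in steps) if msg is not None]
# Failure messages become the feedback that triggers the next revision.
revision_prompt = "Revise the following steps:\n" + "\n".join(failures)
print(revision_prompt)
```

The counterexample string is the actionable part: it names the failing step and a concrete point, which is the kind of feedback a revision prompt can use.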

Usage Recommendations

Integrate Incrementally: Start with lightweight symbolic checks (numeric/algebraic), then progress to formal mapping into proof assistants.
Design Clear Interfaces: Define a compact intermediate representation and error taxonomy so verifiers can return actionable feedback.

Caveats

Automatic translation to formal scripts is imperfect and may require human intervention.
Formal verification increases engineering complexity and computational cost.

Important Notice: If high-assurance proofs are the goal, incorporate verification interfaces and human review early in the pipeline.

Summary: A layered extract–symbolic-verify–formalize–feedback approach ties the project’s generative capabilities to formal verification, substantially improving proof verifiability and trust.

How should one assess the statistical validity of the benchmarks (IMO-AnswerBench/ProofBench) provided by the project for model generalization studies?

Core Analysis

Statistical Status: IMO-AnswerBench (400 problems) provides a reasonable basis for statistical analysis of short-answer performance, while IMO-ProofBench (60 proofs) is small for stable evaluation of proof generation or generalization, so results on it can show high variance.

Technical Analysis

Sample Size Effects: 400 short-answer items can support confidence intervals and hypothesis tests if difficulty and topic distributions are balanced. 60 proofs are usually insufficient to reliably evaluate model generalization across proof structures or reasoning strategies.
Source of Bias: Expert vetting improves quality but may introduce reviewer bias affecting external validity.

Evaluation Recommendations (Practical)

Confidence Intervals & Power Analysis: Report confidence intervals, p-values, and power analyses rather than only mean accuracy.
Clustered Evaluation: Evaluate by clusters (type/difficulty/math subfield) to reveal model weaknesses.
Augment Data & Transfer Tests: Use external datasets or additional curated proofs to validate robustness of conclusions.
Repeated Trials & Seed Control: Fix random seeds and log prompts/hyperparameters to reduce experimental variance for generative models.
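
To see what the sample sizes mean for reported accuracy, the sketch below computes a Wilson score interval for a binomial proportion at n = 400 and n = 60. The 50% accuracy figures are hypothetical inputs chosen for illustration, not results from the benchmarks, and the interval treats problems as independent trials.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical scores: 50% accuracy on each benchmark.
answer_lo, answer_hi = wilson_interval(200, 400)  # IMO-AnswerBench size
proof_lo, proof_hi = wilson_interval(30, 60)      # IMO-ProofBench size
print(f"n=400: [{answer_lo:.3f}, {answer_hi:.3f}]")
print(f"n=60:  [{proof_lo:.3f}, {proof_hi:.3f}]")
```

At 50% accuracy the interval is roughly ±5 percentage points for 400 items and roughly ±12 points for 60 items, so two models a few points apart on the proof benchmark cannot be separated on that evidence alone.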

Caveats

Do not draw broad generalizations from 60 proof items alone—state sample limitations explicitly.
Use IMO-GradingBench to calibrate graders and report inter-rater reliability where possible.
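
For the reliability reporting mentioned above, one common statistic is Cohen's kappa between an automated grader and human grades. The sketch below is a minimal standard-library implementation; the grade labels and the eight sample gradings are made up for illustration and do not come from IMO-GradingBench.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must grade the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical gradings of the same eight proofs.
human = ["correct", "correct", "incorrect", "partial",
         "correct", "incorrect", "incorrect", "partial"]
auto = ["correct", "correct", "incorrect", "incorrect",
        "correct", "incorrect", "partial", "partial"]

kappa = cohens_kappa(human, auto)
print(f"raw agreement = {6 / 8:.2f}, kappa = {kappa:.3f}")
```

Here raw agreement is 0.75 while kappa is about 0.62, which shows why chance-corrected agreement is worth reporting alongside plain accuracy against human grades.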

Important Notice: In publications, state the benchmark size limits explicitly and report the robustness checks performed, to avoid overgeneralization.

Summary: IMO-AnswerBench is a moderate-scale benchmark for short answers; IMO-ProofBench is small for proofs and should be supplemented with more data and rigorous statistical controls to support generalization claims.

✨ Highlights

Includes multiple high-level mathematical evaluation benchmarks and datasets
Research artifacts and examples linked to papers and reported IMO achievements
Software is licensed under Apache-2.0 and other materials under CC-BY-4.0
No releases or notable code-contribution history in the repo; reproducibility may be limited
Some work depends on closed models (e.g., Gemini Deep Think), impacting reproducibility

🔧 Engineering

High-difficulty math benchmarks such as IMO-AnswerBench, IMO-ProofBench, and IMO-GradingBench
Aletheia: example math-research agent driven by Gemini with provided prompts and outputs
Aggregates AlphaGeometry series and related research data for comparative and follow-up studies

⚠️ Risks

Code and contributor metrics indicate low activity; long-term maintenance is uncertain
Dependence on closed large models or proprietary services limits reproducibility and commercial deployment
No formal release found; documentation and examples may lack runnable engineering instructions

👥 Who is it for?

AI/ML researchers and benchmark developers in mathematical reasoning
University instructors, math-competition trainers, and automatic-grading researchers
Experiment reproducers and engineering teams that may rely on closed-model integrations