Parameter Golf: A parameter-efficiency benchmark and challenge for 16MB models

An open benchmark and challenge to maximize language-model performance under two hard limits: a 16MB artifact and 10 minutes of training. It encourages compression, quantization, and architectural innovation, and is backed by a leaderboard and compute grants.

GitHub: openai/parameter-golf | Updated: 2026-04-17 | Branch: main | Stars: 4.8K | Forks: 3.2K

Tags: model training, parameter-efficiency, benchmark/leaderboard, compression & quantization

💡 Deep Analysis

What specific problem does this project solve, and what are its core objectives and metrics?

Core Analysis

Project Positioning: Parameter Golf treats two constraints as first-class: a “≤16MB artifact” and “≤10 minutes training on 8×H100s”. Models are evaluated by bits-per-byte on the FineWeb validation set, a tokenizer-agnostic compression metric.

Technical Features

Explicit dual constraints: Quantifies artifact size and training time to drive trade-offs between parameter efficiency and training speed.
Full-pipeline optimization: Architecture (parameter tying, depth recurrence), training tricks (EMA/SWA, warmdown, TTT), and compression (GPTQ, QAT, int5/int6, zstd) are designed jointly rather than in isolation.
Reproducible baseline & process: PR/leaderboard and compute grants lower the barrier to producing repeatable engineering improvements.

Practical Recommendations

Reproduce baseline first: Start from provided baselines/PRs to ensure you meet the 16MB and short-training constraints.
Stage optimizations: Keep training stable first (use higher precision or QAT), then apply GPTQ/post-training quantization and self-generated calibration data.
Metric-driven changes: Always measure change via FineWeb tokenizer-agnostic bits-per-byte to avoid tokenizer confounds.

Important Notice: Aggressive quantization (int5/1-bit) without thorough calibration can catastrophically degrade performance.

Summary: The project formalizes “optimal modeling under parameter constraints,” making it well-suited for teams exploring parameter-efficient architectures and end-to-end compression pipelines.


How to design a quantization and packaging pipeline that preserves performance while ensuring the 16MB artifact limit?

Core Analysis

Core Issue: How to pack a model into ≤16MB without losing excessive performance under extreme compression.

Recommended Quantization & Packaging Pipeline

1. Training stage: secure a stable baseline
   - Train at higher precision or use QAT to reduce later quantization error.
   - Use EMA/SWA and appropriate LR schedules to boost short-run generalization.
2. Post-training quantization: layered and mixed-bit strategy
   - Validate at int8 first, then progressively try int6/int5 per layer.
   - Keep higher bitwidths for sensitive layers (embeddings, projections) and compress the remaining layers more aggressively (even 1-bit/ternary in extreme cases).
   - Use GPTQ for post-training quantization, and apply Hessian-aware or calibration-aware treatments to sensitive layers (e.g., SDClip).
3. Calibration and fine-tuning
   - Use self-generated calibration samples (matching FineWeb style) and optionally fine-tune for a small number of steps to recover quantization losses.
4. Encoding and packaging
   - Encode quantized weights efficiently and compress with zstd (tune the level for size vs. decompression cost).
   - Audit the embedding/vocab byte footprint; consider hashed embeddings or vocab pruning.
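The quantize, encode, and compress steps above can be sketched in a few lines. This is a minimal illustration, not the repository's code: it assumes plain symmetric per-tensor rounding (a far simpler stand-in for GPTQ), uses random Gaussian values in place of real weights, and substitutes the standard library's `zlib` for zstd so the sketch has no third-party dependencies. All function names are hypothetical.

```python
import random
import zlib


def quantize_symmetric(weights, bits):
    """Round floats to signed `bits`-bit codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for int6
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def pack_bits(codes, bits):
    """Pack signed codes densely, `bits` bits per code, most significant bit first."""
    offset = 1 << (bits - 1)                        # shift codes into unsigned range
    out = bytearray()
    acc, nacc = 0, 0                                # bit accumulator and its fill level
    for c in codes:
        acc = (acc << bits) | (c + offset)
        nacc += bits
        while nacc >= 8:                            # flush complete bytes
            nacc -= 8
            out.append((acc >> nacc) & 0xFF)
        acc &= (1 << nacc) - 1                      # keep only the unflushed bits
    if nacc:                                        # zero-pad the final partial byte
        out.append((acc << (8 - nacc)) & 0xFF)
    return bytes(out)


# Stand-in data: random values, NOT real model weights.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

codes, scale = quantize_symmetric(weights, bits=6)
packed = pack_bits(codes, bits=6)
blob = zlib.compress(packed, level=9)               # zlib stands in for zstd here
max_err = max(abs(w - r) for w, r in zip(weights, dequantize(codes, scale)))

print(f"packed={len(packed)} B, compressed={len(blob)} B, max_err={max_err:.6f}")
```

Dense bit-packing matters at int5/int6 because storing each code in a full byte would waste a quarter or more of the budget before the entropy coder even runs.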

Practical Tips

Layer-wise evaluation: Measure bits-per-byte and final artifact size after each change and log trade-offs.
Mixed-bit is more robust: Mixed precision usually preserves critical capacity while saving space over a single uniform bitwidth.
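To make the mixed-bit trade-off concrete, the following sketch estimates packed size for a per-layer bit plan. The layer names and parameter counts are invented for illustration, the 16,000,000-byte limit is an assumption about how "16MB" is counted, and the estimate ignores quantization scales, code size, and any gain from entropy coding.

```python
LIMIT_BYTES = 16_000_000  # assumption: decimal megabytes; check the challenge rules


def packed_bytes(num_params, bits):
    """Bytes needed to store `num_params` codes at `bits` bits each, densely packed."""
    return (num_params * bits + 7) // 8


def artifact_estimate(layers, bit_plan, default_bits):
    """Sum packed sizes, using `default_bits` for layers missing from `bit_plan`."""
    return sum(
        packed_bytes(count, bit_plan.get(name, default_bits))
        for name, count in layers.items()
    )


# Hypothetical model: parameter counts are made up for this example.
layers = {"embed": 2_000_000, "attn": 8_000_000, "mlp": 12_000_000}

uniform_int6 = artifact_estimate(layers, {}, default_bits=6)
mixed = artifact_estimate(layers, {"embed": 8, "attn": 6, "mlp": 5}, default_bits=6)

for label, size in [("uniform int6", uniform_int6), ("mixed 8/6/5", mixed)]:
    status = "fits" if size <= LIMIT_BYTES else "over budget"
    print(f"{label}: {size:,} B ({status})")
```

In this made-up configuration the mixed plan keeps the embeddings at int8 and still fits, while uniform int6 does not. Whether the extra bits on the embeddings are worth the lost bits elsewhere is an empirical question that only a bits-per-byte measurement can answer.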

Important Notice: Aggressive quantization (uncalibrated int5/1-bit) is risky. Validate on small tests first and keep a rollback path.

Summary: A staged pipeline (train→layered quant→calibration→compress) with mixed-bit strategies and self-calibration is the practical path to ≤16MB with preserved performance.


Why focus evaluation on FineWeb tokenizer-agnostic bits-per-byte? What are the technical strengths and limitations of this metric?

Core Analysis

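Core Issue: Bits-per-byte normalizes the model's loss by the number of raw text bytes rather than the number of tokens, so models with different tokenizers are scored on the same scale. This centers evaluation on how well the model predicts (and therefore compresses) FineWeb text at the byte level.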

Technical Strengths

Removes tokenizer confounds: Makes comparisons across embedding/tokenization strategies fairer, preventing spurious gains from tokenizer tweaks.
Direct relation to compression: Bits-per-byte is the average code length, per byte of text, that an entropy coder driven by the model's predictions would need, so it fits naturally alongside the artifact-size trade-offs.
Unified benchmark: Enables meaningful cross-comparison of architecture/quantization/packaging changes.
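The conversion from summed token-level cross-entropy to bits-per-byte is a one-line formula: divide the total negative log-likelihood (in nats) by ln 2 times the number of UTF-8 bytes in the evaluated text. The sketch below uses made-up token counts and losses for two hypothetical tokenizers; it is not the leaderboard's evaluation script.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed negative log-likelihood (nats) over a text into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


# Made-up example: the same 1000-byte text scored by two hypothetical models.
text_bytes = 1000

# Model A: coarse tokenizer, 250 tokens, mean loss 2.8 nats per token.
bpb_a = bits_per_byte(250 * 2.8, text_bytes)

# Model B: fine tokenizer, 400 tokens, mean loss 1.75 nats per token.
bpb_b = bits_per_byte(400 * 1.75, text_bytes)

print(f"A: {bpb_a:.4f} bpb, B: {bpb_b:.4f} bpb")
```

Model B looks much better on per-token loss (1.75 vs 2.8 nats), yet both models spend the same total number of nats on the same bytes, so their bits-per-byte are equal. This is the tokenizer confound the metric removes.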

Limitations & Risks

Distribution bias: The FineWeb corpus biases what is measured; models may optimize for that distribution rather than general language ability.
Ignores downstream tasks: Bits-per-byte is a low-level LM metric and does not directly translate to QA, summarization, or classification performance.
Complex interactions: Strategies like hashed embeddings or custom tokenizers interact with this metric; interpretation requires care.

Practical Recommendations

Treat bits-per-byte as primary but not sole metric: Run at least one cross-corpus or downstream quick validation before submission.
Ablate tokenizer/embedding hacks: Confirm gains stem from model improvements, not only tokenization changes.

Important Notice: Do not make production decisions solely on FineWeb bits-per-byte if your target domain differs from FineWeb.

Summary: The metric is appropriate for parameter-constrained comparisons but should be paired with broader generalization checks.


Under the ≤16MB artifact and ≤10min training constraints, which architecture designs and training tricks are most effective and why?

Core Analysis

Core Issue: With tight parameter and brief training budgets, how do you maximize modeling capacity per parameter?

Effective Architectural Choices

Parameter sharing / depth recurrence: Reuses weights across depth to emulate deeper networks without adding parameters.
Long context / sparse strategies (e.g., SP8192): Increase effective context or use bucketing to improve byte-level prediction efficiency instead of scaling parameters.
Parallel residuals / QK-Gain: Micro-architectural tweaks to reallocate computation/channel capacity.
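Depth recurrence from the list above can be illustrated with a toy residual block: one set of weights is applied several times in sequence, so effective depth grows while the stored parameter count stays fixed. This is a pure-Python sketch with an invented block (`x + tanh(Wx)`), not a transformer layer and not the architecture of any leaderboard entry.

```python
import math
import random


def make_block(dim, rng):
    """Create one dim x dim weight matrix with a simple 1/sqrt(dim) init."""
    std = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """Toy residual block: x + tanh(W x)."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def forward_recurrent(shared, x, n_loops):
    """Apply the SAME block n_loops times (depth recurrence)."""
    for _ in range(n_loops):
        x = apply_block(shared, x)
    return x


def forward_stacked(blocks, x):
    """Apply a list of blocks in order (conventional, unshared depth)."""
    for weights in blocks:
        x = apply_block(weights, x)
    return x


def count_params(blocks):
    """Count stored weights across a list of distinct blocks."""
    return sum(len(row) for weights in blocks for row in weights)


rng = random.Random(0)
dim, depth = 8, 4
shared = make_block(dim, rng)
stacked = [make_block(dim, rng) for _ in range(depth)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

y_recurrent = forward_recurrent(shared, x, n_loops=depth)
print("recurrent params:", count_params([shared]))   # one block stored
print("stacked params:  ", count_params(stacked))    # `depth` blocks stored
```

Both networks run four block applications per input, but the recurrent one stores a quarter of the weights. The price is that every loop must make do with the same transformation, which is why the source pairs the idea with other tweaks rather than using it alone.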

Key Training Tricks

EMA / SWA: Boosts generalization in short training windows and reduces hyperparameter sensitivity.
Careful LR schedule & warmdown: Proper decay accelerates stable convergence under tight time budgets.
Test-Time Training (TTT): Provides adaptive gains at eval time; confirm that the way you apply it complies with the challenge's evaluation rules.
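Two of the tricks above are simple enough to sketch directly: a learning-rate schedule with a warmdown phase, and an exponential moving average (EMA) of the weights. The sketch assumes "warmdown" means a linear decay to zero over the final steps and uses a linear warmup; the step counts and decay value are placeholders, not tuned settings from the repository.

```python
def lr_at(step, total_steps, peak_lr, warmup_steps, warmdown_steps):
    """Linear warmup, constant plateau, then linear warmdown toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_steps
    if step >= warmdown_start:
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr


def ema_update(ema, weights, decay):
    """Blend current weights into the running average: ema = d*ema + (1-d)*w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Placeholder schedule: 100 steps, 10 warmup, 20 warmdown.
schedule = [lr_at(s, 100, peak_lr=1.0, warmup_steps=10, warmdown_steps=20)
            for s in range(100)]

# EMA over a noisy scalar "weight" that oscillates around 1.0.
ema = [0.0]
for step in range(200):
    noisy_weight = [1.0 + (0.5 if step % 2 == 0 else -0.5)]
    ema = ema_update(ema, noisy_weight, decay=0.95)

print(f"lr[0]={schedule[0]:.2f} lr[50]={schedule[50]:.2f} lr[99]={schedule[99]:.2f}")
print(f"ema after 200 noisy steps: {ema[0]:.3f}")
```

The EMA copy is typically the one evaluated and quantized, since it smooths out the step-to-step noise that a short, high-learning-rate run leaves in the raw weights.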

Compression Pipeline

Staged compression: Train stably (possibly with QAT), then apply GPTQ/mixed-bit post-training quantization and calibrate with self-generated data.
Encoding compression (zstd) & embedding compression (hashing/reduced vocab): Final artifact packing matters as much as quantization.

Practical Recommendations

Reproduce validated combos first (e.g., SP8192 + depth recurrence + GPTQ embeddings) and iterate.
Prioritize stability under short budgets: use EMA/SWA and many small quick experiments.

Important Notice: Aggressive quantization must be calibrated and validated incrementally or it can catastrophically harm performance.

Summary: Parameter reuse and efficient context strategies, combined with robust training schedules and staged quantization, are the most dependable route to strong performance under the challenge constraints.


What common practical issues will I face when using the repository and reproducing leaderboard results, and how do I debug/fix them?

Core Analysis

Core Issue: The complexity of the training→quantization→calibration→packaging pipeline and its sensitivity to hyperparameters cause reproducibility problems and a set of recurring failure modes.

Common Issues & Debug Steps

Post-quantization collapse: Validate with int8 first, then check the GPTQ calibration sample size and distribution; use self-generated calibration data when applicable.
Training instability/non-convergence: Inspect LR schedule, warmup/warmdown, weight decay, and EMA settings; under short budgets use more conservative step sizes and decay.
Evaluation mismatch / tokenizer errors: Ensure you run the exact FineWeb tokenizer-agnostic evaluation scripts used by the leaderboard to avoid spurious gains.
Artifact >16MB: Audit embedding/vocab sizes, enable zstd compression, and consider mixed-bit strategies (e.g., some layers at int6 and others at int5).
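For the artifact-size failure mode, a small audit that measures the files actually being submitted is often quicker than reasoning about parameter counts. The sketch below is an assumption-laden helper: the file names are hypothetical, the demo limit is tiny for illustration, and which files count toward the real 16MB limit should be taken from the challenge rules.

```python
import os
import tempfile


def check_budget(paths, limit_bytes):
    """Measure on-disk sizes and report usage against a byte limit."""
    sizes = {os.path.basename(p): os.path.getsize(p) for p in paths}
    used = sum(sizes.values())
    return {
        "used": used,
        "limit": limit_bytes,
        "headroom": limit_bytes - used,          # negative when over budget
        "ok": used <= limit_bytes,
        "largest_first": sorted(sizes.items(), key=lambda kv: kv[1], reverse=True),
    }


# Demo with throwaway files; names and sizes are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "weights.bin")
    code_path = os.path.join(tmp, "train.py")
    with open(weights_path, "wb") as f:
        f.write(b"\0" * 1000)
    with open(code_path, "wb") as f:
        f.write(b"#" * 500)
    report = check_budget([weights_path, code_path], limit_bytes=1200)

print(report)
```

Sorting by size points at the first thing to shrink, which in small models is frequently the embedding table rather than the transformer blocks.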

Practical Fixes

Stage reproduction: Reproduce baseline checkpoints first, then apply post-training quantization, and finally package and measure bits-per-byte.
Incremental aggression: Change only one factor at a time (quantization or architecture) and validate quickly to isolate effects.
Strict version/config management: Lock code, deps, data splits, and evaluation scripts; keep training logs and calibration samples for comparison.
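One lightweight way to enforce the config discipline described above is to fingerprint every run's configuration and store the fingerprint next to the logs and the resulting score. The sketch below is one possible convention, not something the repository prescribes; the config keys are invented.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a short, order-independent hash of a JSON-serializable config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical run configs: the keys and values are placeholders.
run_a = {"lr": 0.02, "ema_decay": 0.999, "quant_bits": 6, "seed": 0}
run_a_reordered = {"seed": 0, "quant_bits": 6, "ema_decay": 0.999, "lr": 0.02}
run_b = dict(run_a, quant_bits=5)

print(config_fingerprint(run_a))
print(config_fingerprint(run_b))
```

Because keys are sorted before hashing, two runs with the same settings get the same fingerprint regardless of how the config was assembled, and any single changed value produces a different one. That makes it easy to confirm that two results being compared differ in exactly the factor under test.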

Important Notice: Small hyperparameter tweaks can produce large outcome swings under short budgets—use EMA/SWA and many small experiments.

Summary: Incremental, stage-wise engineering with tight config control lets you locate issues and reliably reproduce leaderboard results.


How to design experiments that avoid overfitting to the FineWeb metric while still achieving strong leaderboard performance?

Core Analysis

Core Issue: How to avoid overfitting to the FineWeb bits-per-byte metric while still achieving competitive leaderboard performance.

Experimental Design Principles

Dual-track validation: For each submission or major change, log the primary metric (FineWeb bits-per-byte) and at least two auxiliary metrics, for example byte-level loss on a different corpus and accuracy on a small downstream task such as classification.
Sliding-window & context robustness tests: Evaluate across varying context lengths to detect overfitting to particular context distributions.
Limit tokenizer-hacking: Avoid spending engineering effort on tokenization/vocab tricks that only boost FineWeb but harm generalization.

Tactical Strategies

Prefer general compression methods (layered mixed-bit, GPTQ calibration) rather than FineWeb-specific heuristics.
Use TTT/LoRA as rule-compliant boosts: Apply them as adaptive evaluation-stage techniques rather than modifying core model weights broadly.
Run strict ablations: Each new technique must have a control comparison and record impact on both primary and auxiliary metrics.
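The ablation rule can be turned into a mechanical gate: accept a change only if the primary metric improves and no auxiliary metric regresses by more than a tolerance. The sketch below assumes every metric is lower-is-better (an accuracy-style metric would need its sign flipped first); the metric names, numbers, and tolerance are all invented for illustration.

```python
def evaluate_change(baseline, candidate, primary="fineweb_bpb", tolerance=0.01):
    """Compare candidate metrics to baseline; all metrics are lower-is-better."""
    primary_gain = baseline[primary] - candidate[primary]
    regressions = {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name != primary and candidate[name] - baseline[name] > tolerance
    }
    return {
        "primary_gain": primary_gain,
        "regressions": regressions,
        "accept": primary_gain > 0 and not regressions,
    }


# Made-up numbers: NOT real results from any model or leaderboard entry.
baseline = {"fineweb_bpb": 1.20, "other_corpus_bpb": 1.35, "code_bpb": 0.95}
overfit = {"fineweb_bpb": 1.18, "other_corpus_bpb": 1.40, "code_bpb": 0.95}
general = {"fineweb_bpb": 1.18, "other_corpus_bpb": 1.352, "code_bpb": 0.95}

verdict_overfit = evaluate_change(baseline, overfit)
verdict_general = evaluate_change(baseline, general)
print("overfit candidate accepted:", verdict_overfit["accept"])
print("general candidate accepted:", verdict_general["accept"])
```

Both candidates improve the primary metric by the same amount, but the first pays for it on another corpus and is rejected. The tolerance exists because auxiliary metrics are noisy at this scale, and a gate with zero tolerance would reject nearly everything.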

Important Notice: Single-metric optimization often yields brittle models. Mandatory multi-metric evaluation is the guardrail against overfitting.

Summary: Treat FineWeb as the main benchmark but enforce multi-metric validation and generality-first policies to keep leaderboard gains meaningful and transferable.


✨ Highlights

Unique evaluation target: 16MB artifact and 10-minute training limit
Active leaderboard with numerous practical submissions
Repository metadata (license, languages, contributors) is incomplete or not public
The 8×H100 hardware requirement puts full-scale runs out of reach for many developers

🔧 Engineering

Open challenge platform centered on 16MB models and compression performance
Provides leaderboard, participant forms, and OpenAI compute grant channels

⚠️ Risks

Missing explicit license and tech-stack details may hinder adoption and contributions
Although training is capped at 10 minutes, the required high-end GPUs and engineering effort create a high barrier to entry

👥 For whom?

ML researchers and model engineers focused on parameter and compression optimization
University competitors and industrial research teams for prototyping and benchmark comparison