NanoGPT (124M): 8×H100 3-minute training record

A high-performance NanoGPT training speedrun project that combines the Muon optimizer, FP8, FlexAttention, and system-level optimizations to train a 124M-parameter model on 8×H100 in ~3 minutes. It suits teams pursuing extreme training speed, but is constrained by its hardware requirements and unclear licensing.

GitHub: KellerJordan/modded-nanogpt · Updated: 2025-10-15 · Branch: main · Stars: 4.4K · Forks: 582

PyTorch · Distributed training · FP8/BFloat16 · Performance optimization · FlexAttention · Muon optimizer · Docker · LLM training speedrun

💡 Deep Analysis

What specific training-efficiency problem does this project solve, and how does it cut the time to reach a FineWeb validation loss of ≤3.28 from tens of minutes to about three minutes?

Core Analysis

Project Positioning: This repo delivers an end-to-end “speedrun” to a clear benchmark (FineWeb validation loss ≤3.28) on 8×NVIDIA H100, minimizing both wall-clock time and tokens consumed.

Technical Features

Cross-layer co‑design: Joint changes to model (rotary, QK‑Norm, ReLU², value embeddings, skip connections), optimizer (Muon), and system (reduce_scatter, communication/computation overlap) yield multiplicative speedups.
Numerical/memory engineering: bfloat16 activations and FP8 for head matmuls reduce memory and bandwidth, improving hardware utilization.
Attention & information flow: FlexAttention with long-short sliding windows and window warmup increases context efficiency while lowering compute/communication.
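
To make the sliding-window and window-warmup idea concrete, here is a minimal sketch in plain Python. The predicate takes the same `(b, h, q_idx, kv_idx)` arguments that a FlexAttention `mask_mod` receives, but it works on plain integers rather than tensors (a tensor version would combine the conditions with `&`). The helper names, the window sizes (128 to 1024), and the linear schedule are illustrative assumptions, not values taken from the repo.

```python
def sliding_window_causal(window):
    """Return a mask predicate: a query may attend to a key only if the key
    is not in the future and lies within `window` positions behind it."""
    def mask_mod(b, h, q_idx, kv_idx):
        causal = kv_idx <= q_idx
        in_window = q_idx - kv_idx < window
        return causal and in_window
    return mask_mod


def window_at_step(step, total_steps, start=128, end=1024, block=128):
    """Hypothetical warmup schedule: grow the window linearly from `start`
    to `end`, rounded down to a multiple of `block`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    raw = start + frac * (end - start)
    return max(block, int(raw // block) * block)


if __name__ == "__main__":
    mask = sliding_window_causal(window=2)
    # Row 3 of the mask: only positions 2 and 3 are visible.
    print([mask(0, 0, 3, kv) for kv in range(5)])
    print([window_at_step(s, 100) for s in (0, 50, 100)])
```

A "long-short" layout would hand some layers a large `window` and others a small one; a short window early in training keeps attention cheap while the model is still learning local structure.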

Practical Recommendations

Reproduce incrementally: Enable modifications one-by-one following the repo’s historical records to isolate effects and prevent regressions.
Use the provided container: Run the Docker image to ensure consistent CUDA/NCCL/PyTorch versions and avoid environment-induced performance drops.
Smoke-test at small scale: Validate FP8/bfloat16 and Muon numerics on fewer GPUs/short runs before full speedruns.

Important: These optimizations are tightly tuned for 8×H100; they may not transfer to different GPU types or smaller clusters without retuning.

Summary: The core value is engineering co‑design across model, optimizer, numerics, and communication to drastically reduce time‑to‑target and tokens, making it a concrete reference for teams targeting the same benchmark.

When using FP8/bfloat16 low-precision strategies, how should one balance performance and numerical stability? What concrete debugging and monitoring steps are recommended?

Core Analysis

Question Core: How to balance performance gains from FP8/bfloat16 with numerical stability in practice.

Technical Analysis

Combined strategy: The project uses selective low‑precision (FP8 for head matmuls, bfloat16 for activations), plus zero‑init projection/classifier and logit softcap to mitigate early numerical issues.
Risk points: FP8 has a narrow dynamic range and few mantissa bits, so large gradients or poorly scaled values can overflow or lose precision; zero init combined with an aggressive LR can stall or destabilize training.
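
Two of the mitigations mentioned above can be sketched in a few lines. The first function is a tanh-based logit softcap, which smoothly bounds logits so that no single value dominates the loss. The second computes a per-tensor scale that maps the largest magnitude onto the FP8 E4M3 maximum (448), which is one common way to use the format's limited range. The cap of 30 and the function names are assumptions for illustration; the repo's actual values and scaling scheme may differ.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def softcap(logit, cap=30.0):
    """Smoothly squash a logit into (-cap, cap); near zero it is ~identity."""
    return cap * math.tanh(logit / cap)


def fp8_scale(values, fmt_max=E4M3_MAX):
    """Per-tensor scale so that max(|values|) * scale equals fmt_max.
    Multiply by the scale before casting to FP8, divide after the matmul."""
    amax = max(abs(v) for v in values)
    return fmt_max / amax if amax > 0 else 1.0


if __name__ == "__main__":
    print(softcap(0.1), softcap(1000.0))
    print(fp8_scale([0.5, -2.0]))
```

Small logits pass through almost unchanged while extreme ones saturate at the cap, which is why the softcap is cheap insurance for a low-precision output head.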

Practical Debugging & Monitoring Steps

Progressive enabling: Turn on bfloat16 in short single‑GPU runs first, then trial FP8 on the head, checking loss behavior at each step.
Key telemetry:
- Gradient norms and weight norms (detect explosion or vanishing)
- Layer output distributions (mean/std)
- Short‑term trends of validation and training loss
Automatic fallback: Implement logic to revert a layer to bfloat16/FP32 and reduce LR if gradient norms or validation loss cross thresholds.
Initialization & LR: Use zero init for projection/classifier and conservative warmup schedules as recommended to avoid large early updates.
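
The automatic-fallback step can be prototyped with a small monitor that watches the gradient norm. The sketch below is one possible design, not the repo's mechanism: it keeps an exponential moving average of the norm and signals a fallback when the norm is non-finite or spikes far above that average. The class name, the spike factor of 5, and the warmup length are all hypothetical.

```python
import math


class StabilityMonitor:
    """Flags steps where the gradient norm is non-finite or spikes."""

    def __init__(self, spike_factor=5.0, ema_decay=0.9, warmup_steps=10):
        self.spike_factor = spike_factor
        self.ema_decay = ema_decay
        self.warmup_steps = warmup_steps
        self.ema = None
        self.steps = 0

    def update(self, grad_norm):
        """Return True if the caller should fall back (e.g. switch a layer
        to bfloat16/FP32 or lower the LR), False otherwise."""
        if not math.isfinite(grad_norm):
            return True
        self.steps += 1
        if self.ema is None:
            self.ema = grad_norm
            return False
        spiked = (self.steps > self.warmup_steps
                  and grad_norm > self.spike_factor * self.ema)
        if not spiked:
            # Only healthy steps update the baseline.
            self.ema = (self.ema_decay * self.ema
                        + (1.0 - self.ema_decay) * grad_norm)
        return spiked


if __name__ == "__main__":
    monitor = StabilityMonitor(warmup_steps=3)
    for norm in [1.0, 1.1, 0.9, 1.0, 1.2, 50.0, float("nan")]:
        print(norm, monitor.update(norm))
```

In a training loop, the caller would compute the global gradient norm each step, pass it to `update`, and apply whatever fallback policy fits when it returns True. The same pattern extends to validation loss or per-layer activation statistics.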

Important: Do not enable FP8 across the whole network at once. On weaker interconnects or heterogeneous hardware, low‑precision benefits may be negated by communication jitter or numeric instability.

Summary: Low precision yields major speedups only if applied selectively, validated progressively, monitored closely, and coupled with safe fallback mechanisms.

What are the real benefits and limitations of the distributed/communication optimizations (reduce_scatter, overlap of communication/computation), and how to evaluate portability across different hardware?

Core Analysis

Question Core: What practical gains and limitations do communication optimizations bring in multi‑GPU training, and how portable are they across hardware?

Technical Analysis

Benefits:
- reduce_scatter leaves each GPU with only its shard of the reduced gradients, so with a ring algorithm it moves roughly half the data of a full all‑reduce.
- Communication/computation overlap can hide latency and increase device utilization.
Key dependencies:
- Interconnect characteristics (NVLink, InfiniBand) determine whether communication can be hidden.
- NCCL/driver versions and topology (intra‑node vs inter‑node) affect real communication behavior.
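
The bandwidth benefit listed above can be made concrete with the standard ring-collective cost model, in which an all-reduce is a reduce-scatter followed by an all-gather. The sketch below only counts bytes sent per rank; it ignores latency, topology, and NCCL's actual algorithm selection, so treat it as a back-of-the-envelope estimate. The function name is hypothetical.

```python
def ring_bytes_sent_per_rank(num_bytes, world_size, op):
    """Bytes each rank sends for a ring collective over a buffer of
    `num_bytes`, under the textbook ring cost model."""
    chunk_fraction = (world_size - 1) / world_size
    if op in ("reduce_scatter", "all_gather"):
        return chunk_fraction * num_bytes
    if op == "all_reduce":
        # all_reduce = reduce_scatter + all_gather
        return 2 * chunk_fraction * num_bytes
    raise ValueError(f"unknown op: {op}")


if __name__ == "__main__":
    grad_bytes = 124_000_000 * 2  # ~124M parameters in bfloat16
    for op in ("all_reduce", "reduce_scatter"):
        sent = ring_bytes_sent_per_rank(grad_bytes, 8, op)
        print(f"{op}: {sent / 1e6:.0f} MB sent per rank")
```

The saving only materializes if each rank then updates just its own shard of the parameters; if every rank still needs the full gradient, the all-gather has to happen anyway.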

How to evaluate portability (practical steps)

Micro‑benchmarks: Measure communication time (all‑reduce vs reduce_scatter), compute time, and GPU utilization on the target platform.
Topology sensitivity tests: Compare runs across placements (single‑node multi‑GPU vs cross‑node) to see how the communication share changes.
Profiling: Use NCCL debug logging (NCCL_DEBUG=INFO) and Nsight Systems/Nsight Compute to locate bottlenecks and synchronization points; the older nvprof does not support H100-class GPUs.
Incremental migration: Port low‑risk, high‑gain optimizations first (e.g., reduce_scatter), then more invasive overlap strategies.
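
Once the micro-benchmarks have produced per-step compute and communication times, a simple model helps judge whether overlap is worth porting. The sketch below assumes that only part of the compute can run concurrently with communication and that any communication beyond that window is fully exposed. The function and field names are hypothetical, and real overlap is rarely this clean.

```python
def estimate_step_time(compute_s, comm_s, overlappable_compute_s):
    """Estimate step time given measured compute time, communication time,
    and the portion of compute that can run while communication is in flight."""
    hidden = min(comm_s, overlappable_compute_s)
    exposed = comm_s - hidden
    step = compute_s + exposed
    return {
        "step_s": step,
        "exposed_comm_s": exposed,
        "comm_share": exposed / step,
    }


if __name__ == "__main__":
    # Fast interconnect: most communication hides behind compute.
    print(estimate_step_time(10.0, 4.0, 3.0))
    # Slow interconnect: same overlap window, far more exposed time.
    print(estimate_step_time(10.0, 12.0, 3.0))
```

If the exposed share stays high even with generous overlap, the interconnect is the bottleneck and the more invasive overlap strategies are unlikely to pay off on that hardware.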

Note: On low‑bandwidth or high‑latency interconnects, some optimizations can backfire and degrade performance due to increased complexity and synchronization.

Summary: Communication optimizations pay off on high‑end interconnects; always validate with micro‑benchmarks and topology‑aware profiling before porting to other hardware.

From a user perspective, what is the learning curve and common pitfalls when reproducing this speedrun? What actionable best practices reduce the risk of failed reproduction?

Core Analysis

Question Core: The user learning curve, common pitfalls, and practical best practices for reproducing the speedrun.

UX & Common Pitfalls

Steep learning curve: Requires knowledge of PyTorch (including torch.compile), low‑precision numerics, NCCL/distributed primitives, and GPU topology.
Environment/version sensitivity: Mismatched CUDA/NCCL/PyTorch/driver versions cause performance/functionality issues.
Numerical stability risks: FP8/bfloat16 and zero init combined with aggressive LR can lead to explosion or collapse if not monitored.
Debugging complexity: Multiple co‑dependent optimizations make isolating regressions hard.

Actionable Best Practices

Use provided Docker: Ensure consistent CUDA/NCCL/Python/PyTorch versions to avoid environment issues.
Enable changes incrementally: Start from baseline and flip on each optimization one by one while logging impact.
Small‑scale testing first: Validate numerics on single‑GPU or 2‑GPU short runs before full 8‑GPU runs.
Strict timing rules: Exclude first‑run torch.compile JIT compilation latency and warmup phases, per the repo’s timing conventions.
Automated monitoring & rollback: Track gradient norms, validation loss, and layer activations; auto‑fallback to higher precision or lower LR on anomalies.
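
The timing rule above can be enforced with a small harness that runs warmup steps untimed and starts the clock only afterwards. This is a generic sketch, not the repo's timing code; `timed_run` and its parameters are hypothetical. On a GPU, the caller would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before each clock read, since kernel launches are asynchronous.

```python
import time


def timed_run(step_fn, total_steps, warmup_steps):
    """Call step_fn(i) for i in range(total_steps), timing only the steps
    after the first `warmup_steps` (which absorb compilation and caching)."""
    for i in range(warmup_steps):
        step_fn(i)
    start = time.perf_counter()
    for i in range(warmup_steps, total_steps):
        step_fn(i)
    elapsed = time.perf_counter() - start
    timed_steps = total_steps - warmup_steps
    return elapsed, timed_steps


if __name__ == "__main__":
    def fake_step(i):
        # Stand-in for a training step; the first call is artificially slow.
        time.sleep(0.05 if i == 0 else 0.001)

    elapsed, n = timed_run(fake_step, total_steps=20, warmup_steps=2)
    print(f"{n} timed steps, {1000 * elapsed / n:.2f} ms/step")
```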

Important: Don’t run full speedruns on production clusters without isolated validation; use containers and logged scripts for repeatability.

Summary: Reproduction has a high barrier, but containerization, incremental activation, tight monitoring, and small‑scale validation reduce the risk to manageable levels.

Why was the Muon optimizer chosen? What are its advantages/disadvantages compared to Adam/SGD and what engineering caveats exist?

Core Analysis

Question Core: Muon was chosen to accelerate convergence while reducing optimizer-related overhead in distributed training (communication and state management), thereby shortening time‑to‑target.

Technical Analysis

Advantages:
- Convergence efficiency: The README shows significant time reductions after introducing Muon, indicating more effective per‑step updates.
- Distributed friendliness: Muon’s batched/distributed implementation reduces sync frequency and communication load by aggregating optimizer work.
- Numerics compatibility: Paired with zero initialization and muP‑like scaling, it can remain stable under low‑precision regimes.
Disadvantages/Risks:
- Implementation complexity: Requires engineered batched/distributed aggregation, raising the risk surface for bugs and maintenance cost.
- Hyperparameter sensitivity: Needs re‑tuning of LR/momentum for different models/datasets; poor settings can harm convergence.
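
For readers unfamiliar with what Muon does per step, the core idea is to replace each 2D momentum matrix with an approximately orthogonalized version before applying it, using a quintic Newton-Schulz iteration. The sketch below is a dependency-free illustration on lists of lists; real implementations run on GPU tensors in bfloat16 and add Nesterov momentum, shape-dependent scaling, and distributed batching, all omitted here. The coefficients are those published with Muon; the helper names and the simplified `muon_step` are assumptions.

```python
import math

# Quintic Newton-Schulz coefficients published with the Muon optimizer.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def newton_schulz(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize `grad` by pushing its singular values
    toward 1 (they end up near 1, not exactly 1)."""
    tall = len(grad) > len(grad[0])
    x = transpose(grad) if tall else [row[:] for row in grad]
    # Normalize by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    a, b, c = NS_COEFFS
    for _ in range(steps):
        gram = matmul(x, transpose(x))                    # A = X X^T
        gram_sq = matmul(gram, gram)                      # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]   # B = b*A + c*A^2
                for r1, r2 in zip(gram, gram_sq)]
        bx = matmul(poly, x)
        x = [[a * v + w for v, w in zip(r1, r2)]          # X <- a*X + B X
             for r1, r2 in zip(x, bx)]
    return transpose(x) if tall else x


def muon_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One simplified Muon update, in place: accumulate momentum,
    orthogonalize it, and apply it to the weights."""
    for i, row in enumerate(grad):
        for j, g in enumerate(row):
            momentum_buf[i][j] = momentum * momentum_buf[i][j] + g
    update = newton_schulz(momentum_buf)
    for i, row in enumerate(update):
        for j, u in enumerate(row):
            weight[i][j] -= lr * u
    return weight


if __name__ == "__main__":
    g = [[3.0, 0.0], [0.0, 0.5]]
    for row in newton_schulz(g):
        print([round(v, 3) for v in row])
```

In the example, the two diagonal entries differ by a factor of six going in but come out at similar magnitude. That equalization of update directions is the property usually credited for Muon's per-step efficiency, and it is also why Muon is applied only to 2D weight matrices while embeddings and scalars are left to another optimizer.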

Practical Recommendations

Stage deployment: Validate Muon on single‑GPU or small clusters before full 8‑GPU runs.
Monitor closely: Track gradient norms, optimizer state norms, and validation loss to detect numeric issues early.
Log optimizer internals: Retain optimizer statistics and step timings for post‑hoc analysis and rollbacks.

Note: Muon’s benefits depend on its distributed implementation; swapping it in-place without adapting communication/state handling may not yield improvements.

Summary: Muon is a practical optimizer to reduce time‑to‑target in distributed settings but requires extra engineering, monitoring, and careful hyperparameter tuning.

✨ Highlights

World record: ~3-minute training on 8×H100
Dockerized reproducible run scripts
Requires NVIDIA drivers and specific CUDA/NCCL setup
License unknown and no formal releases

🔧 Engineering

Achieves ~3-minute world-record training on 8×H100
Integrates Muon optimizer, FP8 head and FlexAttention optimizations
Provides reproducible run scripts and Docker-standardized environment

⚠️ Risks

Strong dependence on high-end hardware (8×NVIDIA H100), limited portability
License unclear and no releases, creating compliance and production uncertainties
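Repository metadata shows 0 contributors/commits, which conflicts with the recent update date and may reflect incomplete metadata; verify maintenance activity directly before relying on the project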

👥 For whom?

Targeted at high-performance training engineers and ML systems researchers
Suitable for teams with HPC, PyTorch and distributed optimization experience