DeepGEMM: High-throughput Tensor-Core GEMM & MoE kernel library

DeepGEMM is a high-performance CUDA kernel library for NVIDIA SM90/SM100 GPUs. It delivers JIT-compiled FP8/FP4/BF16 GEMMs and fused MoE/MQA primitives for LLM training and inference workloads that demand high throughput and low latency. It also imposes strict hardware and CUDA requirements, and its licensing and community sustainability remain open concerns.

GitHub: deepseek-ai/DeepGEMM | Updated: 2026-04-19 | Branch: main | Stars: 7.1K | Forks: 933

CUDA, PyTorch, Tensor Cores, GEMM, FP8/FP4/BF16, MoE, JIT compilation, High-performance inference

💡 Deep Analysis

What core performance bottlenecks in modern LLM inference/training does DeepGEMM address, and how does it solve them specifically?

Core Analysis

Project Positioning: DeepGEMM targets three concrete problems on the LLM critical path: high-throughput low-precision GEMMs (FP8/FP4/BF16), MoE segmented/sparse GEMM and communication overhead, and management of new low-precision formats (scales/layouts). It addresses these with a unified readable CUDA codebase plus runtime JIT to provide specialized high-performance kernels and lightweight tooling.

Technical Features (from evidence)

Multi-precision tensor cores: Native implementations for FP8, FP4, BF16 and FP8xFP4 inner products (README lists these).
Grouped/Masked GEMM API: Provides contiguous and masked layouts to match MoE expert partitioning and reduce copies.
Mega MoE fusion: Fuses dispatch, linear layers, and communication into a single kernel and overlaps NVLink transfers to reduce communication impact.
Lightweight JIT compilation: Runtime compilation avoids heavy NVCC-time builds, lowering integration friction (README: “compiled at runtime via a lightweight JIT module”).
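
To make the contiguous grouped layout concrete, the sketch below concatenates each expert's rows along M and pads every segment up to an alignment boundary. It is a plain-Python illustration, not DeepGEMM's API: the function name `build_contiguous_layout`, the alignment value, and the use of -1 to mark padding rows are all assumptions made for this example.

```python
def build_contiguous_layout(tokens_per_expert, alignment=128):
    """Return (m_indices, segment_offsets) for a contiguous grouped layout.

    m_indices[i] is the expert that owns row i, or -1 for a padding row.
    The alignment value and the -1 padding marker are illustrative assumptions.
    """
    m_indices = []
    segment_offsets = []
    for expert_id, count in enumerate(tokens_per_expert):
        segment_offsets.append(len(m_indices))
        padded = -(-count // alignment) * alignment  # round up to a multiple
        m_indices.extend([expert_id] * count)
        m_indices.extend([-1] * (padded - count))
    return m_indices, segment_offsets


# Three experts receiving 3, 0 and 5 tokens, with a toy alignment of 4.
m_indices, offsets = build_contiguous_layout([3, 0, 5], alignment=4)
print(offsets)         # [0, 4, 4]
print(len(m_indices))  # 12
```

Padding each segment keeps every expert's block boundary aligned, which is what lets a single grouped kernel walk the concatenated tensor without per-expert copies.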

Usage Recommendations

Prepare weights and scaling factors using the library tools and ensure proper M/K/N alignment before benchmarking.
For Mega MoE, enable symmetric memory and follow the multi-process examples to realize communication–compute overlap.
Validate numerical stability incrementally when adopting FP4/FP8 paths.
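
One way to validate numerics incrementally is to compare each low-precision output against a higher-precision reference with a single scalar metric. The sketch below uses a cosine-style difference; the function name `similarity_diff` and the tolerance of 1e-3 are assumptions for illustration and should be replaced by whatever threshold the model tolerates.

```python
def similarity_diff(reference, candidate):
    """Return 1 - 2*<x, y> / (|x|^2 + |y|^2); 0.0 means the inputs are identical."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(reference, candidate))
    denom = sum(x * x for x in reference) + sum(y * y for y in candidate)
    if denom == 0.0:
        return 0.0  # both inputs are all zeros
    return 1.0 - 2.0 * dot / denom


TOLERANCE = 1e-3  # hypothetical acceptance threshold

reference = [1.0, 2.0, 3.0]
low_precision = [1.0, 2.0, 3.5]  # stand-in for an FP8/FP4 result
diff = similarity_diff(reference, low_precision)
print("pass" if diff < TOLERANCE else f"fail: diff={diff:.4f}")
```

Running the check layer by layer, rather than only on the final logits, makes it easier to see where a low-precision path starts to drift.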

Caveats

Warning: DeepGEMM does not automatically perform input transposition or FP8 casting — those preprocessing steps must be implemented or fused in higher layers, otherwise correctness and performance will suffer.
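
As a rough picture of the upstream preprocessing this implies, the sketch below computes per-block scaling factors so that each block fits the FP8 E4M3 range (maximum finite magnitude 448). The block size of 128, the small clamp, and both function names are assumptions for this example; the code does not perform the FP8 cast itself.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def per_block_scales(row, block=128):
    """One scale per block so that block_values / scale fits the E4M3 range."""
    scales = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Clamp so that an all-zero block does not produce a zero scale.
        scales.append(max(amax, 1e-4) / FP8_E4M3_MAX)
    return scales


def apply_scales(row, scales, block=128):
    """Divide every element by the scale of the block it belongs to."""
    return [v / scales[i // block] for i, v in enumerate(row)]


row = [0.5, -2.0, 1.0, 896.0]
scales = per_block_scales(row, block=2)   # toy block size
scaled = apply_scales(row, scales, block=2)
print(scales[1])                     # 2.0
print(max(abs(v) for v in scaled))   # close to 448
```

In a real pipeline this step would typically be fused into the preceding kernel, since the caveat above notes that the library does not do it for you.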

Summary: DeepGEMM addresses LLM throughput and communication bottlenecks via specialized low-precision kernels, grouped GEMM APIs, and Mega MoE fusion, with a lightweight JIT to reduce integration cost — making it a practical choice for engineering teams optimizing LLMs on NVIDIA SM90/SM100.

What common mistakes occur when using DeepGEMM's FP8/FP4 paths regarding scaling factors and data layouts, and how can these be systematically avoided?

Core Analysis

Key Issue: Common mistakes when using FP8/FP4 paths stem from mismatched scaling-factor formats and layouts, inputs/weights not transposed or aligned as kernels expect, and assuming the library will perform preprocessing. These lead to incorrect results, kernel rejections, or severe performance drops.

Technical Analysis (from README evidence)

Scaling format differences: SM90 requires scaling factors in FP32, while SM100 requires packed UE8M0 (four UE8M0 values packed into a single torch.int). Supplying the wrong format can produce incorrect results, kernel rejections, or degraded performance.
Layout and alignment requirements: The library requires the LHS scaling factor to be TMA-aligned and transposed; SM90 implementation only supports NT layout (README).
No automatic preprocessing: README states input transposition and FP8 casting must be handled by the user; provided PyTorch utilities may be slow.
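
The packed UE8M0 format can be illustrated without a GPU. UE8M0 stores only an 8-bit unsigned exponent with bias 127, so a code e represents the scale 2^(e-127). The sketch below rounds each scale up to a power of two and packs four codes into one 32-bit word. The rounding direction, the byte order (lowest index in the lowest byte), and the function names are assumptions for this example, not a statement of the library's layout.

```python
import math

UE8M0_BIAS = 127


def to_ue8m0(scale):
    """Encode a positive scale as an 8-bit exponent, rounding up to a power of two."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    # frexp returns scale == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(scale)
    power = exponent - 1 if mantissa == 0.5 else exponent
    code = power + UE8M0_BIAS
    if not 0 <= code <= 254:  # 255 is reserved for NaN in E8M0
        raise ValueError("scale out of UE8M0 range")
    return code


def pack4(codes):
    """Pack four 8-bit codes into one 32-bit word (assumed byte order)."""
    if len(codes) != 4:
        raise ValueError("expected exactly four codes")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0xFF) << (8 * i)
    return word


codes = [to_ue8m0(s) for s in (1.0, 2.0, 0.25, 0.75)]
print(codes)              # [127, 128, 125, 127]
print(hex(pack4(codes)))  # 0x7f7d807f
```

When such a word is stored in a signed 32-bit tensor, a top byte of 128 or more makes the value appear negative; the bit pattern is unchanged.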

Practical Steps (procedural)

Identify target GPU early: Decide SM90 vs SM100 and pick the correct scaling format accordingly.
Use weight conversion tools: Employ DeepGEMM’s weight conversion/packing utilities to generate properly laid out weights and scales rather than hand-crafting conversions.
Validate alignment: Use the library’s alignment query APIs (e.g., get_mk_alignment_for_contiguous_layout()) to ensure M/K/N satisfy block alignments.
Perform transpose and FP8/FP4 casting at upper layers: Don’t rely on the library to do these for you — fuse or implement them upstream if needed.
Run comprehensive tests: Validate correctness and throughput with tests/test_*.py or your own regression suites before production roll-out.
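
The alignment step above can be sketched as a small pre-flight check that runs before any benchmark. The alignment values here are placeholders; in practice they would come from the library's alignment query mentioned above. The helper names `misaligned_dims` and `round_up` are hypothetical.

```python
def misaligned_dims(shape, alignments):
    """Return the names of dimensions that are not multiples of their alignment."""
    return [name for name, size in shape.items()
            if name in alignments and size % alignments[name] != 0]


def round_up(size, multiple):
    """Smallest multiple of `multiple` that is >= size."""
    return -(-size // multiple) * multiple


shape = {"M": 4000, "K": 7168, "N": 4096}
alignments = {"M": 128, "K": 128}  # placeholder values, not library constants

bad = misaligned_dims(shape, alignments)
print(bad)                                       # ['M']
print(round_up(shape["M"], alignments["M"]))     # 4096
```

Failing fast on a misaligned dimension is cheaper than debugging a rejected kernel launch or a silently slow path later.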

Important: Supplying FP32 scales to SM100 or UE8M0-packed scales to SM90 will result in format mismatches — strictly follow platform-specific requirements.

Summary: By targeting the correct architecture, using the provided packing tools and alignment checks, performing preprocessing upstream, and running full tests, you can systematically avoid the common FP8/FP4 pitfalls.

What hardware and CUDA version differences should be considered when using DeepGEMM on SM90 vs SM100, and how do these differences affect deployment and performance tuning?

Core Analysis

Key Issue: SM90 and SM100 differ in memory-layout support, scaling-factor format, and recommended CUDA versions. These differences directly affect data preprocessing, weight packing, and final performance.

Main Differences and Implications

Memory layouts: SM90 supports only the NT layout (non-transposed LHS, transposed RHS); SM100 supports all layouts (NT/TN/NN/TT).
Implication: On SM90 you must perform upstream transposes/packing to match the kernel; SM100 is more flexible and may reduce preprocessing overhead.
Scaling formats: SM90 requires scaling factors in FP32; SM100 requires packed UE8M0 (4 packed into one torch.int).
Implication: SM90 consumes more scale storage and involves FP32 conversions; SM100 requires packing/unpacking logic.
CUDA version and instruction behavior: SM90 needs CUDA 12.3 or newer, with 12.9+ recommended for best performance; SM100 needs CUDA 12.9+. Changes in NVCC/CUDA (e.g., FFMA interleaving in 12.9) can alter low-level performance.
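
These differences can be captured in a small lookup table that the preprocessing code consults once at startup. The sketch below is an assumption-laden illustration: the `ArchProfile` type, the `profile_for` helper, and the format labels are invented for this example, and the CUDA versions are treated as minimums based on the figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchProfile:
    name: str
    layouts: tuple      # memory layouts the kernels accept
    scale_format: str   # how scaling factors must be stored
    min_cuda: tuple     # (major, minor), assumed minimum toolkit version


PROFILES = {
    9: ArchProfile("SM90", ("NT",), "fp32", (12, 3)),
    10: ArchProfile("SM100", ("NT", "TN", "NN", "TT"), "ue8m0_packed", (12, 9)),
}


def profile_for(capability_major, cuda_version):
    """Look up the profile for a compute-capability major version.

    With PyTorch, the major version could come from
    torch.cuda.get_device_capability()[0].
    """
    profile = PROFILES.get(capability_major)
    if profile is None:
        raise ValueError(f"unsupported compute capability major: {capability_major}")
    if tuple(cuda_version) < profile.min_cuda:
        raise ValueError(f"{profile.name} needs CUDA {profile.min_cuda} or newer")
    return profile


print(profile_for(9, (12, 9)).scale_format)   # fp32
print(profile_for(10, (12, 9)).layouts)       # ('NT', 'TN', 'NN', 'TT')
```

Keeping the table in one place means the weight-packing and scale-handling code paths branch on a single value rather than on scattered architecture checks.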

Configuration and Tuning Recommendations

Customize for target GPU: For SM90, prepare transposed inputs and FP32 scales; for SM100, implement UE8M0 packing and use full-layout kernels.
Run separate benchmarks: Benchmark and validate numerics separately per architecture and CUDA version.
Choose JIT mode carefully: NVRTC can speed compilation but may change performance in some cases; test both modes for your shapes.
Tune runtime parameters: Use set_num_sms, set_tc_util, set_pdl to tune resource allocation per architecture.
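
Tuning a runtime knob usually reduces to a small sweep over candidate values. The sketch below shows the shape of such a sweep with a stand-in cost model so it runs without a GPU. The `sweep_knob` helper and `fake_measure` are hypothetical; in a real run, `measure` would apply the setting (for example via set_num_sms) and time the GEMM on the target shape.

```python
def sweep_knob(measure, candidates, repeats=3):
    """Return (best_value, best_seconds) over the candidate settings.

    `measure(value)` is a caller-supplied function that applies the setting
    and returns the elapsed time in seconds.
    """
    best_value, best_seconds = None, float("inf")
    for value in candidates:
        # Keep the fastest of several runs to reduce timing noise.
        seconds = min(measure(value) for _ in range(repeats))
        if seconds < best_seconds:
            best_value, best_seconds = value, seconds
    return best_value, best_seconds


def fake_measure(num_sms):
    """Stand-in cost model for illustration only: fastest at 96."""
    return 1.0 + abs(num_sms - 96) / 100.0


best, seconds = sweep_knob(fake_measure, [64, 96, 132])
print(best, seconds)  # 96 1.0
```

Because the best setting can differ between SM90 and SM100, the sweep is worth repeating per architecture and per CUDA version.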

Note: Supplying FP32 scales to SM100 or UE8M0-packed scales to SM90 will cause mismatches or performance anomalies.

Summary: Engineering teams should implement architecture-specific packing, scaling handling, and separate benchmarking/tuning workflows for SM90 vs SM100 to reliably achieve DeepGEMM’s expected performance.

What are the technical advantages and trade-offs of DeepGEMM's runtime JIT and single CUDA codebase design, and why not choose heavy template-based approaches like CUTLASS?

Core Analysis

Project Positioning: DeepGEMM uses a single, readable CUDA codebase and runtime JIT compilation to reduce integration complexity and increase customizability while aiming to retain performance comparable to heavy template-based libraries.

Technical Advantages

Lower install/integration cost: Avoids large NVCC-time template compilations, reducing environment and build complexity (README: “compiled at runtime via a lightweight JIT module”).
Readable and customizable: A single codebase is easier to inspect and modify compared to deep metaprogramming templates.
Faster support for new features: Runtime JIT makes it simpler to add FP4/FP8 support, Mega MoE fusion, and packing formats without rebuilding complex template hierarchies.

Trade-offs and Limitations

JIT compilation latency: Runtime compilation adds overhead the first time a kernel is loaded; NVRTC can shorten compile time but may affect kernel performance in some cases.
Runtime tuning required: Parameters like set_tc_util, set_num_sms, and set_pdl must be tuned for target shapes to reach peak performance.
Edge-case peak performance: For extremely specialized shapes, statically generated per-shape specializations (as in CUTLASS) may slightly outperform a JIT approach.

Practical Recommendations

For latency-sensitive deployments, precompile or warm up kernels at startup; test using NVRTC and measure any performance differences.
Benchmark against CUTLASS/vendor libs on the real target shapes and tune runtime knobs for fairness.
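
The warm-up recommendation can be sketched as a startup loop that calls the kernel once per representative shape, so that compilation cost is paid before serving traffic. `FakeJitKernel` below is a stand-in that simulates a compile cache keyed by shape; how a real JIT keys its cache is an assumption here, and `warm_up` is a hypothetical helper.

```python
import time


class FakeJitKernel:
    """Stand-in for a JIT-compiled kernel; 'compiles' once per new shape."""

    def __init__(self):
        self.compiled = set()
        self.compile_count = 0

    def __call__(self, m, n, k):
        if (m, n, k) not in self.compiled:
            self.compile_count += 1  # a real JIT would invoke the compiler here
            self.compiled.add((m, n, k))
        return m * n * k


def warm_up(kernel, shapes):
    """Call the kernel once per shape and record the first-call latency."""
    first_call_seconds = {}
    for shape in shapes:
        start = time.perf_counter()
        kernel(*shape)
        first_call_seconds[shape] = time.perf_counter() - start
    return first_call_seconds


kernel = FakeJitKernel()
shapes = [(128, 4096, 7168), (256, 4096, 7168)]
timings = warm_up(kernel, shapes)
kernel(128, 4096, 7168)       # already warm: no further compilation
print(kernel.compile_count)   # 2
```

Logging the first-call latencies also gives a baseline for comparing compilation modes on the shapes that matter.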

Important: JIT improves engineering flexibility but does not automate upper-layer preprocessing (input transposes/FP8 casts still required).

Summary: DeepGEMM makes a pragmatic trade-off favoring maintainability and fast iteration via JIT and a compact codebase, which is ideal when integration cost and customizability matter. For absolute peak throughput across many specialized shapes, heavy template-based solutions may still win.

How can Mega MoE in DeepGEMM be used to overlap communication and computation in multi-process/distributed MoE scenarios, and what prerequisites and configurations are required?

Core Analysis

Key Point: Mega MoE fuses dispatch, linear layers, and communication into a single mega-kernel and overlaps NVLink transfers under symmetric shared-memory multi-process setups to increase throughput. Achieving this requires specific system and data prerequisites.

Technical Analysis

Runtime prerequisites: Symmetric shared memory and correctly established multi-process groups (e.g., via torch.distributed) — the README emphasizes running Mega MoE in symmetric memory scenarios.
Data layout and API requirements: Use the library’s grouped/masked GEMM APIs (contiguous or masked) to match expert routing. Ensure inputs/outputs and scaling factors meet alignment and format mandates (SM90 vs SM100 differences).
Verification points: Use hardware/NVLink profiling to confirm that communication and compute are overlapping and that no synchronization or memory-bandwidth bottlenecks dominate.

Practical Integration Steps

Set up multi-process environment: Use torch.distributed or equivalent, ensure processes share symmetric memory and proper groups are created (follow the library examples).
Prepare data and weights: Use DeepGEMM’s conversion tools to produce aligned, contiguous/masked-packed weights and pack scaling factors per platform.
Call Mega MoE interface: Pass symmetric buffer handles and routing metadata to the mega-kernel API, ensuring alignment constraints are satisfied.
Benchmark and tune: Run microbenchmarks on target hardware (SM90/SM100) and tune set_num_sms, set_tc_util, set_pdl to maximize overlap.
Validate: Use NVIDIA Nsight or similar to observe NVLink activity and GPU compute utilization to confirm overlapping.
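
For the validation step, overlap can be quantified once compute and communication intervals have been exported from a profiler trace. The sketch below measures how much time both kinds of activity were active at once. The interval format (sorted, non-overlapping `(start, end)` pairs per list) and the function name are assumptions about how the trace would be preprocessed.

```python
def overlapped_time(compute, comm):
    """Total time during which a compute interval and a comm interval are both active.

    Each argument is a list of (start, end) pairs, sorted by start and
    non-overlapping within that list.
    """
    i = j = 0
    total = 0.0
    while i < len(compute) and j < len(comm):
        start = max(compute[i][0], comm[j][0])
        end = min(compute[i][1], comm[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if compute[i][1] < comm[j][1]:
            i += 1
        else:
            j += 1
    return total


compute = [(0, 10), (12, 20)]   # toy timestamps in microseconds
comm = [(5, 14)]
overlap = overlapped_time(compute, comm)
comm_total = sum(end - start for start, end in comm)
print(overlap)                  # 7.0
print(overlap / comm_total)     # fraction of communication hidden behind compute
```

A fraction close to 1.0 suggests communication is mostly hidden behind compute, while a low value points to serialization that further tuning should target.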

Important: Incorrect symmetric memory configuration or ignoring alignment/scale format differences can break Mega MoE or degrade performance.

Summary: Mega MoE can reduce MoE communication costs substantially, but only when symmetric memory, strict data layouts/alignments, and hardware-specific runtime tuning are in place to enable true communication–computation overlap.

What specific use cases is DeepGEMM best suited for, and when should alternative solutions (e.g., CUTLASS or vendor internal libs) be considered instead?

Core Analysis

Key Point: Choosing DeepGEMM depends on target hardware, needs for low-precision/MoE optimizations, tolerance for integration complexity, and whether cross-platform or extreme-shape peak performance is required.

Best-fit Use Cases

Teams optimizing LLM critical path: Need FP8/FP4/BF16 GEMMs, FP8xFP4 inner products, MQA scoring, or Mega MoE fusion.
MoE-heavy systems: Require grouped/masked GEMMs and communication–compute overlap in distributed MoE.
Rapid iteration and customization: Prefer a readable CUDA codebase to experiment with new low-precision ops and fusion strategies.

When to Consider Alternatives

Heterogeneous/older or non-NVIDIA platforms: DeepGEMM targets SM90/SM100 only; cross-platform needs favor more general libraries.
Absolute per-shape peak performance: For many extreme/odd matrix shapes, compile-time specialization (CUTLASS) or vendor-tuned libraries may yield slightly better peak throughput.
Licensing/compliance needs: README lacks an explicit license; confirm authorization before production—if unacceptable, choose libraries with clear licensing.

Evaluation Steps

Run benchmarks for representative shapes on your target hardware comparing DeepGEMM vs CUTLASS/vendor libs.
Validate numeric stability, especially for FP4/FP8.
Decide: if DeepGEMM meets throughput and reduces integration cost, adopt it; otherwise prefer CUTLASS or vendor implementations.
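
The comparison in these steps is easier to read when timings are converted to achieved throughput. A dense GEMM of shape M x N x K performs 2*M*N*K floating-point operations, so the sketch below converts a measured time to TFLOPS and picks the fastest backend for a shape. The backend names and timings are made-up placeholders, and `fastest_backend` is a hypothetical helper.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOPS for a dense GEMM: 2*M*N*K operations over the elapsed time."""
    return 2.0 * m * n * k / seconds / 1e12


def fastest_backend(timings):
    """Return the backend name with the smallest measured time."""
    return min(timings, key=timings.get)


m = n = k = 4096
timings = {"deepgemm": 1.0e-3, "baseline": 1.2e-3}  # placeholder measurements

for name, seconds in timings.items():
    print(f"{name}: {gemm_tflops(m, n, k, seconds):.1f} TFLOPS")
print(fastest_backend(timings))
```

Repeating this per representative shape, rather than averaging across shapes, shows where an alternative library wins on specific sizes.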

Note: DeepGEMM fills a gap by lowering integration and readability barriers compared to heavy-template libraries, but be cautious for cross-platform or extreme-shape peak-performance needs.

Summary: DeepGEMM is well-suited for SM90/SM100-targeted teams optimizing low-precision and MoE workloads who value lower integration cost and customizability. For cross-architecture support or the absolute edge-case peak performance, evaluate CUTLASS or vendor libraries.

✨ Highlights

High-performance GEMM kernels optimized for Tensor Cores
Runtime JIT compilation; no CUDA compilation required at install time
Strong dependency on SM90/SM100 architectures and specific CUDA versions
License unknown; contributor and release metadata are incomplete

🔧 Engineering

Supports FP8/FP4/BF16 and fused MoE, MQA and other key LLM primitives
Lightweight JIT C++ module claiming low CPU overhead and support for SM90/SM100

⚠️ Risks

High requirements for specific hardware and CUDA versions; migration/compatibility cost is significant
Repository license is unknown and contributor and release records are missing, posing compliance and maintainability risks
Complex numeric formats (FP8/FP4) and required preprocessing must be handled by users, raising the adoption barrier

👥 For whom?

Targeted at LLM inference/training systems engineers and GPU kernel optimization engineers
Suitable for teams and enterprises requiring extreme throughput or custom MoE/indexer kernels