💡 Deep Analysis
What is the core problem this project solves and how is it implemented technically?
Core Analysis
Project Positioning: 1brc aims to demonstrate and compare how far Java/JVM implementations can be pushed to aggregate one billion rows from a deterministic text format (`station;value`, with exactly one decimal digit). The repository is a reproducible performance playground rather than a general-purpose library.
Technical Features
- Low-allocation byte-level parsing: Convert the fixed-one-decimal values into integers (×10) and parse bytes directly, avoiding intermediate `String`/`Float` allocations; this reduces GC pressure and increases throughput.
- Minimized memory & object reuse: Use pooling, off-heap buffers, or `Unsafe` to accumulate stats and reduce heap churn.
- Parallel/sharded processing: Partition the file or stations to saturate multi-core CPUs.
- Native execution (GraalVM native-image): Eliminates JVM dynamic overhead and startup delay; top entries leveraged native images for second-scale runs.
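The byte-level integerization idea can be sketched as follows; `LineParser` is a hypothetical helper (not code from the repo), assuming the 1brc input guarantee of exactly one fractional digit and an optional leading minus:

```java
import java.nio.charset.StandardCharsets;

// Allocation-free parsing of the value part of a "station;value" line into an
// integer number of tenths (e.g. "-12.3" -> -123). No String or Float is created.
final class LineParser {

    /** Parses the temperature in buf[pos, end), where pos is just past the ';'. */
    static int parseTenths(byte[] buf, int pos, int end) {
        int sign = 1;
        if (buf[pos] == '-') { sign = -1; pos++; }
        int value = 0;
        for (; pos < end; pos++) {
            byte b = buf[pos];
            if (b != '.') {                 // skip the single decimal point
                value = value * 10 + (b - '0');
            }
        }
        return sign * value;                // "34.7" -> 347
    }

    public static void main(String[] args) {
        byte[] line = "Hamburg;-12.3".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parseTenths(line, 8, line.length)); // prints -123
    }
}
```

Downstream code keeps everything in tenths and divides by 10 only when formatting the final min/mean/max.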
Practical Recommendations
- Start with the safe implementation: Validate correctness with the maintainable version before applying extreme optimizations.
- Adopt the parsing patterns: Integerization and byte-level parsing are transferable techniques even if you avoid `Unsafe`.
- Match the evaluation environment: Use the provided scripts/Docker and match the JDK/Graal version and hardware when reproducing leaderboard times.
Important Notice: Top-performing entries prioritize throughput over maintainability and portability (they often depend on `Unsafe` and native-image). Use them as technical references, not production drop-ins.
Summary: 1brc provides a reproducible, multi-implementation platform that demonstrates concrete steps (and trade-offs) to maximize single-machine throughput for large-scale text parsing and aggregation on the JVM.
Why does the project favor the 'integerization + byte-level parsing + low-allocation' approach, and what are the concrete benefits of these techniques?
Core Analysis
Core Question: Why replace float parsing with integerization and favor byte-level parsing plus low-allocation strategies?
Technical Analysis
- Deterministic input matters: Because each measurement has exactly one decimal digit, values can be multiplied by 10 and represented as integers, avoiding the heavy float-parsing path.
- Avoid short-lived objects: Typical parsing creates many `String` or boxed values, triggering frequent GC and throttling throughput. Byte-level parsing operates directly on buffers and avoids these allocations.
- Faster arithmetic and accumulation: Integer accumulators (sum/count/min/max) are cheap and can be implemented with 64-bit primitives and minimal synchronization.
- Better cache behavior: Native arrays or off-heap layouts are friendlier to CPU caches and prefetching than many small objects, improving throughput further.
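As a sketch of the integer-accumulation point (an illustrative class, not the repo's), per-station statistics fit in four primitive fields and merge without locks:

```java
// Per-station accumulator using only primitives: values arrive as tenths (int),
// the running sum fits in a long even for one billion rows, and per-thread
// instances can be merged at the end instead of synchronizing on every add.
final class Stats {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    long sum;    // sum of tenths
    long count;

    void add(int tenths) {
        if (tenths < min) min = tenths;
        if (tenths > max) max = tenths;
        sum += tenths;
        count++;
    }

    void merge(Stats other) {  // combine two shards' results
        if (other.min < min) min = other.min;
        if (other.max > max) max = other.max;
        sum += other.sum;
        count += other.count;
    }

    double mean() { return (sum / 10.0) / count; }
}
```

Because each thread owns its own `Stats` map, the only cross-thread step is the final `merge`, which keeps the hot path free of contention.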
Practical Recommendations
- Prefer integerization when possible: If the format allows, converting fixed decimals to integers is a low-cost, high-payoff optimization.
- Encapsulate a byte-level parser: Create a reusable parser module rather than copying parsing logic around.
- Optimize in phases: Start with correctness and maintainability, then introduce byte-level parsing and allocation reduction in hotspots.
Important Notice: These optimizations rely on strict input guarantees. If input can be malformed or requires higher precision, blind integerization or skipping checks risks incorrect results.
Summary: For well-formed, deterministic large-scale text parsing, integerization + byte-level parsing + low allocation yields the most direct and effective performance improvements—explaining the success of top 1brc submissions.
How can one reproducibly replicate the leaderboard results across different hardware and JDK versions? What are the critical points?
Core Analysis
Core Question: How to reproducibly replicate leaderboard results?
Technical Analysis (Critical Variables)
- Hardware characteristics: CPU microarchitecture, core count, cache sizes, NUMA layout and memory bandwidth materially affect throughput. The leaderboard ran on Hetzner AX161 (AMD EPYC 7502P).
- JDK / Graal version: Top entries used `21.0.2-graal` and native-image; mismatches here can create large performance differences.
- I/O and file caching: Disk and filesystem caching behavior, or the use of memory-mapped I/O, affects read speed; be explicit about prewarming and cache state.
- System settings: CPU frequency governors, CPU pinning, cgroups, HugePages influence stability and peak performance.
Practical Steps to Reproduce
- Use the provided scripts/Dockerfile: Start with the repo’s automation to reduce environmental differences.
- Match JDK/Graal and build flags: Exactly reproduce the native-image build and runtime versions listed in the results.
- Fix system-level configuration: Disable power save, set CPU affinity, ensure no competing workloads.
- Run multiple trials and use robust statistics: Report median/min times across runs to reduce noise.
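The last step can be sketched as a tiny harness (illustrative, not the repo's evaluation script) that times a task several times and reports the median:

```java
import java.util.Arrays;

// Runs a task several times and returns the median wall-clock time in
// nanoseconds, which is less noise-sensitive than a single measurement.
final class Trials {
    static long medianNanos(Runnable task, int runs) {
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = System.nanoTime() - start;
        }
        Arrays.sort(times);
        return times[runs / 2];
    }
}
```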
Important Notice: Even with strict matching, minor differences (kernel updates, BIOS settings) may cause variance. Treat the leaderboard as conditionally comparable, not absolute.
Summary: Reproducing leaderboard numbers requires precise alignment of hardware, runtime, and OS settings; the repo’s scripts and certificates are essential baselines.
What is the learning curve and common pitfalls for this project? What should I watch out for when getting started?
Core Analysis
Core Question: What is the real onboarding cost and common pitfalls for 1brc?
Technical Analysis (Learning Curve & Pitfalls)
- Learning curve:
  - Low barrier to start: The repo contains readable, safe implementations for learning the task.
  - High cost to reach top performance: Reproducing top entries requires `Unsafe`, off-heap memory, memory-mapped I/O, GraalVM native-image, multi-threading/NUMA optimizations, and GC tuning.
- Common pitfalls:
  - Portability issues: Relying on `sun.misc.Unsafe` or specific Graal versions can break across JVMs/OSes.
  - Correctness risks: Skipping input validation or rounding logic for speed can produce incorrect aggregates.
  - Non-stable performance: Gains are sensitive to hardware, kernel, and JDK; leaderboard times are conditional.
Practical Onboarding Steps
- Run the safe implementation and validate correctness: Use provided samples and write unit/e2e tests.
- Profile to find hotspots: Optimize only hot paths (don’t micro-optimize prematurely).
- Introduce platform-dependent techniques incrementally: Isolate `Unsafe` or native-image usage into well-tested modules.
- Validate on target hardware: Perform a full regression on production-like machines before shipping optimizations.
Important Notice: Don’t blindly copy extreme implementations into production. Extract transferable patterns (integerization, allocation reduction, sharding) and avoid unstable APIs.
Summary: 1brc is easy to start but expensive to master. A phased approach with strong testing reduces risk and yields practical gains.
Are these extreme optimizations suitable for direct production use? In what scenarios are they worth adopting, and when should they be avoided?
Core Analysis
Core Question: Should the extreme optimizations from 1brc be directly migrated into production?
Technical Analysis (Applicability & Limits)
- Appropriate scenarios:
  - Controlled offline batch: Fixed hardware and single-tenant machines (e.g., nightly ETL) where specialized tuning is acceptable.
  - Single-machine throughput bottlenecks: When per-node throughput drives cost and the team can bear higher maintenance.
  - Research/POC: To validate feasibility and quantify gains.
- Not recommended:
  - Multi-tenant cloud environments: Restricted permissions and variable hardware make Unsafe/native-image approaches fragile.
  - Long-lived, maintainable systems: Teams that require readable, portable code should avoid complex low-level tricks.
Practical Migration Guidance
- Extract transferable techniques: Integerization, allocation reduction, and sharding are safe to migrate.
- Isolate unstable APIs: If `Unsafe` or native-image is needed, encapsulate it in audited modules with a fallback.
- Add heavy validation and regression tests: Cover rounding and parsing edge cases and test across different hardware.
- Weigh maintenance cost vs performance: Quantify hardware savings vs increased engineering burden.
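The isolation-with-fallback advice might look like the following sketch; all class names here are hypothetical, and the fast path is loaded reflectively so the portable build carries no hard dependency on it:

```java
// A stable interface with a safe default; the risky implementation is opted
// into via a system property and loaded by name, falling back if unavailable.
interface Aggregator {
    void process(byte[] chunk);
}

final class SafeAggregator implements Aggregator {
    @Override
    public void process(byte[] chunk) {
        // portable byte-level parsing would go here
    }
}

final class Aggregators {
    static Aggregator create() {
        if (Boolean.getBoolean("fast.path.enabled")) {
            try {
                // hypothetical Unsafe-based implementation, shipped separately
                return (Aggregator) Class.forName("UnsafeAggregator")
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                // fall through to the portable implementation
            }
        }
        return new SafeAggregator();
    }
}
```

The fallback keeps the default build portable; the fast path only runs where it has been explicitly enabled and audited.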
Important Notice: Don’t let contest results drive production decisions alone—balance performance with maintainability, portability, and security.
Summary: Extreme optimizations are useful in controlled or research contexts. For production, prioritize migrating robust parsing and allocation strategies and confine risky low-level techniques.
In terms of parallelism and I/O strategies, what are the trade-offs between memory-mapped I/O, direct I/O and streaming reads? How to choose for a task like 1brc?
Core Analysis
Core Question: How to choose between memory-mapped I/O, direct I/O and streaming reads for large sequential read tasks like 1brc?
Technical Trade-offs
- Memory-mapped I/O (`MappedByteBuffer`)
  - Pros: Near zero-copy semantics; treating the file as memory gives good cache locality and high throughput on large-memory machines.
  - Cons: Page-fault handling complexity, virtual address-space pressure, and concurrency caveats.
- Direct I/O
  - Pros: Bypasses the kernel page cache for stable, predictable disk bandwidth, which is useful for controlled benchmarks.
  - Cons: Requires aligned buffers, is more complex and inconsistent across platforms, and may not always be faster.
- Streaming reads (buffered/channel reads)
  - Pros: Simple, portable, and maintainable; large buffers reduce syscall frequency.
  - Cons: Still involves kernel/user copies and potentially more syscalls, so it may underperform mmap or direct I/O at the extremes.
Practical Guidance for 1brc
- If ample memory and permissions exist: Prefer `MappedByteBuffer` for minimal copying and the best cache behavior.
- If you need measurement stability or must bypass caches: Consider direct I/O, but be ready for alignment and portability work.
- If portability matters or the environment is restricted: Use `FileChannel` + a large `ByteBuffer` as a robust compromise.
- Always combine with parallel sharding: Partition the file and aggregate locally per shard to avoid global contention.
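A minimal sketch of the recommended mmap-plus-sharding combination (note that a single `MappedByteBuffer` is limited to `Integer.MAX_VALUE` bytes, so a real 1brc-sized file needs several mapped regions; the shard-boundary logic is the transferable part):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Maps a file read-only, then splits it into roughly equal shards whose ends
// are pushed forward to the next '\n' so no line straddles two shards.
final class ShardedMmap {

    static long[][] shardOffsets(ByteBuffer buf, int shards) {
        long size = buf.limit();
        long[][] ranges = new long[shards][2];
        long start = 0;
        for (int i = 0; i < shards; i++) {
            long end = (i == shards - 1) ? size : Math.max(start, (size * (i + 1)) / shards);
            while (end > start && end < size && buf.get((int) end - 1) != '\n') {
                end++;                      // extend to the end of the current line
            }
            ranges[i][0] = start;
            ranges[i][1] = end;
            start = end;
        }
        return ranges;                      // each [start, end) goes to one thread
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) return;       // e.g. pass measurements.txt
        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long[][] shards = shardOffsets(buf, Runtime.getRuntime().availableProcessors());
            for (long[] s : shards) System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

Each shard can then be aggregated by its own thread into a local `Stats`-style map, with a final merge at the end.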
Important Notice: IO performance varies greatly across filesystems and kernel versions—benchmark in the target environment.
Summary: For 1brc-like sequential, read-only workloads, memory-mapped I/O (where available) or large-block `FileChannel` reads are preferred; direct I/O is reserved for cases demanding strict control over caching.
✨ Highlights
- Includes a leaderboard and certificates, driving a community optimization contest
- Clear task and input format with reproducible evaluation
- Top implementations rely on Unsafe/Graal, limiting portability
- Few maintainers and commits, no releases; reproduction requires specific hardware
🔧 Engineering
- Java-centered high-performance aggregation benchmark and collection of implementations
- Provides a unified data format, evaluation scripts, and validated result certificates
- Sample implementations cover optimizations from pure Java to Graal native images
⚠️ Risks
- Reliance on Unsafe or native images can cause platform-compatibility and safety issues
- Results are hardware- and scheduling-sensitive; reproducing experiments requires a similar environment
- No formal releases and limited contributors imply uncertainty about long-term maintenance
👥 For who?
- Performance engineers and systems programmers pursuing extreme optimizations and implementation comparisons
- Researchers and educators teaching high-performance I/O and parallel aggregation techniques