Variance Categories
Variance can be separated into different categories, which helps with understanding and fixing regressions. The categories include:
- Compiler/Linker variance: Whenever the built binary changes, code can end up being executed differently.
- Cache variance: This describes variance caused by different cache behavior. In CI, each benchmark process typically runs once per commit, so cold-cache effects can influence results.
- State-dependent variance: This describes all the variance that is caused by changing the underlying state of the system.
  - Allocator variance: Allocators can execute different code paths depending on their current state. Changing the memory fragmentation at an earlier point in time can cause variance in benchmarks that are executed later.
- Environment variance: Variance caused by the runtime environment.
  - CPU variance: If code behaves differently based on the CPU, variance can be introduced. This happens in heavily optimized libraries/programs that might try to detect cache sizes, CPU features, or the number of CPU cores.
  - Kernel variance: Syscalls can cause variance in benchmarks, as the kernel might execute different code paths depending on the current state of the system.
Strategies
One benchmark, one binary
Most of the issues come from multiple benchmarks being written and run in the same binary. Seemingly unrelated changes to the code can cause ripple effects that are hard to track down. To fix this, we can compile each benchmark into its own binary. This removes unrelated variance, as compilers (usually) produce the same binary when given the same input. The only downside to this approach is the increased compilation/linking overhead: for N benchmarks, we have to compile N binaries. We only recommend this approach for micro-benchmarks which observe a significant amount of variance.
How to implement in Rust
In Rust, this can be done by adding a feature flag for each benchmark, which allows us to compile each benchmark into its own binary.
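A sketch of what this can look like in Cargo.toml, assuming two hypothetical benchmarks named parse and serialize (the names, and the use of required-features on the [[bench]] targets, are illustrative):

```toml
# One feature per benchmark, so each one can be enabled on its own.
[features]
bench_parse = []
bench_serialize = []

# One [[bench]] target per benchmark; `required-features` ensures the
# target is only built when its feature is enabled.
[[bench]]
name = "parse"
harness = false
required-features = ["bench_parse"]

[[bench]]
name = "serialize"
harness = false
required-features = ["bench_serialize"]
```

Each benchmark then gets its own binary, e.g. cargo bench --bench parse --features bench_parse.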
How to implement in C++
When using C++, we can achieve this by wrapping each BENCHMARK() in a define. This allows us to conditionally include or exclude benchmarks at build time.
Reducing allocator variance
Tune your allocator
Most allocators expose configuration options that affect determinism. Reducing the number of arenas, disabling caches, and controlling page-purging behavior can all help stabilize benchmark results. Refer to the allocator documentation for available options:
- glibc: mallopt
- jemalloc: MALLOC_CONF
  - dirty_decay_ms:-1,muzzy_decay_ms:-1: This disables returning unused pages back to the OS, which can otherwise randomly slow down benchmarks.
- tcmalloc:
  - SetProfileSamplingInterval(MAX): Disables heap profile sampling.
  - SetGuardedSamplingInterval(-1): Disables GWP-ASan guarded sampling, which otherwise probabilistically guards allocations to detect buffer overflows and use-after-free.
  - SetBackgroundProcessActionsEnabled(false): Disables background memory-release actions that can cause timing variance.
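As a concrete example for jemalloc, the options above are passed through the MALLOC_CONF environment variable; the benchmark binary name below is a placeholder:

```shell
# Keep unused dirty/muzzy pages resident instead of returning them to
# the OS at unpredictable times (jemalloc's decay-based purging).
export MALLOC_CONF="dirty_decay_ms:-1,muzzy_decay_ms:-1"

# Then run the benchmark in this environment, e.g.:
#   ./my-benchmark   (hypothetical binary name)
echo "$MALLOC_CONF"
```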
Use a custom allocator
In many cases, variance is caused by realloc, which either grows the allocation in place or creates a new allocation and copies the contents of the previous one into it. Whether in-place growing succeeds depends on the OS memory state, making it completely non-deterministic. To fix this, we can always run the slow path that never grows in place.
Here is an example in Rust (adapted from
oxc),
which uses the default implementation of
GlobalAlloc::realloc
that always allocates and then copies the memory.
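A minimal sketch of that pattern (the wrapper type name and the Vec usage are illustrative, not taken from oxc): by only forwarding alloc and dealloc to the system allocator, the default GlobalAlloc::realloc kicks in, which always allocates a fresh block, copies the bytes, and frees the old one.

```rust
use std::alloc::{GlobalAlloc, Layout, System};

// Wrapper allocator that forces `realloc` down the slow path:
// always allocate a new block and copy, never grow in place.
struct DeterministicAlloc;

unsafe impl GlobalAlloc for DeterministicAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `realloc` is deliberately NOT forwarded to `System`: the default
    // trait implementation allocates, copies, then deallocates, so it
    // never depends on whether the OS could grow the block in place.
}

#[global_allocator]
static GLOBAL: DeterministicAlloc = DeterministicAlloc;

fn main() {
    let mut v = Vec::with_capacity(4);
    v.extend_from_slice(&[1, 2, 3, 4]);
    v.push(5); // triggers a realloc through the deterministic slow path
    assert_eq!(v, [1, 2, 3, 4, 5]);
    println!("ok");
}
```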
We’re actively exploring how to implement this in our integrations. If you have further questions, please reach out to us via Discord or email.