Variance Categories
Variance can be separated into different categories, which helps with understanding and fixing regressions. The categories include:
- Compiler/Linker variance: Whenever the built binary changes, code can end up being executed differently.
- Cache variance: This describes variance caused by different cache behavior. In CI, each benchmark process typically runs once per commit, so cold-cache effects can influence results.
- State-dependent variance: This describes all the variance that is caused by changing the underlying state of the system.
  - Allocator variance: Allocators can execute different code paths depending on their current state. Changing the memory fragmentation at an earlier point in time can cause variance in benchmarks that are executed later.
- Environment variance: Variance caused by the runtime environment.
  - CPU variance: If code behaves differently based on the CPU, variance can be introduced. This happens in heavily optimized libraries/programs that might try to detect cache sizes, CPU features, or the number of CPU cores.
  - Kernel variance: Syscalls can cause variance in benchmarks, as the kernel might execute different code paths depending on the current state of the system.
Strategies
One benchmark, one binary
Most of the issues come from multiple benchmarks being written and run in the same binary. Seemingly unrelated changes to the code can cause ripple effects that are hard to track down. To fix this, we can compile each benchmark into its own binary. This removes unrelated variance, as compilers (usually) produce the same binary when given the same input. The only downside to this approach is the increased compilation/linking overhead: for N benchmarks, we have to compile N binaries. We only recommend this approach for micro-benchmarks which observe a significant amount of variance.
How to implement in Rust
In Rust, this can be done by adding a feature flag for each benchmark, which allows us to compile each benchmark into its own binary.
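A sketch of what this can look like in Cargo.toml, assuming two hypothetical benchmarks named parse and serialize (the names, and the use of required-features on the [[bench]] targets, are illustrative):

```toml
# One feature per benchmark, so each one can be enabled on its own.
[features]
bench_parse = []
bench_serialize = []

# One [[bench]] target per benchmark; `required-features` ensures the
# target is only built when its feature is enabled.
[[bench]]
name = "parse"
harness = false
required-features = ["bench_parse"]

[[bench]]
name = "serialize"
harness = false
required-features = ["bench_serialize"]
```

Each benchmark then gets its own binary, e.g. cargo bench --bench parse --features bench_parse.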
How to implement in C++
When using C++, we can achieve this by wrapping each BENCHMARK() in a define. This allows us to conditionally include or exclude benchmarks at build time.
Reducing allocator variance
Tune your allocator
Most allocators expose configuration options that affect determinism. Reducing the number of arenas, disabling caches, and controlling page-purging behavior can all help stabilize benchmark results. Refer to the allocator documentation for available options:
- glibc: mallopt
- jemalloc: MALLOC_CONF
  - dirty_decay_ms:-1,muzzy_decay_ms:-1: This disables returning unused pages back to the OS, which can otherwise randomly slow down benchmarks.
- tcmalloc:
  - SetProfileSamplingInterval(MAX): Disables heap profile sampling.
  - SetGuardedSamplingInterval(-1): Disables GWP-ASan guarded sampling, which otherwise probabilistically guards allocations to detect buffer overflows and use-after-free.
  - SetBackgroundProcessActionsEnabled(false): Disables background memory-release actions that can cause timing variance.
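As a concrete example for jemalloc, the options above are passed through the MALLOC_CONF environment variable; the benchmark binary name below is a placeholder:

```shell
# Keep unused dirty/muzzy pages resident instead of returning them to
# the OS at unpredictable times (jemalloc's decay-based purging).
export MALLOC_CONF="dirty_decay_ms:-1,muzzy_decay_ms:-1"

# Then run the benchmark in this environment, e.g.:
#   ./my-benchmark   (hypothetical binary name)
echo "$MALLOC_CONF"
```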
Use a custom allocator
In many cases, variance is caused by realloc, which either grows the allocation in place or creates a new allocation and copies the contents of the previous one into it. Whether in-place growing succeeds depends on the OS memory state, making it completely non-deterministic. To fix this, we can always run the slow path that never grows in place.
Here is an example in Rust (adapted from
oxc),
which uses the default implementation of
GlobalAlloc::realloc
that always allocates and then copies the memory.
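A minimal sketch of that pattern (the wrapper type name and the Vec usage are illustrative, not taken from oxc): by only forwarding alloc and dealloc to the system allocator, the default GlobalAlloc::realloc kicks in, which always allocates a fresh block, copies the bytes, and frees the old one.

```rust
use std::alloc::{GlobalAlloc, Layout, System};

// Wrapper allocator that forces `realloc` down the slow path:
// always allocate a new block and copy, never grow in place.
struct DeterministicAlloc;

unsafe impl GlobalAlloc for DeterministicAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `realloc` is deliberately NOT forwarded to `System`: the default
    // trait implementation allocates, copies, then deallocates, so it
    // never depends on whether the OS could grow the block in place.
}

#[global_allocator]
static GLOBAL: DeterministicAlloc = DeterministicAlloc;

fn main() {
    let mut v = Vec::with_capacity(4);
    v.extend_from_slice(&[1, 2, 3, 4]);
    v.push(5); // triggers a realloc through the deterministic slow path
    assert_eq!(v, [1, 2, 3, 4, 5]);
    println!("ok");
}
```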
We’re actively exploring how to implement this in our integrations. If you have further questions, please reach out to us via Discord or email.