cargo-slicer

Rust builds are slow. cargo-slicer makes them fast.

Two complementary techniques work together:

| Technique | What it does | Typical gain |
|---|---|---|
| Virtual Slicer | Stubs unreachable functions at the MIR level so LLVM never sees them | 1.2–1.5× per workspace |
| Warm-Cache Daemon | Pre-compiles registry crates once, serves cached .rlib files on every subsequent build | skips 100% of registry compilation |

You do not need to understand the internals to use them. The all-in-one script runs the full pipeline in one command.

Real-World Results

Verified benchmarks (Apr 2026, host-native, no warm cache)

Both baseline and vslice-cc use identical RUSTFLAGS (-Z threads=8, wild linker). 2–3 runs per mode, 48-core machine.

| Project | Baseline | vslice-cc | Speedup |
|---|---|---|---|
| helix (16 crates) | 68 s | 44 s | 1.55× |
| ripgrep (50K LOC) | 10.5 s | 7 s | 1.50× |
| zed (209 crates) | 1098 s | 767 s | 1.43× |
| zeroclaw (4 crates) | 686 s | 522 s | 1.31× |
| nushell (41 crates) | 103 s | 82 s | 1.26× |

Docker image (with pre-warmed registry cache)

| Project | Baseline | build-slicer | Speedup |
|---|---|---|---|
| zeroclaw (4 crates) | 794 s | 547 s | 1.45× |

Registry-cache speedups (warm-cache daemon alone, verified Apr 2026)

Both baseline and warmed use nightly + -Z threads=8. Interleaved rounds, dispatch pre-warmed, rm -rf target/ before each run.

| Crate | Baseline | Warmed | Speedup |
|---|---|---|---|
| image 0.25 | 4.9 s | 2.1 s | 2.3× |
| syn 2.0 | 1.0 s | 0.66 s | 1.5× |

An earlier version of this table claimed 8.5× for image (40.7 s → 4.8 s) and 1.7× for syn (6.7 s → 4.0 s). Those baselines were measured without -Z threads=8 and the wild linker, while the warmed runs had them — the same apples-to-oranges error as the nushell 5.1×. cargo 0.87.1 (claimed 2.3×) is a regression with fair RUSTFLAGS (dispatch overhead serializes the parallel build).

Requirements

  • Rust stable (source slicing, warmup CLI)
  • Rust nightly (virtual slicer — requires rustc-driver feature)
  • Linux, macOS, or Windows (WSL recommended on Windows)

Getting Started

Docker (quickest start — no installation needed)

Pull the pre-built image and run it against any Rust project:

docker run --rm --cpus=48 \
  -v $(pwd):/workspace/project \
  ghcr.io/yijunyu/cargo-slicer:latest

The image includes all binaries (cargo-slicer-rustc, cargo_warmup_pch, etc.) and a pre-warmed registry cache. --cpus=48 ensures the container uses all available cores. Replace 48 with the output of nproc on your machine. Verified on zeroclaw: 1.45× speedup (794 s → 547 s) vs plain cargo build --release.

First run: the container runs cargo-slicer pre-analyze automatically if no .slicer-cache/ directory is found, then builds with the full 3-layer pipeline.


Install

# Stable binary (source slicing + warmup CLI)
cargo install cargo-slicer

# Nightly driver (virtual slicer — the fast path)
cargo +nightly install cargo-slicer \
  --features rustc-driver \
  --bin cargo-slicer-rustc \
  --bin cargo_slicer_dispatch

If you are building from source:

git clone https://github.com/yijunyu/cargo-slicer
cd cargo-slicer

cargo install --path .
cargo +nightly install --path . --profile release-rustc \
  --features rustc-driver \
  --bin cargo-slicer-rustc \
  --bin cargo_slicer_dispatch

WSL on Windows drives (/mnt/c/, /mnt/d/): prefix every cargo install with SCCACHE_IDLE_TIMEOUT=0 to avoid NTFS permission errors. See Troubleshooting.

Quickstart: one command

Run the full pipeline against your project:

cd your-project
cargo-slicer.sh .

This runs four steps automatically:

  1. Warm the registry cache (cargo-warmup init --tier=1)
  2. Pre-analyze the workspace call graph
  3. Plan the critical compilation path
  4. Build with the three-layer RUSTC_WRAPPER chain

On the first run the warmup step compiles the top-tier registry crates once (~20 seconds for --tier=1, up to ~10 minutes for --tier=3). Every subsequent cold build is served from cache.

Manual setup (step by step)

If you prefer to control each step:

# Step 1: warm the registry cache (one-time, ~20 s for tier 1)
cargo-warmup init --tier=1

# Step 2: pre-analyze the workspace (seconds)
cd your-project
cargo-slicer pre-analyze

# Step 3: build with virtual slicing
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
  RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
  cargo +nightly build --release

Important: never set RUSTC_WRAPPER when building cargo-slicer itself. Unset it before running cargo install --path ..

How It Works

Rust builds are slow for a structural reason: rustc compiles each crate in isolation. When it compiles a library crate it cannot know which of its public functions will be called by downstream code, so it compiles all of them. In a large workspace most of that work is wasted.

cargo-slicer attacks the problem from two angles simultaneously.

The two techniques

Virtual Slicer

The virtual slicer inserts itself as a RUSTC_WRAPPER. Before each crate is compiled it performs a reachability analysis — a BFS starting from the entry point of the binary — and replaces every unreachable function body with an abort() stub. LLVM never sees those functions, so it never optimises, inlines, or emits machine code for them.

For a crate like image, which exposes hundreds of format decoders and pixel converters, a typical application uses one or two formats. The slicer stubs the rest. LLVM's work drops by 97%.

Virtual Slicer details

Warm-Cache Daemon

Registry crates (crates.io dependencies) do not change between builds. The warm-cache daemon pre-compiles them once and stores the resulting .rlib and .rmeta artefacts. On every subsequent build, instead of re-running rustc, the RUSTC_WRAPPER copies the cached artefact into target/ in milliseconds.

The cache key is SHA256(crate_name + version + rustc_version + edition + features + opt_level), so the cached artefact is safe to share across projects and across git branches. A warmed cache built while compiling zed is reused immediately when compiling nushell.

Warm-Cache Daemon details

The three-layer pipeline

When both techniques run together, three wrappers are chained:

cargo build --release
    │
    ▼  RUSTC_WRAPPER = cargo_warmup_dispatch
    │  ├─ registry crate? → serve from cache, return immediately
    │  └─ local crate? → pass to next wrapper
    │
    ▼  CARGO_WARMUP_INNER_WRAPPER = cargo_slicer_dispatch
    │  ├─ no unreachable fns (cache hit)? → pass to real rustc directly
    │  └─ has unreachable fns? → pass to driver
    │
    ▼  CARGO_SLICER_DRIVER = cargo-slicer-rustc
       └─ MIR analysis → stub unreachable fns → LLVM codegen on minimum set

The cargo-slicer.sh script sets up this chain automatically.

The All-in-One Script

Virtual Slicer

The virtual slicer is a RUSTC_WRAPPER that stubs unreachable functions at the MIR level before LLVM sees them. It does not modify your source files or your Cargo.toml.

What "unreachable" means

Starting from the binary's main function, cargo-slicer traces the call graph across all workspace crates. Any function that cannot be reached from main is replaced with an abort() body. LLVM skips compilation, optimisation, and code emission for those functions entirely.

The analysis is conservative: trait impls, generics, async functions, closures, and any function called through a function pointer are always kept.
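The reachability pass can be pictured as a plain breadth-first search over the call graph. A minimal std-only sketch follows — the function names and string-keyed graph are illustrative only; the real driver operates on MIR items, not strings:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first reachability over a call graph, starting from the entry
/// point. Functions NOT in the returned set are candidates for abort() stubs.
fn reachable(calls: &HashMap<&str, Vec<&str>>, entry: &str) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([entry.to_string()]);
    while let Some(f) = queue.pop_front() {
        // insert returns false if already visited, so cycles terminate
        if seen.insert(f.clone()) {
            for callee in calls.get(f.as_str()).into_iter().flatten() {
                queue.push_back(callee.to_string());
            }
        }
    }
    seen
}

fn main() {
    // main -> decode_png -> inflate; decode_jpeg is never called.
    let calls = HashMap::from([
        ("main", vec!["decode_png"]),
        ("decode_png", vec!["inflate"]),
        ("decode_jpeg", vec!["dct"]),
    ]);
    let live = reachable(&calls, "main");
    assert!(live.contains("inflate"));
    assert!(!live.contains("decode_jpeg")); // would be stubbed
}
```

The conservative cases listed above (trait impls, function pointers, and so on) amount to adding extra roots to this search so those functions are always marked live.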

How it plugs into cargo

RUSTC_WRAPPER=cargo_slicer_dispatch  ← stable binary, < 1 ms startup
    │
    └─ local workspace crate?
           CARGO_SLICER_DRIVER=cargo-slicer-rustc  ← nightly driver, ~300 ms startup
               └─ BFS reachability analysis
               └─ MIR stub replacement
               └─ CGU filtering (skip codegen units with only stubs)

The dispatch binary keeps the nightly driver out of the fast path. Registry crates (which change rarely and are cached) never pay the 300 ms driver load.

Cross-crate pre-analysis

For accurate reachability across crate boundaries, run pre-analysis before the build:

cargo-slicer pre-analyze

This uses syn-based parsing to build a call graph across all workspace crates in seconds, writing results to .slicer-cache/. The driver reads these files at build time instead of re-analysing from scratch for every crate.

Without pre-analysis the slicer falls back to conservative per-crate analysis, which still works but produces fewer stubs.

Tuning

| Environment variable | Effect |
|---|---|
| CARGO_SLICER_VIRTUAL=1 | Enable virtual slicing |
| CARGO_SLICER_CODEGEN_FILTER=1 | Skip CGUs that contain only stubs |
| CARGO_SLICER_DEBUG=1 | Write a debug log to .cargo-slicer-debug.log |
| CARGO_SLICER_SKIP_THRESHOLD=auto | Skip driver for crates with no predicted stubs (default) |
| CARGO_SLICER_SKIP_THRESHOLD=0 | Always load the driver for every local crate |

What cannot be stubbed

The slicer never stubs:

  • Trait impl associated functions (vtable entries)
  • Generic functions (monomorphised at the call site)
  • async fn and closures
  • unsafe fn (unless CARGO_SLICER_RELAX_UNSAFE=1)
  • Any function reachable through a function pointer

These constraints are intentional. Stubbing them would either cause linker errors or produce incorrect binaries.

Upstream proposal

The virtual slicing logic has been extracted into a proposed rustc patch behind a -Z dead-fn-elimination flag. If accepted upstream, the install story becomes:

RUSTFLAGS="-Z dead-fn-elimination" cargo +nightly build --release

No extra binary, no nightly ABI compatibility shims.

Warm-Cache Daemon

The warm-cache daemon (also called cargo-warmup) pre-compiles registry crates once and serves the cached .rlib / .rmeta artefacts on every subsequent build. It is the Rust equivalent of a precompiled-header daemon for C/C++.

The insight

Registry crates do not change between your builds. syn, serde, tokio, proc-macro2 — these are compiled identically every time you run cargo clean && cargo build. Compilation caches like sccache help on the second build, but every fresh environment (new developer, CI machine, Docker container) pays the full cost again.

The warm-cache daemon shifts that cost to a one-time investment. Pre-warm the top-tier registry crates once (~20 seconds for tier 1, up to ~10 minutes for tier 3). Every cold build afterwards — in any project that depends on those crates — skips their compilation entirely.

Cache key

The cache key is:

SHA256(crate_name + version + rustc_version + edition + features + opt_level)

-C metadata and -C extra-filename are excluded. These differ per project but do not affect the correctness of the compiled artefact. Excluding them is what enables cross-project sharing: the .rlib compiled while building zed is reused directly when building nushell.
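The key construction can be sketched in a few lines. This is an illustration only: the real daemon uses SHA-256, which std does not provide, so the sketch substitutes std's DefaultHasher — the point is which fields go in and which stay out:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch of the cache-key fields. The real daemon hashes with SHA-256;
/// this sketch uses std's DefaultHasher as a stand-in. Note what is
/// absent: -C metadata and -C extra-filename, which differ per project
/// but do not affect the compiled artefact.
fn cache_key(name: &str, version: &str, rustc: &str,
             edition: &str, features: &[&str], opt_level: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (name, version, rustc, edition, features, opt_level).hash(&mut h);
    h.finish()
}

fn main() {
    // Same inputs from two different projects -> same key -> shared artefact.
    let k_zed = cache_key("syn", "2.0.101", "nightly-2026-04-01", "2021", &["full"], "3");
    let k_nu  = cache_key("syn", "2.0.101", "nightly-2026-04-01", "2021", &["full"], "3");
    assert_eq!(k_zed, k_nu);
    // Any field change (here: features) misses the cache, by design.
    assert_ne!(k_zed, cache_key("syn", "2.0.101", "nightly-2026-04-01", "2021", &[], "3"));
}
```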

Usage

# One-time warm (~20 s for tier 1; every cold build after skips registry compilation)
cargo-warmup init --tier=1

# Check cache status
cargo-warmup status

# Use in builds
RUSTC_WRAPPER=$(which cargo_warmup_dispatch) cargo +nightly build --release

cargo-slicer.sh runs cargo-warmup init --tier=1 automatically on first use.

Tiers

| Tier | Crates included | Warm time |
|---|---|---|
| --tier=1 | proc-macro2, quote, syn, serde, tokio, + 5 more core crates | ~20 s |
| --tier=2 | + 50 most common transitive deps | ~3 min |
| --tier=3 | All crates.io top-500 | ~10 min |

Tier 1 gives the best return on investment for most projects. Tier 3 is useful in CI environments where build time is money.

How it plugs into cargo

RUSTC_WRAPPER=cargo_warmup_dispatch
    │
    ├─ cache hit?  → copy .rlib to target/, return in < 1 ms
    └─ cache miss? → invoke real rustc, store result in cache

The dispatch binary adds less than 1 ms per crate invocation on a cache hit.
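The dispatch decision itself is trivially cheap; a sketch of the hit/miss branch, with a HashMap standing in for the on-disk cache index (types and names are hypothetical, not the daemon's actual API):

```rust
use std::collections::HashMap;

/// Sketch of the warmup-dispatch decision: serve a cached artefact if
/// present, otherwise fall through to the real compiler. The HashMap
/// stands in for the ~/.cargo/warmup-cache index.
enum Action {
    CopyFromCache(String), // hit: copy .rlib into target/ in < 1 ms
    InvokeRustc,           // miss: run rustc, then store the result
}

fn dispatch(cache: &HashMap<u64, String>, key: u64) -> Action {
    match cache.get(&key) {
        Some(rlib) => Action::CopyFromCache(rlib.clone()),
        None => Action::InvokeRustc,
    }
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert(0xC0FFEE_u64, String::from("libsyn-2.0.101.rlib"));
    assert!(matches!(dispatch(&cache, 0xC0FFEE), Action::CopyFromCache(_)));
    assert!(matches!(dispatch(&cache, 0xDEAD), Action::InvokeRustc));
}
```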

Sharing the cache across projects

By default the cache lives in ~/.cargo/warmup-cache/. Any project on the same machine with matching crate versions and rustc toolchain automatically benefits from a warm cache built by any other project.

To inspect what is cached:

cargo-warmup status
# or
sqlite3 ~/.cargo/warmup-cache/index.db \
  'SELECT crate, version, cached_at FROM artefacts ORDER BY cached_at DESC LIMIT 20'

The All-in-One Script

cargo-slicer.sh runs the full four-step pipeline automatically.

cargo-slicer.sh /path/to/your/project
# or, from inside the project:
cargo-slicer.sh .

Pass extra cargo build arguments after the project path:

cargo-slicer.sh . --features my-feature
cargo-slicer.sh . --no-default-features

What it does

Step 0 — Warm the registry cache

cargo-warmup init --tier=1

Skipped if the cache is already warm. On first run this takes ~10–20 seconds for tier-1 (the 10 most common registry crates).

Step 1 — Pre-analyze the workspace call graph

cargo-slicer pre-analyze

Builds a cross-crate call graph using syn-based static analysis. Writes .slicer-cache/*.analysis and .slicer-cache/*.seeds. Takes 0.5 s (ripgrep) to 12 s (zed).

Step 2 — Plan the critical path

cargo-warmup pch-plan

Schedules crate compilation in an order that minimises the critical path, so parallelism is maximised across the three-layer wrapper chain.
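As a mental model — hypothetical, the real planner works on cargo's unit graph — the critical path of a crate DAG is the longest dependency chain weighted by per-crate build time; scheduling the crates on that chain first keeps the remaining cores busy with off-path crates:

```rust
use std::collections::HashMap;

/// Longest-chain build time through the dependency DAG rooted at `root`.
/// Illustrative model only; assumes the graph is acyclic (crate graphs are).
fn critical_path(cost: &HashMap<&str, u32>,
                 deps: &HashMap<&str, Vec<&str>>,
                 root: &str) -> u32 {
    cost[root]
        + deps.get(root).into_iter().flatten()
            .map(|d| critical_path(cost, deps, d))
            .max()
            .unwrap_or(0)
}

fn main() {
    // bin depends on gui and util; gui depends on core.
    let cost = HashMap::from([("bin", 5), ("gui", 10), ("util", 2), ("core", 3)]);
    let deps = HashMap::from([("bin", vec!["gui", "util"]), ("gui", vec!["core"])]);
    // Longest chain: bin(5) -> gui(10) -> core(3) = 18 time units,
    // so the build can never finish faster than 18 regardless of core count.
    assert_eq!(critical_path(&cost, &deps, "bin"), 18);
}
```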

Step 3 — Build with the wrapper chain

RUSTC_WRAPPER=cargo_warmup_dispatch \
CARGO_WARMUP_INNER_WRAPPER=cargo_slicer_dispatch \
CARGO_SLICER_VIRTUAL=1 \
CARGO_SLICER_CODEGEN_FILTER=1 \
CARGO_SLICER_DRIVER=$(which cargo-slicer-rustc) \
  cargo +nightly build --release "$@"

The three-layer chain:

  1. cargo_warmup_dispatch — serves registry crates from cache (< 1 ms each)
  2. cargo_slicer_dispatch — routes local crates to the driver or real rustc
  3. cargo-slicer-rustc — stubs unreachable functions, filters CGUs

Installation

cargo-slicer.sh is installed alongside the binary:

cargo install cargo-slicer
which cargo-slicer.sh   # → ~/.cargo/bin/cargo-slicer.sh

Or, from a source checkout:

./cargo-slicer.sh .     # runs directly from the repo

Usage Reference

Linux / macOS

# Install nightly driver (one-time)
cargo +nightly install cargo-slicer \
  --features rustc-driver \
  --bin cargo-slicer-rustc \
  --bin cargo_slicer_dispatch

# Pre-analyze workspace call graph (seconds)
cargo-slicer pre-analyze

# Build with virtual slicing
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
  RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
  cargo +nightly build --release

WSL on Windows drives (/mnt/c/, /mnt/d/, …)

Same as above but disable sccache to avoid NTFS permission errors:

SCCACHE_IDLE_TIMEOUT=0 cargo +nightly install cargo-slicer \
  --features rustc-driver \
  --bin cargo-slicer-rustc \
  --bin cargo_slicer_dispatch

cargo-slicer pre-analyze

cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
  CARGO_SLICER_SCCACHE=/nonexistent \
  RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
  cargo +nightly build --release

Subcommands

| Subcommand | Description |
|---|---|
| (default) | Source slicing: copy deps, delete unused items |
| build [ARGS] | Slice deps then build with sliced crates |
| pre-analyze [--parser BACKEND] | Cross-crate static analysis for virtual slicing |
| generate [-o DIR] [--delete] | Write a sliced source copy without modifying the original |
| rl-bench [OPTIONS] | Measure compile speedup as RL training KPIs |

Pre-analysis parser backends

cargo-slicer pre-analyze                # syn (default, most accurate)
cargo-slicer pre-analyze --parser fast  # fast tokenizer
cargo-slicer pre-analyze --parser ctags # items only, no call edges

| Backend | Speed | Call edges | Use when |
|---|---|---|---|
| syn | 0.5–12 s | Yes, accurate | Default — best stubs |
| fast | < 1 s | Yes, approximate | Large workspaces, time-sensitive |
| ctags | Fastest | None | Items-only analysis |

Source slicing (stable, no nightly)

cargo-slicer                # slice all deps
cargo-slicer regex          # slice one crate
cargo-slicer --clean        # clean and re-slice
cargo-slicer -O             # fast production mode (skip verification)
cargo-slicer build --release  # slice + build

Optimization levels

| Level | Description |
|---|---|
| -O0 | No deletion — safe baseline |
| -O1 | Delete private functions (with verification) |
| -O2 | Delete all private items + trial deletion |
| -O3 | Graph-guided deletion (default) |
| -O | Fast production — skip verification |

Environment Variables

Core

| Variable | Default | Description |
|---|---|---|
| CARGO_SLICER_VIRTUAL | unset | Set to 1 to enable virtual slicing |
| CARGO_SLICER_CODEGEN_FILTER | unset | Set to 1 to skip CGUs containing only stubs |
| RUSTC_WRAPPER | unset | Set to path of cargo_slicer_dispatch |
| CARGO_SLICER_DRIVER | unset | Set to path of cargo-slicer-rustc |

Cross-crate analysis

| Variable | Default | Description |
|---|---|---|
| CARGO_SLICER_CROSS_CRATE | unset | Set to 1 to enable cross-crate analysis |
| CARGO_SLICER_PARSER | syn | Pre-analysis backend: syn, fast, or ctags |

MIR-precise analysis

| Variable | Default | Description |
|---|---|---|
| CARGO_SLICER_MIR_PRECISE | unset | Set to 1 for MIR-level whole-program analysis |
| CARGO_SLICER_WORKSPACE_CRATES | unset | Comma-separated list of workspace crates to harvest |

Performance tuning

| Variable | Default | Description |
|---|---|---|
| CARGO_SLICER_SKIP_THRESHOLD | auto | Skip driver when predicted stubs < threshold. auto = skip 0-stub crates; 0/never = never skip |
| CARGO_SLICER_DAEMON | unset | Set to 1 to enable fork-server (amortises 300 ms driver load) |
| CARGO_SLICER_SCCACHE | auto | Path to sccache, or /nonexistent to disable |
| CARGO_SLICER_RELAX_UNSAFE | unset | Set to 1 to allow stubbing unsafe fn |

Caching

| Variable | Default | Description |
|---|---|---|
| CARGO_SLICER_CACHE_DIR | .slicer-cache | Directory for incremental cache files |
| CARGO_SLICER_NO_CACHE | unset | Set to 1 to disable caching entirely |

Debugging

| Variable | Default | Description |
|---|---|---|
| CARGO_SLICER_DEBUG | unset | Set to 1 to enable debug logging |
| CARGO_SLICER_DEBUG_LOG | .cargo-slicer-debug.log | Custom path for debug log |
| CARGO_SLICER_MARKED_OUT | unset | Write marked items to a file for inspection |

Troubleshooting

RUSTC_WRAPPER breaks building cargo-slicer itself

Symptom: cargo install --path . fails with mysterious compilation errors.

Cause: RUSTC_WRAPPER=cargo_slicer_dispatch is set in your environment from a previous virtual-slicing session. It intercepts compilation of cargo-slicer's own dependencies.

Fix: Unset it before building cargo-slicer:

unset RUSTC_WRAPPER CARGO_SLICER_VIRTUAL CARGO_SLICER_CODEGEN_FILTER
cargo install --path .

Only set RUSTC_WRAPPER when building your target project.

sccache permission errors on WSL /mnt/ drives

Symptom: failed to set permissions errors during cargo install on /mnt/c/ or /mnt/d/.

Cause: NTFS does not support Unix file permissions. sccache creates files with Unix permissions that NTFS cannot store.

Fix: Build on the native Linux filesystem:

cd ~
git clone https://github.com/yijunyu/cargo-slicer
cd cargo-slicer
SCCACHE_IDLE_TIMEOUT=0 cargo install --path .

Or, if you must stay on the Windows drive, disable sccache entirely:

SCCACHE_IDLE_TIMEOUT=0 cargo install --path .
# and when building your project:
CARGO_SLICER_SCCACHE=/nonexistent RUSTC_WRAPPER=... cargo +nightly build --release

Stale .slicer-cache/ after updating the driver

Symptom: unexpected stub failures or missed stubs after upgrading cargo-slicer.

Fix: Delete the cache:

rm -rf .slicer-cache/

Nightly toolchain mismatch

Symptom: cargo-slicer-rustc crashes at startup with a rustc_private ABI error.

Cause: The driver binary was compiled against a different nightly than the one currently active.

Fix: Rebuild the driver against the active nightly:

rustup update nightly
cargo +nightly install --path . --profile release-rustc \
  --features rustc-driver \
  --bin cargo-slicer-rustc \
  --bin cargo_slicer_dispatch

Build succeeds but no speedup

Likely causes:

  1. Check .cargo-slicer-debug.log for skip-driver markers — all crates skipped means the threshold is too aggressive. Fix: CARGO_SLICER_SKIP_THRESHOLD=0.

  2. The project is a library crate with no binary entry point. The slicer is most effective on binary crates with deep dependency trees.

  3. Pre-analysis was not run. Run cargo-slicer pre-analyze first.

Benchmarks

All numbers are cold builds (after cargo clean) on a 48-core Linux server with nightly Rust.

Virtual slicer — rust-perf standard suite (not yet re-verified)

These single-crate numbers were measured without -Z threads=8 or the wild linker. They have not been re-verified with the current fair-RUSTFLAGS protocol and may overstate speedups (same apples-to-oranges issue as the retracted workspace numbers above).

| Project | Baseline | cargo-slicer | Speedup |
|---|---|---|---|
| image 0.25.6 (lib) | 40,742 ms | 1,461 ms | 27.9× |
| ripgrep 14.1.1 (bin) | 24,094 ms | 5,891 ms | 4.09× |
| cargo 0.87.1 (workspace) | 133,797 ms | 61,922 ms | 2.16× |
| diesel 2.2.10 (lib) | 25,854 ms | 14,339 ms | 1.80× |
| syn 2.0.101 (lib) | 6,711 ms | 4,157 ms | 1.61× |
| serde 1.0.219 (lib) | 3,951 ms | 3,966 ms | 1.00× |

serde is already minimal — almost all of its code is reachable via derive macros. The slicer correctly identifies this.

Virtual slicer — real binary projects

All measurements use identical RUSTFLAGS for both baseline and vslice-cc (-Z threads=8 -C linker=clang -C link-arg=--ld-path=wild). 48-core machine, Apr 2026, 2–3 runs per mode.

| Project | Baseline | vslice-cc | Speedup | Notes |
|---|---|---|---|---|
| helix (16 local crates) | 68 s | 44 s | 1.55× | |
| ripgrep (50K LOC) | 10.5 s | 7 s | 1.50× | |
| zed (209 local crates) | 1098 s | 767 s | 1.43× | 76 driver, 131 skip |
| zeroclaw (4 local crates) | 686 s | 522 s | 1.31× | 3,786 stubs / ~241k mono items (1.6% overall, 4.4% bin) |
| nushell (41 local crates) | 103 s | 82 s | 1.26× | |

Retracted claims: nushell was reported at 5.1× — apples-to-oranges RUSTFLAGS mismatch; honest speedup is 1.26×. cargo-slicer (self) was claimed at 1.74× but re-verified at 1.00× (only 1 driver crate, 0 stubs).

Docker benchmarks (docker run cargo-slicer bench)

Fair comparison inside Docker: same nightly toolchain, cargo fetch before timing (excludes download time), cargo clean between baseline and slicer. Slicer timing includes cargo-slicer pre-analyze overhead.

| Project | Baseline | Slicer | Speedup |
|---|---|---|---|
| zed (209 crates) | 1149 s | 545 s | 2.11× |
| helix (16 crates) | 95 s | 59 s | 1.61× |
| zeroclaw (4 crates) | 842 s | 542 s | 1.55× |
| ripgrep (17 crates) | 15 s | 12 s | 1.31× |
| nushell (41 crates) | 118 s | 94 s | 1.25× |

Docker speedups are higher than bare-metal for large projects (zed 2.11× vs 1.43×) because fewer cores amplify the benefit of eliminating codegen work — less parallelism means each eliminated function saves more wall time.

# Run the benchmark yourself
docker build -t cargo-slicer .
docker run --rm -v /path/to/project:/workspace/project cargo-slicer bench

Warm-cache daemon — verified (Apr 2026)

Both baseline and warmed use nightly + -Z threads=8. Interleaved rounds, dispatch pre-warmed, rm -rf target/ before each run.

| Crate | Baseline | Warmed | Speedup |
|---|---|---|---|
| image 0.25 | 4.9 s | 2.1 s | 2.3× |
| syn 2.0 | 1.0 s | 0.66 s | 1.5× |

An earlier version of this table claimed 8.5× for image (40.7 s → 4.8 s) and 1.7× for syn (6.7 s → 4.0 s). Those baselines were measured without -Z threads=8 and the wild linker, while the warmed runs had them — the same apples-to-oranges error as the nushell 5.1×. cargo 0.87.1 (claimed 2.3×) is a regression with fair RUSTFLAGS: baseline 15 s vs warmed 64 s — dispatch overhead serializes what -Z threads=8 parallelizes across 48 cores.

A warm cache populated by one project is reused across all projects on the same machine.

Upstream -Z dead-fn-elimination patch

| Project | Baseline | -Z dead-fn-elimination | Reduction |
|---|---|---|---|
| zed | 1,790 s | 1,238 s | −31%, 9.2 min saved |
| rustc | 336 s | 176 s | −48%, 2.7 min saved |
| ripgrep | 13 s | 13 s | break-even (all fns reachable) |

C/C++ projects — clang-daemon PCH acceleration

build-accelerate.sh (included in the image) auto-detects C/C++ projects and injects a precompiled header via clang-daemon. The technique eliminates repeated header parsing across parallel compilation units.

Already benchmarked (48-core server, Clang 21, -j48):

| Project | Stars | Files | Baseline | Accelerated | Speedup | Notes |
|---|---|---|---|---|---|---|
| Linux kernel 6.14 | 227k | 26,339 | ~890 s | ~730 s | 1.22× | GCC fallback for asm-heavy files |
| LLVM 20 | | ~2,873 | measured | measured | 1.22× | Clang 21 compiling Clang 20 |
| LLVM 21 | | ~2,873 | measured | measured | 1.24× | Self-hosted build |
| vim | | ~300 | baseline | accelerated | 1.3× | Small project, overhead minimal |
| sqlite3 | | 1 (amalgam) | 20 s | 20.2 s | 1.01× | Single-file; PCH gives nothing |

Predicted speedup for top starred projects (based on file count × header density model):

| Rank | Project | Stars | Lang | Files | LOC | Build | Predicted | Reason |
|---|---|---|---|---|---|---|---|---|
| 1 | Linux | 227k | C | 26,339 | ~20M | Make | 1.2× | ✅ benchmarked |
| 2 | TensorFlow | 195k | C++ | ~650 | ~2.5M | Bazel/CMake | 1.15–1.25× | Heavy STL + proto headers |
| 3 | Godot | 109k | C++ | ~3,500 | ~8.6M | SCons | 1.2–1.3× | Large header graph |
| 4 | Electron | 121k | C++ | (Chromium) | ~25M | ninja | 1.2× | Chromium-scale header reuse |
| 5 | OpenCV | 87k | C++ | ~1,000 | ~600K | CMake | 1.15–1.2× | Dense OpenCV headers |
| 6 | FFmpeg | 58k | C | ~500 | ~1M | autotools | 1.1–1.2× | libav* headers per file |
| 7 | Bitcoin | 89k | C++ | ~500 | ~750K | CMake | 1.1–1.2× | Boost + secp256k1 headers |
| 8 | Netdata | 78k | C | ~700 | ~700K | CMake | 1.1–1.15× | Moderate header depth |
| 9 | Redis | 74k | C | ~250 | ~330K | Make | 1.05–1.1× | Shallow headers, small codebase |
| 10 | Git | 60k | C | ~400 | ~140K | Make | 1.05–1.1× | Minimal headers |
| – | llama.cpp | 102k | C++ | ~150 | ~250K | CMake | 1.05× | Small; GGML headers not dense |
| – | sqlite3 | | C | 1 | ~255K | Make | ≈1× | Amalgamation; no parallelism |

Key insight: speedup scales with (files × header parse fraction). Projects with thousands of files each including the same heavyweight headers (Linux, Godot, TensorFlow, Chromium) get the most benefit. Single-file amalgamations (sqlite3) and projects with shallow headers (Redis, Git) get little to none.
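This insight is essentially Amdahl's law: if PCH eliminates a fraction f of total compile time (the redundant header parsing), the ceiling on overall speedup is 1/(1−f). A sketch, with illustrative fractions that are not measured values:

```rust
/// Amdahl-style ceiling: eliminating a fraction `f` of the total work
/// caps the achievable speedup at 1 / (1 - f).
fn pch_speedup_ceiling(header_fraction: f64) -> f64 {
    assert!(header_fraction >= 0.0 && header_fraction < 1.0);
    1.0 / (1.0 - header_fraction)
}

fn main() {
    // Illustrative only: a header-heavy tree spending ~17% of its time
    // re-parsing the same headers caps at ~1.2x -- the Linux-class range.
    assert!((pch_speedup_ceiling(0.17) - 1.2048).abs() < 1e-3);
    // A shallow-header project (~5%) caps near 1.05x -- the Redis/Git range.
    assert!((pch_speedup_ceiling(0.05) - 1.0526).abs() < 1e-3);
}
```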

To run against any of these projects:

# Clone and accelerate (auto-detects C/C++ via compile_commands.json or Makefile)
git clone https://github.com/torvalds/linux
build-accelerate.sh ./linux

# Or via Docker (mounts your checkout)
docker run --rm --cpus=48 \
  -v $(pwd)/linux:/workspace/project \
  ghcr.io/yijunyu/cargo-slicer:latest

For projects using SCons (Godot) or Bazel (TensorFlow), generate compile_commands.json first:

# Godot
scons compiledb
# TensorFlow (CMake path)
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -B build && cp build/compile_commands.json .

Running benchmarks yourself

# Multi-crate CI benchmark (7 projects, baseline vs vslice-cc, 3 runs each)
./scripts/ci_bench_multicrate.sh

# Individual project
./scripts/bench_fresh_build.sh nushell baseline 3
./scripts/bench_fresh_build.sh nushell vslice-cc 3

# RL training KPI report
cargo-slicer rl-bench --project /tmp/your-project --runs 2

Results are stored in bench-results.db (SQLite).

RL Training KPIs

Code RL systems use compilation success as the reward signal. For Rust projects, compile time is 70–90% of the rollout phase — the main bottleneck of the training loop. cargo-slicer rl-bench translates compile speedup into the KPI language used by MLOps teams.

Usage

# Measure current project (2 cold builds per mode)
cargo-slicer rl-bench

# Custom options
cargo-slicer rl-bench --runs 3 --rollout-fraction 0.85 \
  --gpus 16 --project /tmp/your-project

# Persist to bench-results.db
cargo-slicer rl-bench --db bench-results.db

KPIs reported

KPI 1 — Cold-build throughput (samples/hour)

samples/hour = 3600 / compile_time_seconds

KPI 2 — Incremental feedback latency

Time from a one-line edit to the first cargo check result.

KPI 3 — Compute cost per valid sample

cost = compile_time / pass_rate

KPI 4 — Cluster-hour equivalent

How many RL samples fit in one GPU-cluster-hour at a given rollout fraction.
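The first and third formulas are trivially mechanical; a sketch applying them to the verified nushell timings (the 60% pass rate in the comment is an illustrative assumption, not a measured figure):

```rust
/// KPI 1: cold-build throughput, from the formula above.
fn samples_per_hour(compile_secs: f64) -> f64 {
    3600.0 / compile_secs
}

/// KPI 3: compile time amortised over the fraction of rollouts that
/// compile successfully (cost = compile_time / pass_rate).
fn cost_per_valid_sample(compile_secs: f64, pass_rate: f64) -> f64 {
    compile_secs / pass_rate
}

fn main() {
    // nushell, verified Apr 2026: 103 s baseline -> 82 s with cargo-slicer.
    assert_eq!(samples_per_hour(103.0).floor(), 34.0); // 34 samples/hr
    assert_eq!(samples_per_hour(82.0).floor(), 43.0);  // 43 samples/hr
    // At an assumed 60% pass rate, each valid sample costs 103/0.6 ≈ 172 s
    // of compile time at baseline.
    assert!((cost_per_valid_sample(103.0, 0.6) - 171.67).abs() < 0.01);
}
```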

Example output (nushell, 1.26× speedup)

The numbers below are for nushell, verified Apr 2026 with identical RUSTFLAGS for both modes (-Z threads=8, wild linker). An earlier version of this example claimed 5.1× for nushell; that was an apples-to-oranges comparison where the baseline lacked the parallel frontend and fast linker. The honest speedup is 1.26× (103 s → 82 s).

  KPI 1 — Cold-Build Throughput (samples/hour)
    Baseline   :   103.0s  →      34 samples/hr
    cargo-slicer:   82.0s  →      43 samples/hr  (1.26× faster)

  KPI 2 — Incremental Feedback Latency (cargo check)
    Baseline   :    12.4s  →     290 feedback-loops/hr
    cargo-slicer:    4.1s  →     878 feedback-loops/hr  (3.0× faster)

  Cluster-Hour Equivalent (8 GPUs, 80% rollout fraction)
    Baseline   :     272 samples / cluster-hour
    cargo-slicer:    344 samples / cluster-hour  (1.26× more data)

Persisting results

Results are written to the rl_kpi table in bench-results.db:

SELECT project, baseline_cold_secs, slicer_cold_secs, speedup,
       slicer_throughput_per_hr, ts
FROM rl_kpi ORDER BY ts DESC LIMIT 10;

Blog

A three-part series on why Rust builds are slow and how cargo-slicer closes the gap.

Speeding Up Rust Builds: Part I — The Waiting Game

Part I of III. Part II: The Gap | Part III: Closing the Gap


17 Minutes of Your Life, Gone

Let's talk about Zed.

Zed is a gorgeous code editor written in Rust. Fast. Sleek. Modern. The kind of project that makes you proud to be a Rust developer.

Now try building it from source:

$ time cargo build --release
...
Finished `release` profile in 16m 52s

Seventeen minutes.

You start the build. You check your email. You make coffee. You drink the coffee. You check Reddit. You wonder if you chose the wrong career. The build finishes. You realize you had a typo. You start again.

This isn't a Zed problem. This is a Rust problem. Or rather, a big Rust project problem. Zed has over 500,000 lines of code across 198 workspace crates. That's a lot of Rust for the compiler to chew through.

But surely we can do better, right? The Rust community has been optimizing the compiler for years. Let's try everything.

Attempt 1: Parallel Frontend

Rust nightly has a parallel frontend. More threads, more speed. Simple.

RUSTFLAGS="-Z threads=8" cargo +nightly build --release

Result: the build gets maybe 5-10% faster. Nice, but we're still waiting 15 minutes. The parallel frontend helps with parsing and type checking, but the real time sink is LLVM codegen — and that's already parallelized per codegen unit.

Attempt 2: Faster Linker

Linking takes time. Let's use wild, a fast linker written in Rust:

RUSTFLAGS="-C link-arg=-fuse-ld=wild" cargo +nightly build --release

Result: linking goes from ~8 seconds to ~3 seconds. Great for linking. But linking is less than 1% of the total build time. We've saved 5 seconds out of 1,012. The bottleneck isn't linking.

Attempt 3: Compilation Caching

sccache caches compiled crates, so rebuilds are faster:

RUSTC_WRAPPER=sccache cargo build --release

Result: the second build is blazingly fast. But the first build — a clean, fresh build — is exactly the same. And in CI, every build is a fresh build. Your new developer's first git clone && cargo build? Fresh build. Switching branches with incompatible deps? Fresh build.

Caching doesn't reduce the work. It just remembers it for next time.

Attempt 4: Cranelift Backend

What if we skip LLVM entirely? The Cranelift backend compiles much faster:

RUSTFLAGS="-Z codegen-backend=cranelift" cargo +nightly build

Result: significantly faster compilation. But the output isn't optimized. Cranelift is great for development builds, but for release builds — the ones your users run, the ones CI produces — you want LLVM's optimizations. We need --release to be fast.

Attempt 5: Profile-Guided Optimization of rustc

The Rust project already ships a PGO-optimized compiler. Years of work have gone into making rustc itself faster. The nightly you're using right now benefits from all of that.

And yet, here we are. Seventeen minutes.

The Honest Question

So let me ask you something uncomfortable.

We've tried the parallel frontend. We've tried faster linkers. We've tried caching. We've tried alternative backends. We've tried optimizing the compiler itself.

What if the compiler is already doing its job well? What if the problem isn't how the compiler compiles, but what we're asking it to compile?

Think about it. When you cargo build --release on Zed, the compiler dutifully compiles every public function in every library crate. The regex crate exposes dozens of functions — your project calls maybe three. The serde crate has hundreds of methods — you use a fraction. The compiler doesn't know this. It can't. It's compiling each crate in isolation, and any public function might be called from downstream.

What if a significant chunk of the compiler's work is simply... unnecessary?

What if we could tell the compiler, before it even starts, "hey, you don't need to bother with these 9,000 functions"?

That would be interesting.

To be continued in Part II: The Gap...


This is Part I of a three-part series on cargo-slicer, a tool for speeding up Rust release builds. Part II introduces the "separate compilation gap" and measures just how much work is wasted. Part III shows how to close the gap.

Speeding Up Rust Builds: Part II — The Gap

Part II of III. Part I: The Waiting Game | Part III: Closing the Gap


Previously

In Part I, we tried every known trick to speed up building Zed — parallel frontend, fast linker, caching, alternative backends. Nothing made a dent on that 17-minute clean release build. We ended with a question: what if the problem isn't how the compiler works, but what we're asking it to compile?

Let's find out.

A Library's Dilemma

Consider a library crate — say, serde_json. It exposes a rich API: from_str(), from_slice(), from_reader(), to_string(), to_string_pretty(), to_vec(), to_writer(), and dozens more.

Your project calls serde_json::from_str() and serde_json::to_string(). That's it. Two functions.

But when rustc compiles serde_json, it doesn't know you only need two functions. It can't. The crate boundary is opaque — rustc compiles each crate independently, treating every public function as a potential entry point. It must generate optimized machine code for all of them.

This isn't a bug. It's how separate compilation works. It's a fundamental architectural decision that enables crates to be compiled independently, cached, and reused. It's the right design.

But it has a cost.

The Separate Compilation Gap

We call this cost the separate compilation gap: the difference between what the compiler must compile (everything visible) and what the program actually needs (everything reachable from main).

Formally, for a compilation unit u:

Gap(u) = (|Visible(u)| - |Reachable(u)|) / |Visible(u)|

Where:

  • Visible(u) = all symbols the compiler processes (every public function, every impl, every trait method)
  • Reachable(u) = the subset actually reachable from main() via whole-program call graph analysis

If Gap = 0%, the compiler is doing exactly the right amount of work. If Gap = 50%, half the compiler's effort is wasted.
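The formula is simple enough to sketch directly. The numbers below are hypothetical, chosen only to illustrate the computation:

```python
def gap(visible: int, reachable: int) -> float:
    """Separate compilation gap: wasted fraction of the visible symbols."""
    return (visible - reachable) / visible

# Hypothetical unit: 10,000 visible symbols, 6,300 reachable from main()
print(f"{gap(10_000, 6_300):.0%}")  # → 37%
```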

Measuring Zed's Gap

So what's Zed's gap?

We built a tool that does whole-program reachability analysis across all 198 workspace crates. Starting from main(), it traces every function call, every trait method invocation, every generic instantiation, and marks what's actually needed.

Then we count: how many CPU instructions does the compiler execute with everything vs. only the reachable code?

|                    | Total         | Reachable     | Gap |
|--------------------|---------------|---------------|-----|
| CPU instructions   | 28,559 Ginstr | 18,067 Ginstr | 37% |
| Functions analyzed | 32,579        | 23,095        | 29% |

37% of the CPU instructions the compiler executes when building Zed are spent compiling code that no one will ever call.

Let that sink in. More than a third of the compiler's work is wasted. That's not a rounding error. That's not a micro-optimization waiting to happen. That's 10 trillion CPU instructions, burned for nothing, on every clean build.

And this isn't just Zed:

| Project  | LOC  | Instructions (Base) | Instructions (Reachable) | Gap  |
|----------|------|---------------------|--------------------------|------|
| zed      | 500K | 28,559 Ginstr       | 18,067 Ginstr            | 37%  |
| rustc    | 600K | 5,746 Ginstr        | 4,268 Ginstr             | 26%  |
| zeroclaw | 86K  | 1,507 Ginstr        | 1,314 Ginstr             | 13%  |
| helix    | 100K | 2,256 Ginstr        | 2,004 Ginstr             | 11%  |
| ripgrep  | 50K  | 314 Ginstr          | 298 Ginstr               | 5%   |
| nushell  | 200K | 3,695 Ginstr        | 3,682 Ginstr             | 0.4% |
| bevy     | 300K | 3,807 Ginstr        | 3,791 Ginstr             | 0.4% |

Some projects have tiny gaps. Bevy and nushell use almost everything they import — good for them. But Zed has a 37% gap, rustc has 26%, and even zeroclaw (a smaller project) wastes 13%.

Why Some Gaps Are Bigger

The gap depends on how a project uses its dependencies.

Large gap projects like Zed have many library crates with broad APIs, but the binary only touches a fraction. Zed pulls in hundreds of crates for its editor, terminal, collaboration, and AI features. Each crate is compiled in full, even though Zed's binary only uses specific code paths.

Small gap projects like bevy use their dependencies more thoroughly. A game engine that imports a math library probably uses most of the math functions. There's less waste.

There's also an interesting amplification effect. In Rust, generics are monomorphized — each generic function gets compiled once per concrete type it's used with. When you stub an unreachable function, you also eliminate all its downstream monomorphizations. That's why Zed's instruction gap (37%) is larger than its function gap (29%) — each stubbed function cascades into many eliminated monomorphizations.
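A toy model makes the amplification concrete. Here each function is assigned a (made-up) count of monomorphized instances; stubbing one unreachable generic function removes all of its instances, so the instruction-weighted gap exceeds the function-count gap:

```python
# Hypothetical crate: each function -> number of monomorphized instances
# (a generic compiled for several concrete types counts several times).
instances = {
    "parse": 1, "render": 1,               # reachable, non-generic
    "serialize<T>": 6,                     # reachable generic, 6 types
    "legacy_api": 1, "debug_dump<T>": 12,  # unreachable; generic cascades
}
reachable = {"parse", "render", "serialize<T>"}

dead = [f for f in instances if f not in reachable]
function_gap = len(dead) / len(instances)                       # 2/5  = 40%
instruction_gap = sum(instances[f] for f in dead) / sum(instances.values())
                                                                # 13/21 ≈ 62%
```

Same direction as Zed's real numbers: stubbing 2 of 5 functions (40%) eliminates 13 of 21 instances (~62%), because the dead generic drags its whole instantiation subtree with it.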

The Honest Assessment

Here's the uncomfortable truth: the Rust compiler isn't slow. It's doing too much work.

And it's doing too much work because separate compilation — the very architecture that makes Cargo fast for incremental builds and enables the crates ecosystem — prevents the compiler from knowing what's actually needed.

Link-Time Optimization (LTO) can eliminate dead code after compilation, but it doesn't reduce the compilation phase itself. The work has already been done.

What we need is something that works before compilation. Something that tells the compiler, at the crate boundary, "here's exactly which public functions are actually called from downstream — you can skip the rest."

So How Do We Close the Gap?

We know the gap exists. We can measure it precisely. For Zed, 37% of the compiler's work is provably unnecessary.

The question is: can we build a tool that, operating purely as a RUSTC_WRAPPER with no compiler modifications, identifies unreachable functions and eliminates them before LLVM codegen?

And can we do it without breaking anything?

To be continued in Part III: Closing the Gap...


This is Part II of a three-part series on cargo-slicer. Part I set up the problem. Part III provides the solution.

Speeding Up Rust Builds: Part III — Closing the Gap

Part III of III. Part I: The Waiting Game | Part II: The Gap


Previously

In Part I, we saw that building Zed takes 17 minutes and no existing optimization really helps. In Part II, we discovered why: 37% of the compiler's work is spent on unreachable code — the "separate compilation gap."

Now let's close it.

The Approach: Four Steps

The idea is simple in principle: figure out what's reachable from main() across all crates, then tell the compiler to skip everything else. In practice, there are a few details to get right.

We call this approach PRECC (Predictive Precompilation Cutting), and it works in four phases:

Step 1: Extract

Before compilation starts, we scan all workspace crate sources and build a unified cross-crate call graph. For Rust, we use a syn-based parser that extracts function definitions, call sites, and public API surfaces from every .rs file. This takes a few seconds, even for large projects.

cargo-slicer pre-analyze    # builds the cross-crate call graph

For Zed, this produces a graph covering all 198 workspace crates: which functions exist, which functions call which, and which are publicly exported.

Step 2: Analyze

Starting from main(), we run a BFS (breadth-first search) through the call graph. Every function reachable from main is marked as "needed." Everything else is marked as "unreachable."

We're careful about special cases. Drop implementations? Always needed (the compiler inserts drop calls implicitly). Trait implementations? Always needed (dynamic dispatch via dyn Trait can call them). #[no_mangle] FFI functions? Always needed. Closures, async functions, unsafe functions? Always needed. We maintain 9 categories of exclusions to be safe.

The result: a precise set of functions that can be safely eliminated from each crate.
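The analysis above can be sketched as a plain BFS where the safety exclusions (Drop impls, trait methods, `#[no_mangle]` functions, and so on) are simply treated as extra roots. The graph and names below are illustrative, not the tool's actual data model:

```python
from collections import deque

def reachable_functions(call_graph, roots):
    """BFS over the call graph; anything never visited is a stub candidate."""
    seen, queue = set(roots), deque(roots)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Toy graph. Drop impls / dyn Trait methods / #[no_mangle] fns are added
# as roots so they always survive, mirroring the safety exclusions.
graph = {
    "main": ["app::run"],
    "app::run": ["serde_json::from_str"],
    "serde_json::from_str": [],
    "serde_json::to_writer": [],        # never called -> stub candidate
    "<Buf as Drop>::drop": [],          # implicit call site -> kept as root
}
needed = reachable_functions(graph, ["main", "<Buf as Drop>::drop"])
stubbable = set(graph) - needed         # {"serde_json::to_writer"}
```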

Step 3: Predict

Here's where it gets interesting. Naively, you'd think "just cut everything unreachable." But our analysis has overhead — loading the driver, traversing MIR, doing cache I/O. For crates with very few unreachable functions, this overhead exceeds the savings.

We learned this the hard way. Applying cutting to every crate in bevy slows the build by 4.4%. The gap is only 0.4%, and the analysis overhead eats the tiny savings.

So for each crate, we predict: will cutting save more time than the analysis costs? If yes, cut. If no, skip — compile it normally.

Our baseline heuristic is simple:

  • If the predicted number of stubbable functions is 0: skip.
  • If it's less than 5 AND the stub ratio is under 2%: skip.
  • Otherwise: cut.

This gets us 92-100% precision on projects that benefit, and correctly skips projects where cutting would hurt.
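In code, the baseline heuristic is only a few lines. The function name and threshold parameters are illustrative; the threshold values (0, 5, 2%) come from the text:

```python
def should_cut(stubbable: int, total_fns: int,
               min_count: int = 5, min_ratio: float = 0.02) -> bool:
    """Cut/skip decision: cut only when the savings plausibly beat
    the fixed analysis overhead."""
    if stubbable == 0:
        return False                                  # nothing to gain
    if stubbable < min_count and stubbable / total_fns < min_ratio:
        return False                                  # savings too small
    return True

should_cut(0, 1_000)    # False: no stubbable functions
should_cut(3, 1_000)    # False: 3 < 5 and ratio 0.3% < 2%
should_cut(3, 100)      # True:  only 3 functions, but ratio 3% >= 2%
should_cut(400, 1_000)  # True:  clearly worth cutting
```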

Step 4: Cut

For crates marked "cut," we intercept the compiler. Operating as a RUSTC_WRAPPER, we hook into rustc after type checking and replace unreachable function bodies with MIR-level abort stubs. The function signature remains (so downstream crates can still reference it), but the body is replaced with a single abort() instruction.

When rustc's monomorphization collector encounters a stubbed function, it finds no callees — no downstream functions, no generic instantiations, nothing to compile. The entire subtree of the mono graph is pruned. LLVM never sees it.

No source code is modified. No Cargo.toml changes. No feature flags. The compiler simply does less work.

# The full command
cargo-slicer pre-analyze
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
  RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
  cargo +nightly build --release

Or, even simpler:

cargo-slicer.sh /path/to/your/project

The Results

So how much faster is Zed?

| Project  | Baseline | With PRECC | Wall-clock | Instructions | Peak Memory |
|----------|----------|------------|------------|--------------|-------------|
| zed      | 1,012 s  | 719 s      | -29%       | -37%         | -45%        |
| rustc    | 135.8 s  | 112.4 s    | -17%       | -26%         | -7%         |
| zeroclaw | 192.9 s  | 170.4 s    | -12%       | -13%         | -11%        |
| helix    | 71.2 s   | 66.6 s     | -6%        | -11%         | -18%        |
| ripgrep  | 11.1 s   | 10.7 s     | -4%        | -5%          | -6%         |
| nushell  | 106.5 s  | 108.9 s    | +2.3%      | -0.4%        | —           |
| bevy     | 81.8 s   | 85.4 s     | +4.4%      | -0.4%        | —           |

Zed's build drops from 17 minutes to 12 minutes. That's 5 minutes saved on every clean build. 45% less memory. 37% fewer CPU instructions.

The Rust compiler itself builds 17% faster. Helix, 6%. Ripgrep, 4%.

And look at the last two rows. Nushell and bevy have tiny gaps (0.4%), so the prediction step correctly identifies them as not worth cutting. Without prediction, bevy would be 4.4% slower — the overhead exceeds the savings. With prediction, we avoid that regression entirely.

The Honest Part

Let me be upfront about what this tool doesn't do:

  • Incremental builds: cargo-slicer targets fresh/clean builds. For incremental cargo check and small changes, rustc's built-in incremental compilation is already fast. We're solving the CI/fresh-build problem.
  • Small projects: if your project is 5,000 lines with 3 dependencies, the gap is tiny and the overhead isn't worth it. This tool shines on larger codebases (50K+ LOC).
  • Correctness caveat: we replace function bodies with abort stubs. If our reachability analysis is wrong and a "stubbed" function gets called at runtime, the program will abort. In practice, our 9 safety exclusion categories prevent this — we've tested on all benchmark projects — but it's worth knowing.
  • Nightly only: the MIR-level hooks require unstable rustc APIs, so a nightly toolchain is required.

Try It

Install with one command:

curl -fsSL https://raw.githubusercontent.com/yijunyu/cargo-slicer/main/install.sh | bash

Then build any Rust project:

cargo-slicer.sh /path/to/your/project

That's it. No config files, no source changes, no Cargo.toml edits. Point it at any Rust project with a Cargo.toml and see what happens.

What We'd Love to Hear

We're researchers, not fortune tellers. The benchmark numbers above are from our test machine (48-core, 128 GB RAM, Linux). Your mileage will vary depending on your project's dependency structure, your hardware, and the phase of the moon.

We genuinely want to know how this works on your project. Does it speed things up? Does it break something? Is the gap large or small? Every data point helps us improve.

Reach out:

  • GitHub Issues: github.com/yijunyu/cargo-slicer/issues — bug reports, benchmark results, feature requests
  • Email: yijun.yu@open.ac.uk — for detailed results, collaboration, or just to say hello

We're particularly interested in projects with 10+ workspace crates and heavy dependency usage — that's where the gap tends to be largest.

The Bigger Picture

The separate compilation gap isn't unique to Rust. We've also applied the same principle to C projects — splitting SQLite's monolithic 256K-line sqlite3.c into 2,503 independent compilation units, achieving a 5.8x speedup via parallelism.

The gap is a property of separate compilation itself, not of any particular language or compiler. Wherever a compiler processes code in isolation without knowing what's actually needed, there's potential waste.

And wherever there's waste, there's opportunity.


This concludes our three-part series on speeding up Rust builds with cargo-slicer.

Thanks for reading. Now go build something — a little faster.


Links: