cargo-slicer
Rust builds are slow. cargo-slicer makes them fast.
Two complementary techniques work together:
| Technique | What it does | Typical gain |
|---|---|---|
| Virtual Slicer | Stubs unreachable functions at the MIR level so LLVM never sees them | 1.2–1.5× per workspace |
| Warm-Cache Daemon | Pre-compiles registry crates once, serves cached .rlib files on every subsequent build | skips 100% of registry compilation |
You do not need to understand the internals to use them. The all-in-one script runs the full pipeline in one command.
Real-World Results
Verified benchmarks (Apr 2026, host-native, no warm cache)
Both baseline and vslice-cc use identical RUSTFLAGS (-Z threads=8, wild linker).
2–3 runs per mode, 48-core machine.
| Project | Baseline | vslice-cc | Speedup |
|---|---|---|---|
| helix (16 crates) | 68 s | 44 s | 1.55× |
| ripgrep (50K LOC) | 10.5 s | 7 s | 1.50× |
| zed (209 crates) | 1098 s | 767 s | 1.43× |
| zeroclaw (4 crates) | 686 s | 522 s | 1.31× |
| nushell (41 crates) | 103 s | 82 s | 1.26× |
Docker image (with pre-warmed registry cache)
| Project | Baseline | build-slicer | Speedup |
|---|---|---|---|
| zeroclaw (4 crates) | 794 s | 547 s | 1.45× |
Registry-cache speedups (warm-cache daemon alone, verified Apr 2026)
Both baseline and warmed use nightly + -Z threads=8. Interleaved rounds,
dispatch pre-warmed, rm -rf target/ before each run.
| Crate | Baseline | Warmed | Speedup |
|---|---|---|---|
| image 0.25 | 4.9 s | 2.1 s | 2.3× |
| syn 2.0 | 1.0 s | 0.66 s | 1.5× |
An earlier version of this table claimed 8.5× for image (40.7 s → 4.8 s) and 1.7× for syn (6.7 s → 4.0 s). Those baselines were measured without `-Z threads=8` and the wild linker, while the warmed runs had them — the same apples-to-oranges error as the nushell 5.1×. cargo 0.87.1 (claimed 2.3×) is a regression with fair RUSTFLAGS (dispatch overhead serializes the parallel build).
Requirements
- Rust stable (source slicing, warmup CLI)
- Rust nightly (virtual slicer — requires the `rustc-driver` feature)
- Linux, macOS, or Windows (WSL recommended on Windows)
Getting Started
Docker (quickest start — no installation needed)
Pull the pre-built image and run it against any Rust project:
docker run --rm --cpus=48 \
-v $(pwd):/workspace/project \
ghcr.io/yijunyu/cargo-slicer:latest
The image includes all binaries (cargo-slicer-rustc, cargo_warmup_pch, etc.) and a
pre-warmed registry cache. --cpus=48 ensures the container uses all available cores.
Replace 48 with the output of nproc on your machine. Verified on zeroclaw:
1.45× speedup (794 s → 547 s) vs plain cargo build --release.
First run: the container runs `cargo-slicer pre-analyze` automatically if no `.slicer-cache/` directory is found, then builds with the full 3-layer pipeline.
Install
# Stable binary (source slicing + warmup CLI)
cargo install cargo-slicer
# Nightly driver (virtual slicer — the fast path)
cargo +nightly install cargo-slicer \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
If you are building from source:
git clone https://github.com/yijunyu/cargo-slicer
cd cargo-slicer
cargo install --path .
cargo +nightly install --path . --profile release-rustc \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
WSL on Windows drives (`/mnt/c/`, `/mnt/d/`): prefix every `cargo install` with `SCCACHE_IDLE_TIMEOUT=0` to avoid NTFS permission errors. See Troubleshooting.
Quickstart: one command
Run the full pipeline against your project:
cd your-project
cargo-slicer.sh .
This runs four steps automatically:
1. Warm the registry cache (`cargo-warmup init --tier=1`)
2. Pre-analyze the workspace call graph
3. Plan the critical compilation path
4. Build with the three-layer `RUSTC_WRAPPER` chain
On the first run the warmup step takes ~10–20 seconds for tier-1 (it compiles the most common registry crates once; higher tiers take up to ~10 minutes). Every subsequent cold build is served from cache.
Manual setup (step by step)
If you prefer to control each step:
# Step 1: warm the registry cache (one-time, ~20 s for tier-1)
cargo-warmup init --tier=1
# Step 2: pre-analyze the workspace (seconds)
cd your-project
cargo-slicer pre-analyze
# Step 3: build with virtual slicing
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
Important: never set `RUSTC_WRAPPER` when building cargo-slicer itself. Unset it before running `cargo install --path .`.
How It Works
Rust builds are slow for a structural reason: rustc compiles each crate in
isolation. When it compiles a library crate it cannot know which of its public
functions will be called by downstream code, so it compiles all of them. In a
large workspace most of that work is wasted.
cargo-slicer attacks the problem from two angles simultaneously.
The two techniques
Virtual Slicer
The virtual slicer inserts itself as a RUSTC_WRAPPER. Before each crate is
compiled it performs a reachability analysis — a BFS starting from the entry
point of the binary — and replaces every unreachable function body with an
abort() stub. LLVM never sees those functions, so it never optimises,
inlines, or emits machine code for them.
For a crate like image, which exposes hundreds of format decoders and pixel
converters, a typical application uses one or two formats. The slicer stubs the
rest. LLVM's work drops by 97%.
Warm-Cache Daemon
Registry crates (crates.io dependencies) do not change between builds. The
warm-cache daemon pre-compiles them once and stores the resulting .rlib and
.rmeta artefacts. On every subsequent build, instead of re-running rustc,
the RUSTC_WRAPPER copies the cached artefact into target/ in milliseconds.
The cache key is SHA256(crate + version + rustc_version + features + opt_level),
so the cached artefact is safe to share across projects and across git branches.
A warmed cache built while compiling zed is reused immediately when compiling
nushell.
The three-layer pipeline
When both techniques run together, three wrappers are chained:
cargo build --release
│
▼ RUSTC_WRAPPER = cargo_warmup_dispatch
│ ├─ registry crate? → serve from cache, return immediately
│ └─ local crate? → pass to next wrapper
│
▼ CARGO_WARMUP_INNER_WRAPPER = cargo_slicer_dispatch
│ ├─ no unreachable fns (cache hit)? → pass to real rustc directly
│ └─ has unreachable fns? → pass to driver
│
▼ CARGO_SLICER_DRIVER = cargo-slicer-rustc
└─ MIR analysis → stub unreachable fns → LLVM codegen on minimum set
The cargo-slicer.sh script sets up this chain automatically.
Virtual Slicer
The virtual slicer is a RUSTC_WRAPPER that stubs unreachable functions at the
MIR level before LLVM sees them. It does not modify your source files or your
Cargo.toml.
What "unreachable" means
Starting from the binary's main function, cargo-slicer traces the call graph
across all workspace crates. Any function that cannot be reached from main is
replaced with an abort() body. LLVM skips compilation, optimisation, and code
emission for those functions entirely.
The analysis is conservative: trait impls, generics, async functions, closures, and any function called through a function pointer are always kept.
How it plugs into cargo
RUSTC_WRAPPER=cargo_slicer_dispatch ← stable binary, < 1 ms startup
│
└─ local workspace crate?
CARGO_SLICER_DRIVER=cargo-slicer-rustc ← nightly driver, ~300 ms startup
└─ BFS reachability analysis
└─ MIR stub replacement
└─ CGU filtering (skip codegen units with only stubs)
The dispatch binary keeps the nightly driver out of the fast path. Registry crates (which change rarely and are cached) never pay the 300 ms driver load.
Cross-crate pre-analysis
For accurate reachability across crate boundaries, run pre-analysis before the build:
cargo-slicer pre-analyze
This uses syn-based parsing to build a call graph across all workspace crates
in seconds, writing results to .slicer-cache/. The driver reads these files at
build time instead of re-analysing from scratch for every crate.
Without pre-analysis the slicer falls back to conservative per-crate analysis, which still works but produces fewer stubs.
Tuning
| Environment variable | Effect |
|---|---|
| `CARGO_SLICER_VIRTUAL=1` | Enable virtual slicing |
| `CARGO_SLICER_CODEGEN_FILTER=1` | Skip CGUs that contain only stubs |
| `CARGO_SLICER_DEBUG=1` | Write a debug log to `.cargo-slicer-debug.log` |
| `CARGO_SLICER_SKIP_THRESHOLD=auto` | Skip driver for crates with no predicted stubs (default) |
| `CARGO_SLICER_SKIP_THRESHOLD=0` | Always load the driver for every local crate |
What cannot be stubbed
The slicer never stubs:
- Trait impl associated functions (vtable entries)
- Generic functions (monomorphised at the call site)
- `async fn` and closures
- `unsafe fn` (unless `CARGO_SLICER_RELAX_UNSAFE=1`)
- Any function reachable through a function pointer
These constraints are intentional. Stubbing them would either cause linker errors or produce incorrect binaries.
Upstream proposal
The virtual slicing logic has been extracted into a proposed rustc patch behind
a -Z dead-fn-elimination flag. If accepted upstream, the install story becomes:
RUSTFLAGS="-Z dead-fn-elimination" cargo +nightly build --release
No extra binary, no nightly ABI compatibility shims.
Warm-Cache Daemon
The warm-cache daemon (also called cargo-warmup) pre-compiles registry
crates once and serves the cached .rlib / .rmeta artefacts on every
subsequent build. It is the Rust equivalent of a precompiled-header daemon for
C/C++.
The insight
Registry crates do not change between your builds. syn, serde, tokio,
proc-macro2 — these are compiled identically every time you run cargo clean && cargo build. Compilation caches like sccache help on the second build,
but every fresh environment (new developer, CI machine, Docker container) pays
the full cost again.
The warm-cache daemon shifts that cost to a one-time investment. Pre-warm the registry crates once (~20 seconds for tier-1, up to ~10 minutes for tier-3). Every cold build afterwards — in any project that depends on those crates — skips their compilation entirely.
Cache key
The cache key is:
SHA256(crate_name + version + rustc_version + edition + features + opt_level)
-C metadata and -C extra-filename are excluded. These differ per project but
do not affect the correctness of the compiled artefact. Excluding them is what
enables cross-project sharing: the .rlib compiled while building zed is
reused directly when building nushell.
Usage
# One-time warm (~20 s for tier-1; skips those crates on every cold build after)
cargo-warmup init --tier=1
# Check cache status
cargo-warmup status
# Use in builds
RUSTC_WRAPPER=$(which cargo_warmup_dispatch) cargo +nightly build --release
cargo-slicer.sh runs cargo-warmup init --tier=1 automatically on first use.
Tiers
| Tier | Crates included | Warm time |
|---|---|---|
| `--tier=1` | proc-macro2, quote, syn, serde, tokio, + 5 more core crates | ~20 s |
| `--tier=2` | + 50 most common transitive deps | ~3 min |
| `--tier=3` | All crates.io top-500 | ~10 min |
Tier 1 gives the best return on investment for most projects. Tier 3 is useful in CI environments where build time is money.
How it plugs into cargo
RUSTC_WRAPPER=cargo_warmup_dispatch
│
├─ cache hit? → copy .rlib to target/, return in < 1 ms
└─ cache miss? → invoke real rustc, store result in cache
The dispatch binary adds less than 1 ms per crate invocation on a cache hit.
Sharing the cache across projects
By default the cache lives in ~/.cargo/warmup-cache/. Any project on the same
machine with matching crate versions and rustc toolchain automatically benefits
from a warm cache built by any other project.
To inspect what is cached:
cargo-warmup status
# or
sqlite3 ~/.cargo/warmup-cache/index.db \
'SELECT crate, version, cached_at FROM artefacts ORDER BY cached_at DESC LIMIT 20'
The All-in-One Script
cargo-slicer.sh runs the full four-step pipeline automatically.
cargo-slicer.sh /path/to/your/project
# or, from inside the project:
cargo-slicer.sh .
Pass extra cargo build arguments after the project path:
cargo-slicer.sh . --features my-feature
cargo-slicer.sh . --no-default-features
What it does
Step 0 — Warm the registry cache
cargo-warmup init --tier=1
Skipped if the cache is already warm. On first run this takes ~10–20 seconds for tier-1 (the 10 most common registry crates).
Step 1 — Pre-analyze the workspace call graph
cargo-slicer pre-analyze
Builds a cross-crate call graph using syn-based static analysis. Writes
.slicer-cache/*.analysis and .slicer-cache/*.seeds. Takes 0.5 s (ripgrep)
to 12 s (zed).
Step 2 — Plan the critical path
cargo-warmup pch-plan
Schedules crate compilation in an order that minimises the critical path, so parallelism is maximised across the three-layer wrapper chain.
Step 3 — Build with the wrapper chain
RUSTC_WRAPPER=cargo_warmup_dispatch \
CARGO_WARMUP_INNER_WRAPPER=cargo_slicer_dispatch \
CARGO_SLICER_VIRTUAL=1 \
CARGO_SLICER_CODEGEN_FILTER=1 \
CARGO_SLICER_DRIVER=$(which cargo-slicer-rustc) \
cargo +nightly build --release "$@"
The three-layer chain:
1. `cargo_warmup_dispatch` — serves registry crates from cache (< 1 ms each)
2. `cargo_slicer_dispatch` — routes local crates to the driver or real rustc
3. `cargo-slicer-rustc` — stubs unreachable functions, filters CGUs
Installation
cargo-slicer.sh is installed alongside the binary:
cargo install cargo-slicer
which cargo-slicer.sh # → ~/.cargo/bin/cargo-slicer.sh
Or, from a source checkout:
./cargo-slicer.sh . # runs directly from the repo
Usage Reference
Virtual Slicing (recommended)
Linux / macOS
# Install nightly driver (one-time)
cargo +nightly install cargo-slicer \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
# Pre-analyze workspace call graph (seconds)
cargo-slicer pre-analyze
# Build with virtual slicing
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
WSL on Windows drives (/mnt/c/, /mnt/d/, …)
Same as above but disable sccache to avoid NTFS permission errors:
SCCACHE_IDLE_TIMEOUT=0 cargo +nightly install cargo-slicer \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
cargo-slicer pre-analyze
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
CARGO_SLICER_SCCACHE=/nonexistent \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
Subcommands
| Subcommand | Description |
|---|---|
| (default) | Source slicing: copy deps, delete unused items |
| `build [ARGS]` | Slice deps then build with sliced crates |
| `pre-analyze [--parser BACKEND]` | Cross-crate static analysis for virtual slicing |
| `generate [-o DIR] [--delete]` | Write a sliced source copy without modifying the original |
| `rl-bench [OPTIONS]` | Measure compile speedup as RL training KPIs |
Pre-analysis parser backends
cargo-slicer pre-analyze # syn (default, most accurate)
cargo-slicer pre-analyze --parser fast # fast tokenizer
cargo-slicer pre-analyze --parser ctags # items only, no call edges
| Backend | Speed | Call edges | Use when |
|---|---|---|---|
| `syn` | 0.5–12 s | Yes, accurate | Default — best stubs |
| `fast` | < 1 s | Yes, approximate | Large workspaces, time-sensitive |
| `ctags` | Fastest | None | Items-only analysis |
Source slicing (stable, no nightly)
cargo-slicer # slice all deps
cargo-slicer regex # slice one crate
cargo-slicer --clean # clean and re-slice
cargo-slicer -O # fast production mode (skip verification)
cargo-slicer build --release # slice + build
Optimization levels
| Level | Description |
|---|---|
| `-O0` | No deletion — safe baseline |
| `-O1` | Delete private functions (with verification) |
| `-O2` | Delete all private items + trial deletion |
| `-O3` | Graph-guided deletion (default) |
| `-O` | Fast production — skip verification |
Environment Variables
Core
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_VIRTUAL` | unset | Set to 1 to enable virtual slicing |
| `CARGO_SLICER_CODEGEN_FILTER` | unset | Set to 1 to skip CGUs containing only stubs |
| `RUSTC_WRAPPER` | unset | Set to path of `cargo_slicer_dispatch` |
| `CARGO_SLICER_DRIVER` | unset | Set to path of `cargo-slicer-rustc` |
Cross-crate analysis
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_CROSS_CRATE` | unset | Set to 1 to enable cross-crate analysis |
| `CARGO_SLICER_PARSER` | `syn` | Pre-analysis backend: `syn`, `fast`, or `ctags` |
MIR-precise analysis
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_MIR_PRECISE` | unset | Set to 1 for MIR-level whole-program analysis |
| `CARGO_SLICER_WORKSPACE_CRATES` | unset | Comma-separated list of workspace crates to harvest |
Performance tuning
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_SKIP_THRESHOLD` | `auto` | Skip driver when predicted stubs < threshold. `auto` = skip 0-stub crates; `0`/`never` = never skip |
| `CARGO_SLICER_DAEMON` | unset | Set to 1 to enable fork-server (amortises 300 ms driver load) |
| `CARGO_SLICER_SCCACHE` | auto | Path to sccache, or `/nonexistent` to disable |
| `CARGO_SLICER_RELAX_UNSAFE` | unset | Set to 1 to allow stubbing `unsafe fn` |
Caching
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_CACHE_DIR` | `.slicer-cache` | Directory for incremental cache files |
| `CARGO_SLICER_NO_CACHE` | unset | Set to 1 to disable caching entirely |
Debugging
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_DEBUG` | unset | Set to 1 to enable debug logging |
| `CARGO_SLICER_DEBUG_LOG` | `.cargo-slicer-debug.log` | Custom path for debug log |
| `CARGO_SLICER_MARKED_OUT` | unset | Write marked items to a file for inspection |
Troubleshooting
RUSTC_WRAPPER breaks building cargo-slicer itself
Symptom: cargo install --path . fails with mysterious compilation errors.
Cause: RUSTC_WRAPPER=cargo_slicer_dispatch is set in your environment from
a previous virtual-slicing session. It intercepts compilation of cargo-slicer's
own dependencies.
Fix: Unset it before building cargo-slicer:
unset RUSTC_WRAPPER CARGO_SLICER_VIRTUAL CARGO_SLICER_CODEGEN_FILTER
cargo install --path .
Only set RUSTC_WRAPPER when building your target project.
sccache permission errors on WSL /mnt/ drives
Symptom: failed to set permissions errors during cargo install on /mnt/c/ or /mnt/d/.
Cause: NTFS does not support Unix file permissions. sccache creates files with Unix permissions that NTFS cannot store.
Fix: Build on the native Linux filesystem:
cd ~
git clone https://github.com/yijunyu/cargo-slicer
cd cargo-slicer
SCCACHE_IDLE_TIMEOUT=0 cargo install --path .
Or, if you must stay on the Windows drive, disable sccache entirely:
SCCACHE_IDLE_TIMEOUT=0 cargo install --path .
# and when building your project:
CARGO_SLICER_SCCACHE=/nonexistent RUSTC_WRAPPER=... cargo +nightly build --release
Stale .slicer-cache/ after updating the driver
Symptom: unexpected stub failures or missed stubs after upgrading cargo-slicer.
Fix: Delete the cache:
rm -rf .slicer-cache/
Nightly toolchain mismatch
Symptom: cargo-slicer-rustc crashes at startup with a rustc_private ABI error.
Cause: The driver binary was compiled against a different nightly than the one currently active.
Fix: Rebuild the driver against the active nightly:
rustup update nightly
cargo +nightly install --path . --profile release-rustc \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
Build succeeds but no speedup
Likely causes:
- Check `.cargo-slicer-debug.log` for `skip-driver` markers — all crates skipped means the threshold is too aggressive. Fix: `CARGO_SLICER_SKIP_THRESHOLD=0`.
- The project is a library crate with no binary entry point. The slicer is most effective on binary crates with deep dependency trees.
- Pre-analysis was not run. Run `cargo-slicer pre-analyze` first.
Benchmarks
All numbers are cold builds (after cargo clean) on a 48-core Linux server
with nightly Rust.
Virtual slicer — rust-perf standard suite (not yet re-verified)
These single-crate numbers were measured without -Z threads=8 or the wild
linker. They have not been re-verified with the current fair-RUSTFLAGS
protocol and may overstate speedups (same apples-to-oranges issue as the
retracted workspace numbers above).
| Project | Baseline | cargo-slicer | Speedup |
|---|---|---|---|
| image 0.25.6 (lib) | 40,742 ms | 1,461 ms | 27.9× |
| ripgrep 14.1.1 (bin) | 24,094 ms | 5,891 ms | 4.09× |
| cargo 0.87.1 (workspace) | 133,797 ms | 61,922 ms | 2.16× |
| diesel 2.2.10 (lib) | 25,854 ms | 14,339 ms | 1.80× |
| syn 2.0.101 (lib) | 6,711 ms | 4,157 ms | 1.61× |
| serde 1.0.219 (lib) | 3,951 ms | 3,966 ms | 1.00× |
serde is already minimal — almost all of its code is reachable via derive
macros. The slicer correctly identifies this.
Virtual slicer — real binary projects
All measurements use identical RUSTFLAGS for both baseline and vslice-cc
(-Z threads=8 -C linker=clang -C link-arg=--ld-path=wild). 48-core machine,
Apr 2026, 2–3 runs per mode.
| Project | Baseline | vslice-cc | Speedup | Notes |
|---|---|---|---|---|
| helix (16 local crates) | 68 s | 44 s | 1.55× | |
| ripgrep (50K LOC) | 10.5 s | 7 s | 1.50× | |
| zed (209 local crates) | 1098 s | 767 s | 1.43× | 76 driver, 131 skip |
| zeroclaw (4 local crates) | 686 s | 522 s | 1.31× | 3,786 stubs / ~241k mono items (1.6% overall, 4.4% bin) |
| nushell (41 local crates) | 103 s | 82 s | 1.26× |
Retracted claims: nushell was reported at 5.1× — apples-to-oranges RUSTFLAGS mismatch; honest speedup is 1.26×. cargo-slicer (self) was claimed at 1.74× but re-verified at 1.00× (only 1 driver crate, 0 stubs).
Docker benchmarks (docker run cargo-slicer bench)
Fair comparison inside Docker: same nightly toolchain, cargo fetch before
timing (excludes download time), cargo clean between baseline and slicer.
Slicer timing includes cargo-slicer pre-analyze overhead.
| Project | Baseline | Slicer | Speedup |
|---|---|---|---|
| zed (209 crates) | 1149 s | 545 s | 2.11× |
| helix (16 crates) | 95 s | 59 s | 1.61× |
| zeroclaw (4 crates) | 842 s | 542 s | 1.55× |
| ripgrep (17 crates) | 15 s | 12 s | 1.31× |
| nushell (41 crates) | 118 s | 94 s | 1.25× |
Docker speedups are higher than bare-metal for large projects (zed 2.11× vs 1.43×) because fewer cores amplify the benefit of eliminating codegen work — less parallelism means each eliminated function saves more wall time.
# Run the benchmark yourself
docker build -t cargo-slicer .
docker run --rm -v /path/to/project:/workspace/project cargo-slicer bench
Warm-cache daemon — verified (Apr 2026)
Both baseline and warmed use nightly + -Z threads=8. Interleaved rounds,
dispatch pre-warmed, rm -rf target/ before each run.
| Crate | Baseline | Warmed | Speedup |
|---|---|---|---|
| image 0.25 | 4.9 s | 2.1 s | 2.3× |
| syn 2.0 | 1.0 s | 0.66 s | 1.5× |
An earlier version of this table claimed 8.5× for image (40.7 s → 4.8 s) and 1.7× for syn (6.7 s → 4.0 s). Those baselines were measured without `-Z threads=8` and the wild linker, while the warmed runs had them — the same apples-to-oranges error as the nushell 5.1×. cargo 0.87.1 (claimed 2.3×) is a regression with fair RUSTFLAGS: baseline 15 s vs warmed 64 s — dispatch overhead serializes what `-Z threads=8` parallelizes across 48 cores.
A warm cache populated by one project is reused across all projects on the same machine.
Upstream -Z dead-fn-elimination patch
| Project | Baseline | -Z dead-fn-elimination | Reduction |
|---|---|---|---|
| zed | 1,790 s | 1,238 s | −31%, 9.2 min saved |
| rustc | 336 s | 176 s | −48%, 2.7 min saved |
| ripgrep | 13 s | 13 s | break-even (all fns reachable) |
C/C++ projects — clang-daemon PCH acceleration
build-accelerate.sh (included in the image) auto-detects C/C++ projects and
injects a precompiled header via clang-daemon. The technique eliminates
repeated header parsing across parallel compilation units.
Already benchmarked (48-core server, Clang 21, -j48):
| Project | Stars | Files | Baseline | Accelerated | Speedup | Notes |
|---|---|---|---|---|---|---|
| Linux kernel 6.14 | 227k | 26,339 | ~890 s | ~730 s | 1.22× | GCC fallback for asm-heavy files |
| LLVM 20 | — | ~2,873 | measured | measured | 1.22× | Clang 21 compiling Clang 20 |
| LLVM 21 | — | ~2,873 | measured | measured | 1.24× | Self-hosted build |
| vim | — | ~300 | baseline | accelerated | 1.3× | Small project, overhead minimal |
| sqlite3 | — | 1 (amalgam) | 20 s | 20.2 s | 1.01× | Single-file; PCH gives nothing |
Predicted speedup for top starred projects (based on file count × header density model):
| Rank | Project | Stars | Lang | Files | LOC | Build | Predicted | Reason |
|---|---|---|---|---|---|---|---|---|
| 1 | Linux | 227k | C | 26,339 | ~20M | Make | 1.2× ✅ benchmarked | |
| 2 | TensorFlow | 195k | C++ | ~650 | ~2.5M | Bazel/CMake | 1.15–1.25× | Heavy STL + proto headers |
| 3 | Godot | 109k | C++ | ~3,500 | ~8.6M | SCons | 1.2–1.3× | Large header graph |
| 4 | Electron | 121k | C++ | (Chromium) | ~25M | ninja | 1.2× | Chromium-scale header reuse |
| 5 | OpenCV | 87k | C++ | ~1,000 | ~600K | CMake | 1.15–1.2× | Dense OpenCV headers |
| 6 | FFmpeg | 58k | C | ~500 | ~1M | autotools | 1.1–1.2× | libav* headers per file |
| 7 | Bitcoin | 89k | C++ | ~500 | ~750K | CMake | 1.1–1.2× | Boost + secp256k1 headers |
| 8 | Netdata | 78k | C | ~700 | ~700K | CMake | 1.1–1.15× | Moderate header depth |
| 9 | Redis | 74k | C | ~250 | ~330K | Make | 1.05–1.1× | Shallow headers, small codebase |
| 10 | Git | 60k | C | ~400 | ~140K | Make | 1.05–1.1× | Minimal headers |
| — | llama.cpp | 102k | C++ | ~150 | ~250K | CMake | 1.05× | Small; GGML headers not dense |
| — | sqlite3 | — | C | 1 | ~255K | Make | ≈1× | Amalgamation; no parallelism |
Key insight: speedup scales with (files × header parse fraction). Projects with thousands of files each including the same heavyweight headers (Linux, Godot, TensorFlow, Chromium) get the most benefit. Single-file amalgamations (sqlite3) and projects with shallow headers (Redis, Git) get little to none.
To run against any of these projects:
# Clone and accelerate (auto-detects C/C++ via compile_commands.json or Makefile)
git clone https://github.com/torvalds/linux
build-accelerate.sh ./linux
# Or via Docker (mounts your checkout)
docker run --rm --cpus=48 \
-v $(pwd)/linux:/workspace/project \
ghcr.io/yijunyu/cargo-slicer:latest
For projects using SCons (Godot) or Bazel (TensorFlow), generate `compile_commands.json` first:
# Godot
scons compiledb
# TensorFlow (CMake path)
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -B build && cp build/compile_commands.json .
Running benchmarks yourself
# Multi-crate CI benchmark (7 projects, baseline vs vslice-cc, 3 runs each)
./scripts/ci_bench_multicrate.sh
# Individual project
./scripts/bench_fresh_build.sh nushell baseline 3
./scripts/bench_fresh_build.sh nushell vslice-cc 3
# RL training KPI report
cargo-slicer rl-bench --project /tmp/your-project --runs 2
Results are stored in bench-results.db (SQLite).
RL Training KPIs
Code RL systems use compilation success as the reward signal. For Rust projects,
compile time is 70–90% of the rollout phase — the main bottleneck of the
training loop. cargo-slicer rl-bench translates compile speedup into the KPI
language used by MLOps teams.
Usage
# Measure current project (2 cold builds per mode)
cargo-slicer rl-bench
# Custom options
cargo-slicer rl-bench --runs 3 --rollout-fraction 0.85 \
--gpus 16 --project /tmp/your-project
# Persist to bench-results.db
cargo-slicer rl-bench --db bench-results.db
KPIs reported
KPI 1 — Cold-build throughput (samples/hour)
samples/hour = 3600 / compile_time_seconds
KPI 2 — Incremental feedback latency
Time from a one-line edit to the first cargo check result.
KPI 3 — Compute cost per valid sample
cost = compile_time / pass_rate
KPI 4 — Cluster-hour equivalent
How many RL samples fit in one GPU-cluster-hour at a given rollout fraction.
Example output (nushell, 1.26× speedup)
Numbers below are nushell — verified Apr 2026 with identical RUSTFLAGS for
both modes (-Z threads=8, wild linker). An earlier version of this example
claimed 5.1× for nushell; that was an apples-to-oranges comparison where the
baseline lacked the parallel frontend and fast linker. The honest speedup is
1.26× (103 s → 82 s).
KPI 1 — Cold-Build Throughput (samples/hour)
Baseline : 103.0s → 34 samples/hr
cargo-slicer: 82.0s → 43 samples/hr (1.26× faster)
KPI 2 — Incremental Feedback Latency (cargo check)
Baseline : 12.4s → 290 feedback-loops/hr
cargo-slicer: 4.1s → 878 feedback-loops/hr (3.0× faster)
Cluster-Hour Equivalent (8 GPUs, 80% rollout fraction)
Baseline : 272 samples / cluster-hour
cargo-slicer: 344 samples / cluster-hour (1.26× more data)
Persisting results
Results are written to the rl_kpi table in bench-results.db:
SELECT project, baseline_cold_secs, slicer_cold_secs, speedup,
slicer_throughput_per_hr, ts
FROM rl_kpi ORDER BY ts DESC LIMIT 10;
Blog
A three-part series on why Rust builds are slow and how cargo-slicer closes the gap.
- Part I: The Waiting Game — why the usual tricks don't work
- Part II: The Gap — measuring exactly how much work is wasted
- Part III: Closing the Gap — how the virtual slicer works
Speeding Up Rust Builds: Part I — The Waiting Game
Part I of III. Part II: The Gap | Part III: Closing the Gap
17 Minutes of Your Life, Gone
Let's talk about Zed.
Zed is a gorgeous code editor written in Rust. Fast. Sleek. Modern. The kind of project that makes you proud to be a Rust developer.
Now try building it from source:
$ time cargo build --release
...
Finished `release` profile in 16m 52s
Seventeen minutes.
You start the build. You check your email. You make coffee. You drink the coffee. You check Reddit. You wonder if you chose the wrong career. The build finishes. You realize you had a typo. You start again.
This isn't a Zed problem. This is a Rust problem. Or rather, a big Rust project problem. Zed has over 500,000 lines of code across 198 workspace crates. That's a lot of Rust for the compiler to chew through.
But surely we can do better, right? The Rust community has been optimizing the compiler for years. Let's try everything.
Attempt 1: Parallel Frontend
Rust nightly has a parallel frontend. More threads, more speed. Simple.
RUSTFLAGS="-Z threads=8" cargo +nightly build --release
Result: the build gets maybe 5-10% faster. Nice, but we're still waiting 15 minutes. The parallel frontend helps with parsing and type checking, but the real time sink is LLVM codegen — and that's already parallelized per codegen unit.
Attempt 2: Faster Linker
Linking takes time. Let's use wild, a fast linker written in Rust:
RUSTFLAGS="-C link-arg=-fuse-ld=wild" cargo +nightly build --release
Result: linking goes from ~8 seconds to ~3 seconds. Great for linking. But linking is less than 1% of the total build time. We've saved 5 seconds out of 1,012. The bottleneck isn't linking.
Attempt 3: Compilation Caching
sccache caches compiled crates, so rebuilds are faster:
RUSTC_WRAPPER=sccache cargo build --release
Result: the second build is blazingly fast. But the first build — a clean, fresh build — is exactly the same. And in CI, every build is a fresh build. Your new developer's first git clone && cargo build? Fresh build. Switching branches with incompatible deps? Fresh build.
Caching doesn't reduce the work. It just remembers it for next time.
Attempt 4: Cranelift Backend
What if we skip LLVM entirely? The Cranelift backend compiles much faster:
RUSTFLAGS="-Z codegen-backend=cranelift" cargo +nightly build
Result: significantly faster compilation. But the output isn't optimized. Cranelift is great for development builds, but for release builds — the ones your users run, the ones CI produces — you want LLVM's optimizations. We need --release to be fast.
Attempt 5: Profile-Guided Optimization of rustc
The Rust project already ships a PGO-optimized compiler. Years of work have gone into making rustc itself faster. The nightly you're using right now benefits from all of that.
And yet, here we are. Seventeen minutes.
The Honest Question
So let me ask you something uncomfortable.
We've tried the parallel frontend. We've tried faster linkers. We've tried caching. We've tried alternative backends. We've tried optimizing the compiler itself.
What if the compiler is already doing its job well? What if the problem isn't how the compiler compiles, but what we're asking it to compile?
Think about it. When you cargo build --release on Zed, the compiler dutifully compiles every public function in every library crate. The regex crate exposes dozens of functions — your project calls maybe three. The serde crate has hundreds of methods — you use a fraction. The compiler doesn't know this. It can't. It's compiling each crate in isolation, and any public function might be called from downstream.
What if a significant chunk of the compiler's work is simply... unnecessary?
What if we could tell the compiler, before it even starts, "hey, you don't need to bother with these 9,000 functions"?
That would be interesting.
To be continued in Part II: The Gap...
This is Part I of a three-part series on cargo-slicer, a tool for speeding up Rust release builds. Part II introduces the "separate compilation gap" and measures just how much work is wasted. Part III shows how to close the gap.
Speeding Up Rust Builds: Part II — The Gap
Part II of III. Part I: The Waiting Game | Part III: Closing the Gap
Previously
In Part I, we tried every known trick to speed up building Zed — parallel frontend, fast linker, caching, alternative backends. Nothing made a dent on that 17-minute clean release build. We ended with a question: what if the problem isn't how the compiler works, but what we're asking it to compile?
Let's find out.
A Library's Dilemma
Consider a library crate — say, serde_json. It exposes a rich API: from_str(), from_slice(), from_reader(), to_string(), to_string_pretty(), to_vec(), to_writer(), and dozens more.
Your project calls serde_json::from_str() and serde_json::to_string(). That's it. Two functions.
But when rustc compiles serde_json, it doesn't know you only need two functions. It can't. The crate boundary is opaque — rustc compiles each crate independently, treating every public function as a potential entry point. It must generate optimized machine code for all of them.
This isn't a bug. It's how separate compilation works. It's a fundamental architectural decision that enables crates to be compiled independently, cached, and reused. It's the right design.
But it has a cost.
The Separate Compilation Gap
We call this cost the separate compilation gap: the difference between what the compiler must compile (everything visible) and what the program actually needs (everything reachable from main).
Formally, for a compilation unit u:
Gap(u) = (|Visible(u)| - |Reachable(u)|) / |Visible(u)|
Where:
- Visible(u) = all symbols the compiler processes (every public function, every impl, every trait method)
- Reachable(u) = the subset actually reachable from main() via whole-program call graph analysis
If Gap = 0%, the compiler is doing exactly the right amount of work. If Gap = 50%, half the compiler's effort is wasted.
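In code, the gap is a one-line ratio. A minimal sketch, plugging in Zed's measured instruction counts reported in the next section:

```rust
/// Fraction of the compiler's visible work that is unreachable from main().
/// Gap(u) = (|Visible(u)| - |Reachable(u)|) / |Visible(u)|
fn gap(visible: f64, reachable: f64) -> f64 {
    (visible - reachable) / visible
}

fn main() {
    // Zed's counts, in Ginstr (giga-instructions): 28,559 total vs 18,067 reachable.
    let g = gap(28_559.0, 18_067.0);
    println!("{:.0}%", g * 100.0); // prints 37%
}
```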
Measuring Zed's Gap
So what's Zed's gap?
We built a tool that does whole-program reachability analysis across all 198 workspace crates. Starting from main(), it traces every function call, every trait method invocation, every generic instantiation, and marks what's actually needed.
Then we count: how many CPU instructions does the compiler execute with everything vs. only the reachable code?
| | Total | Reachable | Gap |
|---|---|---|---|
| CPU instructions | 28,559 Ginstr | 18,067 Ginstr | 37% |
| Functions analyzed | 32,579 | 23,095 | 29% |
37% of the CPU instructions the compiler executes when building Zed are spent compiling code that no one will ever call.
Let that sink in. More than a third of the compiler's work is wasted. That's not a rounding error. That's not a micro-optimization waiting to happen. That's 10 trillion CPU instructions, burned for nothing, on every clean build.
And this isn't just Zed:
| Project | LOC | Instructions (Base) | Instructions (Reachable) | Gap |
|---|---|---|---|---|
| zed | 500K | 28,559 Ginstr | 18,067 Ginstr | 37% |
| rustc | 600K | 5,746 Ginstr | 4,268 Ginstr | 26% |
| zeroclaw | 86K | 1,507 Ginstr | 1,314 Ginstr | 13% |
| helix | 100K | 2,256 Ginstr | 2,004 Ginstr | 11% |
| ripgrep | 50K | 314 Ginstr | 298 Ginstr | 5% |
| nushell | 200K | 3,695 Ginstr | 3,682 Ginstr | 0.4% |
| bevy | 300K | 3,807 Ginstr | 3,791 Ginstr | 0.4% |
Some projects have tiny gaps. Bevy and nushell use almost everything they import — good for them. But Zed has a 37% gap, rustc has 26%, and even zeroclaw (a smaller project) wastes 13%.
Why Some Gaps Are Bigger
The gap depends on how a project uses its dependencies.
Large-gap projects like Zed have many library crates with broad APIs, but the binary only touches a fraction. Zed pulls in hundreds of crates for its editor, terminal, collaboration, and AI features. Each crate is compiled in full, even though Zed's binary only exercises specific code paths.
Small-gap projects like bevy use their dependencies more thoroughly. A game engine that imports a math library probably uses most of the math functions. There's less waste.
There's also an interesting amplification effect. In Rust, generics are monomorphized — each generic function gets compiled once per concrete type it's used with. When you stub an unreachable function, you also eliminate all its downstream monomorphizations. That's why Zed's instruction gap (37%) is larger than its function gap (29%) — each stubbed function cascades into many eliminated monomorphizations.
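The cascade is easy to see in miniature. In the self-contained illustration below (the function is made up), each concrete type `largest` is called with gets its own compiled copy; if the only caller were stubbed, every copy would vanish from the mono graph:

```rust
// Each concrete T produces a separate compiled copy of `largest`
// (monomorphization). Note: panics on an empty slice — illustration only.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut max = items[0];
    for &x in &items[1..] {
        if x > max {
            max = x;
        }
    }
    max
}

fn main() {
    // Two instantiations are generated here: largest::<i32> and largest::<f64>.
    println!("{}", largest(&[3, 7, 2]));  // prints 7
    println!("{}", largest(&[1.5, 0.2])); // prints 1.5
    // If the sole caller of `largest` were stubbed, neither instantiation
    // would ever reach LLVM — the whole subtree is pruned.
}
```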
The Honest Assessment
Here's the uncomfortable truth: the Rust compiler isn't slow. It's doing too much work.
And it's doing too much work because separate compilation — the very architecture that makes Cargo fast for incremental builds and enables the crates ecosystem — prevents the compiler from knowing what's actually needed.
Link-Time Optimization (LTO) can eliminate dead code after compilation, but it doesn't reduce the compilation phase itself. The work has already been done.
What we need is something that works before compilation. Something that tells the compiler, at the crate boundary, "here's exactly which public functions are actually called from downstream — you can skip the rest."
So How Do We Close the Gap?
We know the gap exists. We can measure it precisely. For Zed, 37% of the compiler's work is provably unnecessary.
The question is: can we build a tool that, operating purely as a RUSTC_WRAPPER with no compiler modifications, identifies unreachable functions and eliminates them before LLVM codegen?
And can we do it without breaking anything?
To be continued in Part III: Closing the Gap...
This is Part II of a three-part series on cargo-slicer. Part I set up the problem. Part III provides the solution.
Speeding Up Rust Builds: Part III — Closing the Gap
Part III of III. Part I: The Waiting Game | Part II: The Gap
Previously
In Part I, we saw that building Zed takes 17 minutes and no existing optimization really helps. In Part II, we discovered why: 37% of the compiler's work is spent on unreachable code — the "separate compilation gap."
Now let's close it.
The Approach: Four Steps
The idea is simple in principle: figure out what's reachable from main() across all crates, then tell the compiler to skip everything else. In practice, there are a few details to get right.
We call this approach PRECC (Predictive Precompilation Cutting), and it works in four phases:
Step 1: Extract
Before compilation starts, we scan all workspace crate sources and build a unified cross-crate call graph. For Rust, we use a syn-based parser that extracts function definitions, call sites, and public API surfaces from every .rs file. This takes a few seconds, even for large projects.
cargo-slicer pre-analyze # builds the cross-crate call graph
For Zed, this produces a graph covering all 198 workspace crates: which functions exist, which functions call which, and which are publicly exported.
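To make Step 1 concrete, here is a deliberately toy extractor — a stand-in sketch, not the real syn-based parser — that pulls function names out of Rust source text with plain string matching:

```rust
/// Toy stand-in for the syn-based extractor: find `fn` definitions in a
/// source string. (The real tool parses full Rust syntax; this one just
/// matches line prefixes and ignores generics, macros, and comments.)
fn extract_fn_names(src: &str) -> Vec<String> {
    let mut names = Vec::new();
    for line in src.lines() {
        let line = line.trim_start();
        // Strip common visibility qualifiers like `pub ` / `pub(crate) `.
        let rest = line
            .strip_prefix("pub ")
            .or_else(|| line.strip_prefix("pub(crate) "))
            .unwrap_or(line);
        if let Some(after) = rest.strip_prefix("fn ") {
            if let Some(end) = after.find('(') {
                names.push(after[..end].to_string());
            }
        }
    }
    names
}

fn main() {
    let src = "pub fn from_str(s: &str) {}\nfn helper() {}\n";
    assert_eq!(extract_fn_names(src), vec!["from_str", "helper"]);
}
```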
Step 2: Analyze
Starting from main(), we run a BFS (breadth-first search) through the call graph. Every function reachable from main is marked as "needed." Everything else is marked as "unreachable."
We're careful about special cases. Drop implementations? Always needed (the compiler inserts drop calls implicitly). Trait implementations? Always needed (dynamic dispatch via dyn Trait can call them). #[no_mangle] FFI functions? Always needed. Closures, async functions, unsafe functions? Always needed. We maintain 9 categories of exclusions to be safe.
The result: a precise set of functions that can be safely eliminated from each crate.
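The core of Step 2 is a plain BFS over the call graph. A minimal sketch with a hypothetical graph (the crate and function names are invented for illustration):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// BFS over a call graph: everything reachable from `root` is "needed";
/// everything else is a candidate for stubbing.
fn reachable(graph: &HashMap<&str, Vec<&str>>, root: &str) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue = VecDeque::from([root.to_string()]);
    while let Some(f) = queue.pop_front() {
        if !seen.insert(f.clone()) {
            continue; // already visited
        }
        for &callee in graph.get(f.as_str()).into_iter().flatten() {
            queue.push_back(callee.to_string());
        }
    }
    seen
}

fn main() {
    // Hypothetical cross-crate call graph.
    let graph = HashMap::from([
        ("main", vec!["serde_json::from_str", "app::render"]),
        ("serde_json::from_str", vec!["serde_json::parse"]),
        ("serde_json::to_writer", vec![]), // public, but never called
    ]);
    let needed = reachable(&graph, "main");
    assert!(needed.contains("serde_json::parse"));
    assert!(!needed.contains("serde_json::to_writer")); // stub candidate
    println!("reachable functions: {}", needed.len()); // prints 4
}
```

The real analysis adds the safety exclusions listed above (Drop impls, trait impls, `#[no_mangle]`, etc.) as extra BFS roots, so they are always marked needed.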
Step 3: Predict
Here's where it gets interesting. Naively, you'd think "just cut everything unreachable." But our analysis has overhead — loading the driver, traversing MIR, doing cache I/O. For crates with very few unreachable functions, this overhead exceeds the savings.
We learned this the hard way. Applying cutting to every crate in bevy slows the build by 4.4%. The gap is only 0.4%, and the analysis overhead eats the tiny savings.
So for each crate, we predict: will cutting save more time than the analysis costs? If yes, cut. If no, skip — compile it normally.
Our baseline heuristic is simple:
- If the predicted number of stubbable functions is 0: skip.
- If it's less than 5 AND the stub ratio is under 2%: skip.
- Otherwise: cut.
This gets us 92-100% precision on projects that benefit, and correctly skips projects where cutting would hurt.
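The heuristic fits in a few lines. A sketch with the thresholds from the list above (the function name and signature are ours, not the tool's):

```rust
/// Baseline skip/cut heuristic from Step 3: cut only when the predicted
/// savings plausibly exceed the analysis overhead.
fn should_cut(stubbable: usize, total_fns: usize) -> bool {
    if stubbable == 0 {
        return false; // nothing to cut
    }
    let stub_ratio = stubbable as f64 / total_fns as f64;
    // Fewer than 5 stub candidates AND under 2% of the crate: overhead wins.
    !(stubbable < 5 && stub_ratio < 0.02)
}

fn main() {
    assert!(!should_cut(0, 1_000));  // skip: no candidates
    assert!(!should_cut(3, 1_000));  // skip: 3 fns at 0.3% — overhead exceeds savings
    assert!(should_cut(4, 100));     // cut: only 4 fns, but 4% of the crate
    assert!(should_cut(900, 3_000)); // cut: large gap
}
```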
Step 4: Cut
For crates marked "cut," we intercept the compiler. Operating as a RUSTC_WRAPPER, we hook into rustc after type checking and replace unreachable function bodies with MIR-level abort stubs. The function signature remains (so downstream crates can still reference it), but the body is replaced with a single abort() instruction.
When rustc's monomorphization collector encounters a stubbed function, it finds no callees — no downstream functions, no generic instantiations, nothing to compile. The entire subtree of the mono graph is pruned. LLVM never sees it.
No source code is modified. No Cargo.toml changes. No feature flags. The compiler simply does less work.
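For intuition, here is what a stubbed function conceptually looks like if you imagine it projected back into source code — purely illustrative, since the real replacement happens at the MIR level and no source file is touched:

```rust
use std::process;

// Conceptual view of a stub (hypothetical function; the actual rewrite is
// done in MIR by the wrapper, never in source). The signature is preserved
// so downstream crates still resolve the symbol; only the body is replaced.
pub fn unreachable_api(input: &str) -> usize {
    let _ = input;
    process::abort() // single abort instruction; no callees, nothing to monomorphize
}

fn main() {
    // The stub is never called — that is what the reachability analysis
    // guarantees — but it remains referenceable, e.g. as a fn pointer.
    let f: fn(&str) -> usize = unreachable_api;
    println!("stub compiled at {:p}, never invoked", f);
}
```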
# The full command
cargo-slicer pre-analyze
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
Or, even simpler:
cargo-slicer.sh /path/to/your/project
The Results
So how much faster is Zed?
| Project | Baseline | With PRECC | Wall-clock | Instructions | Peak Memory |
|---|---|---|---|---|---|
| zed | 1,012s | 719s | -29% | -37% | -45% |
| rustc | 135.8s | 112.4s | -17% | -26% | -7% |
| zeroclaw | 192.9s | 170.4s | -12% | -13% | -11% |
| helix | 71.2s | 66.6s | -6% | -11% | -18% |
| ripgrep | 11.1s | 10.7s | -4% | -5% | -6% |
| nushell | 106.5s | 108.9s | +2.3% | -0.4% | — |
| bevy | 81.8s | 85.4s | +4.4% | -0.4% | — |
Zed's build drops from 17 minutes to 12 minutes. That's 5 minutes saved on every clean build. 45% less memory. 37% fewer CPU instructions.
The Rust compiler itself builds 17% faster. Helix, 6%. Ripgrep, 4%.
And look at the last two rows. Nushell and bevy have tiny gaps (0.4%), so the prediction step correctly identifies them as not worth cutting. Without prediction, bevy would be 4.4% slower — the overhead exceeds the savings. With prediction, we avoid that regression entirely.
The Honest Part
Let me be upfront about what this tool doesn't do:
- Incremental builds: cargo-slicer targets fresh/clean builds. For incremental cargo check and small changes, rustc's built-in incremental compilation is already fast. We're solving the CI/fresh-build problem.
- Small projects: if your project is 5,000 lines with 3 dependencies, the gap is tiny and the overhead isn't worth it. This tool shines on larger codebases (50K+ LOC).
- Correctness guarantee: we replace function bodies with abort stubs. If our reachability analysis is wrong and a "stubbed" function gets called at runtime, the program will abort. In practice, our 9 safety exclusion categories prevent this — we've tested on all benchmark projects — but it's worth knowing.
- Nightly only: the MIR-level hooks require unstable rustc APIs, so a nightly toolchain is required.
Try It
Install with one command:
curl -fsSL https://raw.githubusercontent.com/yijunyu/cargo-slicer/main/install.sh | bash
Then build any Rust project:
cargo-slicer.sh /path/to/your/project
That's it. No config files, no source changes, no Cargo.toml edits. Point it at any Rust project with a Cargo.toml and see what happens.
What We'd Love to Hear
We're researchers, not fortune tellers. The benchmark numbers above are from our test machine (48-core, 128 GB RAM, Linux). Your mileage will vary depending on your project's dependency structure, your hardware, and the phase of the moon.
We genuinely want to know how this works on your project. Does it speed things up? Does it break something? Is the gap large or small? Every data point helps us improve.
Reach out:
- GitHub Issues: github.com/yijunyu/cargo-slicer/issues — bug reports, benchmark results, feature requests
- Email: yijun.yu@open.ac.uk — for detailed results, collaboration, or just to say hello
We're particularly interested in projects with 10+ workspace crates and heavy dependency usage — that's where the gap tends to be largest.
The Bigger Picture
The separate compilation gap isn't unique to Rust. We've also applied the same principle to C projects — splitting SQLite's monolithic 256K-line sqlite3.c into 2,503 independent compilation units, achieving a 5.8x speedup via parallelism.
The gap is a property of separate compilation itself, not of any particular language or compiler. Wherever a compiler processes code in isolation without knowing what's actually needed, there's potential waste.
And wherever there's waste, there's opportunity.
This concludes our three-part series on speeding up Rust builds with cargo-slicer.
Thanks for reading. Now go build something — a little faster.
Links: