cargo-slicer
Rust builds are slow. cargo-slicer makes them fast.
Two complementary techniques work together:
| Technique | What it does | Typical gain |
|---|---|---|
| Virtual Slicer | Stubs unreachable functions at the MIR level so LLVM never sees them | 1.2–1.5× per workspace |
| Warm-Cache Daemon | Pre-compiles registry crates once, serves cached .rlib files on every subsequent build | skips 100% of registry compilation |
You do not need to understand the internals to use them. The all-in-one script runs the full pipeline in one command.
Real-World Results
Verified benchmarks (Apr 2026, host-native, no warm cache)
Both baseline and vslice-cc use identical RUSTFLAGS (-Z threads=8, wild linker).
2–3 runs per mode, 48-core machine.
| Project | Baseline | vslice-cc | Speedup |
|---|---|---|---|
| helix (16 crates) | 68 s | 44 s | 1.55× |
| ripgrep (50K LOC) | 10.5 s | 7 s | 1.50× |
| zed (209 crates) | 1098 s | 767 s | 1.43× |
| zeroclaw (4 crates) | 686 s | 522 s | 1.31× |
| nushell (41 crates) | 103 s | 82 s | 1.26× |
Docker image (with pre-warmed registry cache)
| Project | Baseline | build-slicer | Speedup |
|---|---|---|---|
| zeroclaw (4 crates) | 794 s | 547 s | 1.45× |
Registry-cache speedups (warm-cache daemon alone, verified Apr 2026)
Both baseline and warmed use nightly + -Z threads=8. Interleaved rounds,
dispatch pre-warmed, rm -rf target/ before each run.
| Crate | Baseline | Warmed | Speedup |
|---|---|---|---|
| image 0.25 | 4.9 s | 2.1 s | 2.3× |
| syn 2.0 | 1.0 s | 0.66 s | 1.5× |
An earlier version of this table claimed 8.5× for image (40.7 s → 4.8 s) and 1.7× for syn (6.7 s → 4.0 s). Those baselines were measured without `-Z threads=8` and the wild linker, while the warmed runs had them — the same apples-to-oranges error as the nushell 5.1×. cargo 0.87.1 (claimed 2.3×) is a regression with fair RUSTFLAGS (dispatch overhead serializes the parallel build).
Requirements
- Rust stable (source slicing, warmup CLI)
- Rust nightly (virtual slicer — requires the `rustc-driver` feature)
- Linux, macOS, or Windows (WSL recommended on Windows)
Getting Started
Docker (quickest start — no installation needed)
Pull the pre-built image and run it against any Rust project:
docker run --rm --cpus=48 \
-v $(pwd):/workspace/project \
ghcr.io/yijunyu/cargo-slicer:latest
The image includes all binaries (cargo-slicer-rustc, cargo_warmup_pch, etc.) and a
pre-warmed registry cache. --cpus=48 ensures the container uses all available cores.
Replace 48 with the output of nproc on your machine. Verified on zeroclaw:
1.45× speedup (794 s → 547 s) vs plain cargo build --release.
First run: the container runs `cargo-slicer pre-analyze` automatically if no `.slicer-cache/` directory is found, then builds with the full 3-layer pipeline.
Install
# Stable binary (source slicing + warmup CLI)
cargo install cargo-slicer
# Nightly driver (virtual slicer — the fast path)
cargo +nightly install cargo-slicer \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
If you are building from source:
git clone https://github.com/yijunyu/cargo-slicer
cd cargo-slicer
cargo install --path .
cargo +nightly install --path . --profile release-rustc \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
WSL on Windows drives (`/mnt/c/`, `/mnt/d/`): prefix every `cargo install` with `SCCACHE_IDLE_TIMEOUT=0` to avoid NTFS permission errors. See Troubleshooting.
Quickstart: one command
Run the full pipeline against your project:
cd your-project
cargo-slicer.sh .
This runs four steps automatically:
1. Warm the registry cache (`cargo-warmup init --tier=1`)
2. Pre-analyze the workspace call graph
3. Plan the critical compilation path
4. Build with the three-layer `RUSTC_WRAPPER` chain
On the first run the warmup step takes ~10–20 seconds for tier-1 (it compiles the most common registry crates once; higher tiers take up to ~10 minutes). Every subsequent cold build is served from cache.
Manual setup (step by step)
If you prefer to control each step:
# Step 1: warm the registry cache (one-time, ~20 s for tier-1)
cargo-warmup init --tier=1
# Step 2: pre-analyze the workspace (seconds)
cd your-project
cargo-slicer pre-analyze
# Step 3: build with virtual slicing
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
Important: never set `RUSTC_WRAPPER` when building cargo-slicer itself. Unset it before running `cargo install --path .`.
How It Works
Rust builds are slow for a structural reason: rustc compiles each crate in
isolation. When it compiles a library crate it cannot know which of its public
functions will be called by downstream code, so it compiles all of them. In a
large workspace most of that work is wasted.
cargo-slicer attacks the problem from two angles simultaneously.
The two techniques
Virtual Slicer
The virtual slicer inserts itself as a RUSTC_WRAPPER. Before each crate is
compiled it performs a reachability analysis — a BFS starting from the entry
point of the binary — and replaces every unreachable function body with an
abort() stub. LLVM never sees those functions, so it never optimises,
inlines, or emits machine code for them.
For a crate like image, which exposes hundreds of format decoders and pixel
converters, a typical application uses one or two formats. The slicer stubs the
rest. LLVM's work drops by 97%.
Warm-Cache Daemon
Registry crates (crates.io dependencies) do not change between builds. The
warm-cache daemon pre-compiles them once and stores the resulting .rlib and
.rmeta artefacts. On every subsequent build, instead of re-running rustc,
the RUSTC_WRAPPER copies the cached artefact into target/ in milliseconds.
The cache key is SHA256(crate + version + rustc_version + features + opt_level),
so the cached artefact is safe to share across projects and across git branches.
A warmed cache built while compiling zed is reused immediately when compiling
nushell.
The three-layer pipeline
When both techniques run together, three wrappers are chained:
cargo build --release
│
▼ RUSTC_WRAPPER = cargo_warmup_dispatch
│ ├─ registry crate? → serve from cache, return immediately
│ └─ local crate? → pass to next wrapper
│
▼ CARGO_WARMUP_INNER_WRAPPER = cargo_slicer_dispatch
│ ├─ no unreachable fns (cache hit)? → pass to real rustc directly
│ └─ has unreachable fns? → pass to driver
│
▼ CARGO_SLICER_DRIVER = cargo-slicer-rustc
└─ MIR analysis → stub unreachable fns → LLVM codegen on minimum set
The cargo-slicer.sh script sets up this chain automatically.
Virtual Slicer
The virtual slicer is a RUSTC_WRAPPER that stubs unreachable functions at the
MIR level before LLVM sees them. It does not modify your source files or your
Cargo.toml.
What "unreachable" means
Starting from the binary's main function, cargo-slicer traces the call graph
across all workspace crates. Any function that cannot be reached from main is
replaced with an abort() body. LLVM skips compilation, optimisation, and code
emission for those functions entirely.
The analysis is conservative: trait impls, generics, async functions, closures, and any function called through a function pointer are always kept.
How it plugs into cargo
RUSTC_WRAPPER=cargo_slicer_dispatch ← stable binary, < 1 ms startup
│
└─ local workspace crate?
CARGO_SLICER_DRIVER=cargo-slicer-rustc ← nightly driver, ~300 ms startup
└─ BFS reachability analysis
└─ MIR stub replacement
└─ CGU filtering (skip codegen units with only stubs)
The dispatch binary keeps the nightly driver out of the fast path. Registry crates (which change rarely and are cached) never pay the 300 ms driver load.
Cross-crate pre-analysis
For accurate reachability across crate boundaries, run pre-analysis before the build:
cargo-slicer pre-analyze
This uses syn-based parsing to build a call graph across all workspace crates
in seconds, writing results to .slicer-cache/. The driver reads these files at
build time instead of re-analysing from scratch for every crate.
Without pre-analysis the slicer falls back to conservative per-crate analysis, which still works but produces fewer stubs.
Tuning
| Environment variable | Effect |
|---|---|
| `CARGO_SLICER_VIRTUAL=1` | Enable virtual slicing |
| `CARGO_SLICER_CODEGEN_FILTER=1` | Skip CGUs that contain only stubs |
| `CARGO_SLICER_DEBUG=1` | Write a debug log to `.cargo-slicer-debug.log` |
| `CARGO_SLICER_SKIP_THRESHOLD=auto` | Skip driver for crates with no predicted stubs (default) |
| `CARGO_SLICER_SKIP_THRESHOLD=0` | Always load the driver for every local crate |
What cannot be stubbed
The slicer never stubs:
- Trait impl associated functions (vtable entries)
- Generic functions (monomorphised at the call site)
- `async fn` and closures
- `unsafe fn` (unless `CARGO_SLICER_RELAX_UNSAFE=1`)
- Any function reachable through a function pointer
These constraints are intentional. Stubbing them would either cause linker errors or produce incorrect binaries.
Upstream proposal
The virtual slicing logic has been extracted into a proposed rustc patch behind
a -Z dead-fn-elimination flag. If accepted upstream, the install story becomes:
RUSTFLAGS="-Z dead-fn-elimination" cargo +nightly build --release
No extra binary, no nightly ABI compatibility shims.
Warm-Cache Daemon
The warm-cache daemon (also called cargo-warmup) pre-compiles registry
crates once and serves the cached .rlib / .rmeta artefacts on every
subsequent build. It is the Rust equivalent of a precompiled-header daemon for
C/C++.
The insight
Registry crates do not change between your builds. syn, serde, tokio,
proc-macro2 — these are compiled identically every time you run cargo clean && cargo build. Compilation caches like sccache help on the second build,
but every fresh environment (new developer, CI machine, Docker container) pays
the full cost again.
The warm-cache daemon shifts that cost to a one-time investment. Pre-warm the registry crates once (~20 seconds for tier-1, up to ~10 minutes for tier-3). Every cold build afterwards — in any project that depends on those crates — skips their compilation entirely.
Cache key
The cache key is:
SHA256(crate_name + version + rustc_version + edition + features + opt_level)
-C metadata and -C extra-filename are excluded. These differ per project but
do not affect the correctness of the compiled artefact. Excluding them is what
enables cross-project sharing: the .rlib compiled while building zed is
reused directly when building nushell.
Usage
# One-time warm (~20 s for tier-1; skips those crates on every cold build after)
cargo-warmup init --tier=1
# Check cache status
cargo-warmup status
# Use in builds
RUSTC_WRAPPER=$(which cargo_warmup_dispatch) cargo +nightly build --release
cargo-slicer.sh runs cargo-warmup init --tier=1 automatically on first use.
Tiers
| Tier | Crates included | Warm time |
|---|---|---|
| `--tier=1` | proc-macro2, quote, syn, serde, tokio, + 5 more core crates | ~20 s |
| `--tier=2` | + 50 most common transitive deps | ~3 min |
| `--tier=3` | All crates.io top-500 | ~10 min |
Tier 1 gives the best return on investment for most projects. Tier 3 is useful in CI environments where build time is money.
How it plugs into cargo
RUSTC_WRAPPER=cargo_warmup_dispatch
│
├─ cache hit? → copy .rlib to target/, return in < 1 ms
└─ cache miss? → invoke real rustc, store result in cache
The dispatch binary adds less than 1 ms per crate invocation on a cache hit.
Sharing the cache across projects
By default the cache lives in ~/.cargo/warmup-cache/. Any project on the same
machine with matching crate versions and rustc toolchain automatically benefits
from a warm cache built by any other project.
To inspect what is cached:
cargo-warmup status
# or
sqlite3 ~/.cargo/warmup-cache/index.db \
'SELECT crate, version, cached_at FROM artefacts ORDER BY cached_at DESC LIMIT 20'
The All-in-One Script
cargo-slicer.sh runs the full four-step pipeline automatically.
cargo-slicer.sh /path/to/your/project
# or, from inside the project:
cargo-slicer.sh .
Pass extra cargo build arguments after the project path:
cargo-slicer.sh . --features my-feature
cargo-slicer.sh . --no-default-features
What it does
Step 0 — Warm the registry cache
cargo-warmup init --tier=1
Skipped if the cache is already warm. On first run this takes ~10–20 seconds for tier-1 (the 10 most common registry crates).
Step 1 — Pre-analyze the workspace call graph
cargo-slicer pre-analyze
Builds a cross-crate call graph using syn-based static analysis. Writes
.slicer-cache/*.analysis and .slicer-cache/*.seeds. Takes 0.5 s (ripgrep)
to 12 s (zed).
Step 2 — Plan the critical path
cargo-warmup pch-plan
Schedules crate compilation in an order that minimises the critical path, so parallelism is maximised across the three-layer wrapper chain.
Step 3 — Build with the wrapper chain
RUSTC_WRAPPER=cargo_warmup_dispatch \
CARGO_WARMUP_INNER_WRAPPER=cargo_slicer_dispatch \
CARGO_SLICER_VIRTUAL=1 \
CARGO_SLICER_CODEGEN_FILTER=1 \
CARGO_SLICER_DRIVER=$(which cargo-slicer-rustc) \
cargo +nightly build --release "$@"
The three-layer chain:
1. `cargo_warmup_dispatch` — serves registry crates from cache (< 1 ms each)
2. `cargo_slicer_dispatch` — routes local crates to the driver or real rustc
3. `cargo-slicer-rustc` — stubs unreachable functions, filters CGUs
Installation
cargo-slicer.sh is installed alongside the binary:
cargo install cargo-slicer
which cargo-slicer.sh # → ~/.cargo/bin/cargo-slicer.sh
Or, from a source checkout:
./cargo-slicer.sh . # runs directly from the repo
Usage Reference
Virtual Slicing (recommended)
Linux / macOS
# Install nightly driver (one-time)
cargo +nightly install cargo-slicer \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
# Pre-analyze workspace call graph (seconds)
cargo-slicer pre-analyze
# Build with virtual slicing
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
WSL on Windows drives (/mnt/c/, /mnt/d/, …)
Same as above but disable sccache to avoid NTFS permission errors:
SCCACHE_IDLE_TIMEOUT=0 cargo +nightly install cargo-slicer \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
cargo-slicer pre-analyze
cargo clean
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
CARGO_SLICER_SCCACHE=/nonexistent \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
Subcommands
| Subcommand | Description |
|---|---|
| (default) | Source slicing: copy deps, delete unused items |
| `build [ARGS]` | Slice deps then build with sliced crates |
| `pre-analyze [--parser BACKEND]` | Cross-crate static analysis for virtual slicing |
| `generate [-o DIR] [--delete]` | Write a sliced source copy without modifying the original |
| `rl-bench [OPTIONS]` | Measure compile speedup as RL training KPIs |
Pre-analysis parser backends
cargo-slicer pre-analyze # syn (default, most accurate)
cargo-slicer pre-analyze --parser fast # fast tokenizer
cargo-slicer pre-analyze --parser ctags # items only, no call edges
| Backend | Speed | Call edges | Use when |
|---|---|---|---|
| `syn` | 0.5–12 s | Yes, accurate | Default — best stubs |
| `fast` | < 1 s | Yes, approximate | Large workspaces, time-sensitive |
| `ctags` | Fastest | None | Items-only analysis |
Source slicing (stable, no nightly)
cargo-slicer # slice all deps
cargo-slicer regex # slice one crate
cargo-slicer --clean # clean and re-slice
cargo-slicer -O # fast production mode (skip verification)
cargo-slicer build --release # slice + build
Optimization levels
| Level | Description |
|---|---|
| `-O0` | No deletion — safe baseline |
| `-O1` | Delete private functions (with verification) |
| `-O2` | Delete all private items + trial deletion |
| `-O3` | Graph-guided deletion (default) |
| `-O` | Fast production — skip verification |
Environment Variables
Core
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_VIRTUAL` | unset | Set to 1 to enable virtual slicing |
| `CARGO_SLICER_CODEGEN_FILTER` | unset | Set to 1 to skip CGUs containing only stubs |
| `RUSTC_WRAPPER` | unset | Set to path of `cargo_slicer_dispatch` |
| `CARGO_SLICER_DRIVER` | unset | Set to path of `cargo-slicer-rustc` |
Cross-crate analysis
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_CROSS_CRATE` | unset | Set to 1 to enable cross-crate analysis |
| `CARGO_SLICER_PARSER` | `syn` | Pre-analysis backend: `syn`, `fast`, or `ctags` |
MIR-precise analysis
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_MIR_PRECISE` | unset | Set to 1 for MIR-level whole-program analysis |
| `CARGO_SLICER_WORKSPACE_CRATES` | unset | Comma-separated list of workspace crates to harvest |
Performance tuning
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_SKIP_THRESHOLD` | `auto` | Skip driver when predicted stubs < threshold. `auto` = skip 0-stub crates; `0`/`never` = never skip |
| `CARGO_SLICER_DAEMON` | unset | Set to 1 to enable fork-server (amortises 300 ms driver load) |
| `CARGO_SLICER_SCCACHE` | auto | Path to sccache, or `/nonexistent` to disable |
| `CARGO_SLICER_RELAX_UNSAFE` | unset | Set to 1 to allow stubbing `unsafe fn` |
Caching
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_CACHE_DIR` | `.slicer-cache` | Directory for incremental cache files |
| `CARGO_SLICER_NO_CACHE` | unset | Set to 1 to disable caching entirely |
Debugging
| Variable | Default | Description |
|---|---|---|
| `CARGO_SLICER_DEBUG` | unset | Set to 1 to enable debug logging |
| `CARGO_SLICER_DEBUG_LOG` | `.cargo-slicer-debug.log` | Custom path for debug log |
| `CARGO_SLICER_MARKED_OUT` | unset | Write marked items to a file for inspection |
Troubleshooting
RUSTC_WRAPPER breaks building cargo-slicer itself
Symptom: cargo install --path . fails with mysterious compilation errors.
Cause: RUSTC_WRAPPER=cargo_slicer_dispatch is set in your environment from
a previous virtual-slicing session. It intercepts compilation of cargo-slicer's
own dependencies.
Fix: Unset it before building cargo-slicer:
unset RUSTC_WRAPPER CARGO_SLICER_VIRTUAL CARGO_SLICER_CODEGEN_FILTER
cargo install --path .
Only set RUSTC_WRAPPER when building your target project.
sccache permission errors on WSL /mnt/ drives
Symptom: failed to set permissions errors during cargo install on /mnt/c/ or /mnt/d/.
Cause: NTFS does not support Unix file permissions. sccache creates files with Unix permissions that NTFS cannot store.
Fix: Build on the native Linux filesystem:
cd ~
git clone https://github.com/yijunyu/cargo-slicer
cd cargo-slicer
SCCACHE_IDLE_TIMEOUT=0 cargo install --path .
Or, if you must stay on the Windows drive, disable sccache entirely:
SCCACHE_IDLE_TIMEOUT=0 cargo install --path .
# and when building your project:
CARGO_SLICER_SCCACHE=/nonexistent RUSTC_WRAPPER=... cargo +nightly build --release
Stale .slicer-cache/ after updating the driver
Symptom: unexpected stub failures or missed stubs after upgrading cargo-slicer.
Fix: Delete the cache:
rm -rf .slicer-cache/
Nightly toolchain mismatch
Symptom: cargo-slicer-rustc crashes at startup with a rustc_private ABI error.
Cause: The driver binary was compiled against a different nightly than the one currently active.
Fix: Rebuild the driver against the active nightly:
rustup update nightly
cargo +nightly install --path . --profile release-rustc \
--features rustc-driver \
--bin cargo-slicer-rustc \
--bin cargo_slicer_dispatch
Build succeeds but no speedup
Likely causes:
- Check `.cargo-slicer-debug.log` for `skip-driver` markers — all crates skipped means the threshold is too aggressive. Fix: `CARGO_SLICER_SKIP_THRESHOLD=0`.
- The project is a library crate with no binary entry point. The slicer is most effective on binary crates with deep dependency trees.
- Pre-analysis was not run. Run `cargo-slicer pre-analyze` first.
Benchmarks
All numbers are cold builds (after cargo clean) on a 48-core Linux server
with nightly Rust.
Virtual slicer — rust-perf standard suite (not yet re-verified)
These single-crate numbers were measured without -Z threads=8 or the wild
linker. They have not been re-verified with the current fair-RUSTFLAGS
protocol and may overstate speedups (same apples-to-oranges issue as the
retracted workspace numbers above).
| Project | Baseline | cargo-slicer | Speedup |
|---|---|---|---|
| image 0.25.6 (lib) | 40,742 ms | 1,461 ms | 27.9× |
| ripgrep 14.1.1 (bin) | 24,094 ms | 5,891 ms | 4.09× |
| cargo 0.87.1 (workspace) | 133,797 ms | 61,922 ms | 2.16× |
| diesel 2.2.10 (lib) | 25,854 ms | 14,339 ms | 1.80× |
| syn 2.0.101 (lib) | 6,711 ms | 4,157 ms | 1.61× |
| serde 1.0.219 (lib) | 3,951 ms | 3,966 ms | 1.00× |
serde is already minimal — almost all of its code is reachable via derive
macros. The slicer correctly identifies this.
Virtual slicer — real binary projects
All measurements use identical RUSTFLAGS for both baseline and vslice-cc
(-Z threads=8 -C linker=clang -C link-arg=--ld-path=wild). 48-core machine,
Apr 2026, 2–3 runs per mode.
| Project | Baseline | vslice-cc | Speedup | Notes |
|---|---|---|---|---|
| helix (16 local crates) | 68 s | 44 s | 1.55× | |
| ripgrep (50K LOC) | 10.5 s | 7 s | 1.50× | |
| zed (209 local crates) | 1098 s | 767 s | 1.43× | 76 driver, 131 skip |
| zeroclaw (4 local crates) | 686 s | 522 s | 1.31× | 3,786 stubs / ~241k mono items (1.6% overall, 4.4% bin) |
| nushell (41 local crates) | 103 s | 82 s | 1.26× |
Retracted claims: nushell was reported at 5.1× — apples-to-oranges RUSTFLAGS mismatch; honest speedup is 1.26×. cargo-slicer (self) was claimed at 1.74× but re-verified at 1.00× (only 1 driver crate, 0 stubs).
Docker benchmarks (docker run cargo-slicer bench)
Fair comparison inside Docker: same nightly toolchain, cargo fetch before
timing (excludes download time), cargo clean between baseline and slicer.
Slicer timing includes cargo-slicer pre-analyze overhead.
| Project | Baseline | Slicer | Speedup |
|---|---|---|---|
| zed (209 crates) | 1149 s | 545 s | 2.11× |
| helix (16 crates) | 95 s | 59 s | 1.61× |
| zeroclaw (4 crates) | 842 s | 542 s | 1.55× |
| ripgrep (17 crates) | 15 s | 12 s | 1.31× |
| nushell (41 crates) | 118 s | 94 s | 1.25× |
Docker speedups are higher than bare-metal for large projects (zed 2.11× vs 1.43×) because fewer cores amplify the benefit of eliminating codegen work — less parallelism means each eliminated function saves more wall time.
# Run the benchmark yourself
docker build -t cargo-slicer .
docker run --rm -v /path/to/project:/workspace/project cargo-slicer bench
Warm-cache daemon — verified (Apr 2026)
Both baseline and warmed use nightly + -Z threads=8. Interleaved rounds,
dispatch pre-warmed, rm -rf target/ before each run.
| Crate | Baseline | Warmed | Speedup |
|---|---|---|---|
| image 0.25 | 4.9 s | 2.1 s | 2.3× |
| syn 2.0 | 1.0 s | 0.66 s | 1.5× |
An earlier version of this table claimed 8.5× for image (40.7 s → 4.8 s) and 1.7× for syn (6.7 s → 4.0 s). Those baselines were measured without `-Z threads=8` and the wild linker, while the warmed runs had them — the same apples-to-oranges error as the nushell 5.1×. cargo 0.87.1 (claimed 2.3×) is a regression with fair RUSTFLAGS: baseline 15 s vs warmed 64 s — dispatch overhead serializes what `-Z threads=8` parallelizes across 48 cores.
A warm cache populated by one project is reused across all projects on the same machine.
Upstream -Z dead-fn-elimination patch
| Project | Baseline | -Z dead-fn-elimination | Reduction |
|---|---|---|---|
| zed | 1,790 s | 1,238 s | −31%, 9.2 min saved |
| rustc | 336 s | 176 s | −48%, 2.7 min saved |
| ripgrep | 13 s | 13 s | break-even (all fns reachable) |
C/C++ projects — clang-daemon PCH acceleration
build-accelerate.sh (included in the image) auto-detects C/C++ projects and
injects a precompiled header via clang-daemon. The technique eliminates
repeated header parsing across parallel compilation units.
Already benchmarked (48-core server, Clang 21, -j48):
| Project | Stars | Files | Baseline | Accelerated | Speedup | Notes |
|---|---|---|---|---|---|---|
| Linux kernel 6.14 | 227k | 26,339 | ~890 s | ~730 s | 1.22× | GCC fallback for asm-heavy files |
| LLVM 20 | — | ~2,873 | measured | measured | 1.22× | Clang 21 compiling Clang 20 |
| LLVM 21 | — | ~2,873 | measured | measured | 1.24× | Self-hosted build |
| vim | — | ~300 | baseline | accelerated | 1.3× | Small project, overhead minimal |
| sqlite3 | — | 1 (amalgam) | 20 s | 20.2 s | 1.01× | Single-file; PCH gives nothing |
Predicted speedup for top starred projects (based on file count × header density model):
| Rank | Project | Stars | Lang | Files | LOC | Build | Predicted | Reason |
|---|---|---|---|---|---|---|---|---|
| 1 | Linux | 227k | C | 26,339 | ~20M | Make | 1.2× ✅ benchmarked | |
| 2 | TensorFlow | 195k | C++ | ~650 | ~2.5M | Bazel/CMake | 1.15–1.25× | Heavy STL + proto headers |
| 3 | Godot | 109k | C++ | ~3,500 | ~8.6M | SCons | 1.2–1.3× | Large header graph |
| 4 | Electron | 121k | C++ | (Chromium) | ~25M | ninja | 1.2× | Chromium-scale header reuse |
| 5 | OpenCV | 87k | C++ | ~1,000 | ~600K | CMake | 1.15–1.2× | Dense OpenCV headers |
| 6 | FFmpeg | 58k | C | ~500 | ~1M | autotools | 1.1–1.2× | libav* headers per file |
| 7 | Bitcoin | 89k | C++ | ~500 | ~750K | CMake | 1.1–1.2× | Boost + secp256k1 headers |
| 8 | Netdata | 78k | C | ~700 | ~700K | CMake | 1.1–1.15× | Moderate header depth |
| 9 | Redis | 74k | C | ~250 | ~330K | Make | 1.05–1.1× | Shallow headers, small codebase |
| 10 | Git | 60k | C | ~400 | ~140K | Make | 1.05–1.1× | Minimal headers |
| — | llama.cpp | 102k | C++ | ~150 | ~250K | CMake | 1.05× | Small; GGML headers not dense |
| — | sqlite3 | — | C | 1 | ~255K | Make | ≈1× | Amalgamation; no parallelism |
Key insight: speedup scales with (files × header parse fraction). Projects with thousands of files each including the same heavyweight headers (Linux, Godot, TensorFlow, Chromium) get the most benefit. Single-file amalgamations (sqlite3) and projects with shallow headers (Redis, Git) get little to none.
To run against any of these projects:
# Clone and accelerate (auto-detects C/C++ via compile_commands.json or Makefile)
git clone https://github.com/torvalds/linux
build-accelerate.sh ./linux
# Or via Docker (mounts your checkout)
docker run --rm --cpus=48 \
-v $(pwd)/linux:/workspace/project \
ghcr.io/yijunyu/cargo-slicer:latest
For projects using SCons (Godot) or Bazel (TensorFlow), generate `compile_commands.json` first:
# Godot
scons compiledb
# TensorFlow (CMake path)
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -B build && cp build/compile_commands.json .
Running benchmarks yourself
# Multi-crate CI benchmark (7 projects, baseline vs vslice-cc, 3 runs each)
./scripts/ci_bench_multicrate.sh
# Individual project
./scripts/bench_fresh_build.sh nushell baseline 3
./scripts/bench_fresh_build.sh nushell vslice-cc 3
# RL training KPI report
cargo-slicer rl-bench --project /tmp/your-project --runs 2
Results are stored in bench-results.db (SQLite).
RL Training KPIs
Code RL systems use compilation success as the reward signal. For Rust projects,
compile time is 70–90% of the rollout phase — the main bottleneck of the
training loop. cargo-slicer rl-bench translates compile speedup into the KPI
language used by MLOps teams.
Usage
# Measure current project (2 cold builds per mode)
cargo-slicer rl-bench
# Custom options
cargo-slicer rl-bench --runs 3 --rollout-fraction 0.85 \
--gpus 16 --project /tmp/your-project
# Persist to bench-results.db
cargo-slicer rl-bench --db bench-results.db
KPIs reported
KPI 1 — Cold-build throughput (samples/hour)
samples/hour = 3600 / compile_time_seconds
KPI 2 — Incremental feedback latency
Time from a one-line edit to the first cargo check result.
KPI 3 — Compute cost per valid sample
cost = compile_time / pass_rate
KPI 4 — Cluster-hour equivalent
How many RL samples fit in one GPU-cluster-hour at a given rollout fraction.
Example output (nushell, 1.26× speedup)
Numbers below are nushell — verified Apr 2026 with identical RUSTFLAGS for
both modes (-Z threads=8, wild linker). An earlier version of this example
claimed 5.1× for nushell; that was an apples-to-oranges comparison where the
baseline lacked the parallel frontend and fast linker. The honest speedup is
1.26× (103 s → 82 s).
KPI 1 — Cold-Build Throughput (samples/hour)
Baseline : 103.0s → 34 samples/hr
cargo-slicer: 82.0s → 43 samples/hr (1.26× faster)
KPI 2 — Incremental Feedback Latency (cargo check)
Baseline : 12.4s → 290 feedback-loops/hr
cargo-slicer: 4.1s → 878 feedback-loops/hr (3.0× faster)
Cluster-Hour Equivalent (8 GPUs, 80% rollout fraction)
Baseline : 272 samples / cluster-hour
cargo-slicer: 344 samples / cluster-hour (1.26× more data)
Persisting results
Results are written to the rl_kpi table in bench-results.db:
SELECT project, baseline_cold_secs, slicer_cold_secs, speedup,
slicer_throughput_per_hr, ts
FROM rl_kpi ORDER BY ts DESC LIMIT 10;
Blog
A three-part series on why Rust builds are slow and how cargo-slicer closes the gap.
- Part I: The Waiting Game — why the usual tricks don't work
- Part II: The Gap — measuring exactly how much work is wasted
- Part III: Closing the Gap — how the virtual slicer works
Speeding Up Rust Builds: Part I — The Waiting Game
Part I of III. Part II: The Gap | Part III: Closing the Gap
17 Minutes of Your Life, Gone
Let's talk about Zed.
Zed is a gorgeous code editor written in Rust. Fast. Sleek. Modern. The kind of project that makes you proud to be a Rust developer.
Now try building it from source:
$ time cargo build --release
...
Finished `release` profile in 16m 52s
Seventeen minutes.
You start the build. You check your email. You make coffee. You drink the coffee. You check Reddit. You wonder if you chose the wrong career. The build finishes. You realize you had a typo. You start again.
This isn't a Zed problem. This is a Rust problem. Or rather, a big Rust project problem. Zed has over 500,000 lines of code across 198 workspace crates. That's a lot of Rust for the compiler to chew through.
But surely we can do better, right? The Rust community has been optimizing the compiler for years. Let's try everything.
Attempt 1: Parallel Frontend
Rust nightly has a parallel frontend. More threads, more speed. Simple.
RUSTFLAGS="-Z threads=8" cargo +nightly build --release
Result: the build gets maybe 5-10% faster. Nice, but we're still waiting 15 minutes. The parallel frontend helps with parsing and type checking, but the real time sink is LLVM codegen — and that's already parallelized per codegen unit.
Attempt 2: Faster Linker
Linking takes time. Let's use wild, a fast linker written in Rust:
RUSTFLAGS="-C link-arg=-fuse-ld=wild" cargo +nightly build --release
Result: linking goes from ~8 seconds to ~3 seconds. Great for linking. But linking is less than 1% of the total build time. We've saved 5 seconds out of 1,012. The bottleneck isn't linking.
Attempt 3: Compilation Caching
sccache caches compiled crates, so rebuilds are faster:
RUSTC_WRAPPER=sccache cargo build --release
Result: the second build is blazingly fast. But the first build — a clean, fresh build — is exactly the same. And in CI, every build is a fresh build. Your new developer's first git clone && cargo build? Fresh build. Switching branches with incompatible deps? Fresh build.
Caching doesn't reduce the work. It just remembers it for next time.
Attempt 4: Cranelift Backend
What if we skip LLVM entirely? The Cranelift backend compiles much faster:
RUSTFLAGS="-Z codegen-backend=cranelift" cargo +nightly build
Result: significantly faster compilation. But the output isn't optimized. Cranelift is great for development builds, but for release builds — the ones your users run, the ones CI produces — you want LLVM's optimizations. We need --release to be fast.
Attempt 5: Profile-Guided Optimization of rustc
The Rust project already ships a PGO-optimized compiler. Years of work have gone into making rustc itself faster. The nightly you're using right now benefits from all of that.
And yet, here we are. Seventeen minutes.
The Honest Question
So let me ask you something uncomfortable.
We've tried the parallel frontend. We've tried faster linkers. We've tried caching. We've tried alternative backends. We've tried optimizing the compiler itself.
What if the compiler is already doing its job well? What if the problem isn't how the compiler compiles, but what we're asking it to compile?
Think about it. When you cargo build --release on Zed, the compiler dutifully compiles every public function in every library crate. The regex crate exposes dozens of functions — your project calls maybe three. The serde crate has hundreds of methods — you use a fraction. The compiler doesn't know this. It can't. It's compiling each crate in isolation, and any public function might be called from downstream.
What if a significant chunk of the compiler's work is simply... unnecessary?
What if we could tell the compiler, before it even starts, "hey, you don't need to bother with these 9,000 functions"?
That would be interesting.
To be continued in Part II: The Gap...
This is Part I of a three-part series on cargo-slicer, a tool for speeding up Rust release builds. Part II introduces the "separate compilation gap" and measures just how much work is wasted. Part III shows how to close the gap.
Speeding Up Rust Builds: Part II — The Gap
Part II of III. Part I: The Waiting Game | Part III: Closing the Gap
Previously
In Part I, we tried every known trick to speed up building Zed — parallel frontend, fast linker, caching, alternative backends. Nothing made a dent on that 17-minute clean release build. We ended with a question: what if the problem isn't how the compiler works, but what we're asking it to compile?
Let's find out.
A Library's Dilemma
Consider a library crate — say, serde_json. It exposes a rich API: from_str(), from_slice(), from_reader(), to_string(), to_string_pretty(), to_vec(), to_writer(), and dozens more.
Your project calls serde_json::from_str() and serde_json::to_string(). That's it. Two functions.
But when rustc compiles serde_json, it doesn't know you only need two functions. It can't. The crate boundary is opaque — rustc compiles each crate independently, treating every public function as a potential entry point. It must generate optimized machine code for all of them.
This isn't a bug. It's how separate compilation works. It's a fundamental architectural decision that enables crates to be compiled independently, cached, and reused. It's the right design.
But it has a cost.
The Separate Compilation Gap
We call this cost the separate compilation gap: the difference between what the compiler must compile (everything visible) and what the program actually needs (everything reachable from main).
Formally, for a compilation unit u:
Gap(u) = (|Visible(u)| - |Reachable(u)|) / |Visible(u)|
Where:
- Visible(u) = all symbols the compiler processes (every public function, every impl, every trait method)
- Reachable(u) = the subset actually reachable from main() via whole-program call graph analysis
If Gap = 0%, the compiler is doing exactly the right amount of work. If Gap = 50%, half the compiler's effort is wasted.
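In code, the gap is a one-line ratio. A minimal sketch, plugging in Zed's measured instruction counts reported in the next section:

```rust
/// Fraction of the compiler's visible work that is unreachable from main().
/// Gap(u) = (|Visible(u)| - |Reachable(u)|) / |Visible(u)|
fn gap(visible: f64, reachable: f64) -> f64 {
    (visible - reachable) / visible
}

fn main() {
    // Zed's counts, in Ginstr (giga-instructions): 28,559 total vs 18,067 reachable.
    let g = gap(28_559.0, 18_067.0);
    println!("{:.0}%", g * 100.0); // prints 37%
}
```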
Measuring Zed's Gap
So what's Zed's gap?
We built a tool that does whole-program reachability analysis across all 198 workspace crates. Starting from main(), it traces every function call, every trait method invocation, every generic instantiation, and marks what's actually needed.
Then we count: how many CPU instructions does the compiler execute with everything vs. only the reachable code?
| | Total | Reachable | Gap |
|---|---|---|---|
| CPU instructions | 28,559 Ginstr | 18,067 Ginstr | 37% |
| Functions analyzed | 32,579 | 23,095 | 29% |
37% of the CPU instructions the compiler executes when building Zed are spent compiling code that no one will ever call.
Let that sink in. More than a third of the compiler's work is wasted. That's not a rounding error. That's not a micro-optimization waiting to happen. That's 10 trillion CPU instructions, burned for nothing, on every clean build.
And this isn't just Zed:
| Project | LOC | Instructions (Base) | Instructions (Reachable) | Gap |
|---|---|---|---|---|
| zed | 500K | 28,559 Ginstr | 18,067 Ginstr | 37% |
| rustc | 600K | 5,746 Ginstr | 4,268 Ginstr | 26% |
| zeroclaw | 86K | 1,507 Ginstr | 1,314 Ginstr | 13% |
| helix | 100K | 2,256 Ginstr | 2,004 Ginstr | 11% |
| ripgrep | 50K | 314 Ginstr | 298 Ginstr | 5% |
| nushell | 200K | 3,695 Ginstr | 3,682 Ginstr | 0.4% |
| bevy | 300K | 3,807 Ginstr | 3,791 Ginstr | 0.4% |
Some projects have tiny gaps. Bevy and nushell use almost everything they import — good for them. But Zed has a 37% gap, rustc has 26%, and even zeroclaw (a smaller project) wastes 13%.
Why Some Gaps Are Bigger
The gap depends on how a project uses its dependencies.
Large-gap projects like Zed have many library crates with broad APIs, but the binary only touches a fraction. Zed pulls in hundreds of crates for its editor, terminal, collaboration, and AI features. Each crate is compiled in full, even though Zed's binary only exercises specific code paths.
Small-gap projects like bevy use their dependencies more thoroughly. A game engine that imports a math library probably uses most of the math functions. There's less waste.
There's also an interesting amplification effect. In Rust, generics are monomorphized — each generic function gets compiled once per concrete type it's used with. When you stub an unreachable function, you also eliminate all its downstream monomorphizations. That's why Zed's instruction gap (37%) is larger than its function gap (29%) — each stubbed function cascades into many eliminated monomorphizations.
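The cascade is easy to see in miniature. In the self-contained illustration below (the function is made up), each concrete type `largest` is called with gets its own compiled copy; if the only caller were stubbed, every copy would vanish from the mono graph:

```rust
// Each concrete T produces a separate compiled copy of `largest`
// (monomorphization). Note: panics on an empty slice — illustration only.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut max = items[0];
    for &x in &items[1..] {
        if x > max {
            max = x;
        }
    }
    max
}

fn main() {
    // Two instantiations are generated here: largest::<i32> and largest::<f64>.
    println!("{}", largest(&[3, 7, 2]));  // prints 7
    println!("{}", largest(&[1.5, 0.2])); // prints 1.5
    // If the sole caller of `largest` were stubbed, neither instantiation
    // would ever reach LLVM — the whole subtree is pruned.
}
```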
The Honest Assessment
Here's the uncomfortable truth: the Rust compiler isn't slow. It's doing too much work.
And it's doing too much work because separate compilation — the very architecture that makes Cargo fast for incremental builds and enables the crates ecosystem — prevents the compiler from knowing what's actually needed.
Link-Time Optimization (LTO) can eliminate dead code after compilation, but it doesn't reduce the compilation phase itself. The work has already been done.
What we need is something that works before compilation. Something that tells the compiler, at the crate boundary, "here's exactly which public functions are actually called from downstream — you can skip the rest."
So How Do We Close the Gap?
We know the gap exists. We can measure it precisely. For Zed, 37% of the compiler's work is provably unnecessary.
The question is: can we build a tool that, operating purely as a RUSTC_WRAPPER with no compiler modifications, identifies unreachable functions and eliminates them before LLVM codegen?
And can we do it without breaking anything?
To be continued in Part III: Closing the Gap...
This is Part II of a three-part series on cargo-slicer. Part I set up the problem. Part III provides the solution.
Speeding Up Rust Builds: Part III — Closing the Gap
Part III of III. Part I: The Waiting Game | Part II: The Gap
Previously
In Part I, we saw that building Zed takes 17 minutes and no existing optimization really helps. In Part II, we discovered why: 37% of the compiler's work is spent on unreachable code — the "separate compilation gap."
Now let's close it.
The Approach: Four Steps
The idea is simple in principle: figure out what's reachable from main() across all crates, then tell the compiler to skip everything else. In practice, there are a few details to get right.
We call this approach PRECC (Predictive Precompilation Cutting), and it works in four phases:
Step 1: Extract
Before compilation starts, we scan all workspace crate sources and build a unified cross-crate call graph. For Rust, we use a syn-based parser that extracts function definitions, call sites, and public API surfaces from every .rs file. This takes a few seconds, even for large projects.
cargo-slicer pre-analyze # builds the cross-crate call graph
For Zed, this produces a graph covering all 198 workspace crates: which functions exist, which functions call which, and which are publicly exported.
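To make Step 1 concrete, here is a deliberately toy extractor — a stand-in sketch, not the real syn-based parser — that pulls function names out of Rust source text with plain string matching:

```rust
/// Toy stand-in for the syn-based extractor: find `fn` definitions in a
/// source string. (The real tool parses full Rust syntax; this one just
/// matches line prefixes and ignores generics, macros, and comments.)
fn extract_fn_names(src: &str) -> Vec<String> {
    let mut names = Vec::new();
    for line in src.lines() {
        let line = line.trim_start();
        // Strip common visibility qualifiers like `pub ` / `pub(crate) `.
        let rest = line
            .strip_prefix("pub ")
            .or_else(|| line.strip_prefix("pub(crate) "))
            .unwrap_or(line);
        if let Some(after) = rest.strip_prefix("fn ") {
            if let Some(end) = after.find('(') {
                names.push(after[..end].to_string());
            }
        }
    }
    names
}

fn main() {
    let src = "pub fn from_str(s: &str) {}\nfn helper() {}\n";
    assert_eq!(extract_fn_names(src), vec!["from_str", "helper"]);
}
```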
Step 2: Analyze
Starting from main(), we run a BFS (breadth-first search) through the call graph. Every function reachable from main is marked as "needed." Everything else is marked as "unreachable."
We're careful about special cases. Drop implementations? Always needed (the compiler inserts drop calls implicitly). Trait implementations? Always needed (dynamic dispatch via dyn Trait can call them). #[no_mangle] FFI functions? Always needed. Closures, async functions, unsafe functions? Always needed. We maintain 9 categories of exclusions to be safe.
The result: a precise set of functions that can be safely eliminated from each crate.
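The core of Step 2 is a plain BFS over the call graph. A minimal sketch with a hypothetical graph (the crate and function names are invented for illustration):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// BFS over a call graph: everything reachable from `root` is "needed";
/// everything else is a candidate for stubbing.
fn reachable(graph: &HashMap<&str, Vec<&str>>, root: &str) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue = VecDeque::from([root.to_string()]);
    while let Some(f) = queue.pop_front() {
        if !seen.insert(f.clone()) {
            continue; // already visited
        }
        for &callee in graph.get(f.as_str()).into_iter().flatten() {
            queue.push_back(callee.to_string());
        }
    }
    seen
}

fn main() {
    // Hypothetical cross-crate call graph.
    let graph = HashMap::from([
        ("main", vec!["serde_json::from_str", "app::render"]),
        ("serde_json::from_str", vec!["serde_json::parse"]),
        ("serde_json::to_writer", vec![]), // public, but never called
    ]);
    let needed = reachable(&graph, "main");
    assert!(needed.contains("serde_json::parse"));
    assert!(!needed.contains("serde_json::to_writer")); // stub candidate
    println!("reachable functions: {}", needed.len()); // prints 4
}
```

The real analysis adds the safety exclusions listed above (Drop impls, trait impls, `#[no_mangle]`, etc.) as extra BFS roots, so they are always marked needed.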
Step 3: Predict
Here's where it gets interesting. Naively, you'd think "just cut everything unreachable." But our analysis has overhead — loading the driver, traversing MIR, doing cache I/O. For crates with very few unreachable functions, this overhead exceeds the savings.
We learned this the hard way. Applying cutting to every crate in bevy slows the build by 4.4%. The gap is only 0.4%, and the analysis overhead eats the tiny savings.
So for each crate, we predict: will cutting save more time than the analysis costs? If yes, cut. If no, skip — compile it normally.
Our baseline heuristic is simple:
- If the predicted number of stubbable functions is 0: skip.
- If it's less than 5 AND the stub ratio is under 2%: skip.
- Otherwise: cut.
This gets us 92-100% precision on projects that benefit, and correctly skips projects where cutting would hurt.
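The heuristic fits in a few lines. A sketch with the thresholds from the list above (the function name and signature are ours, not the tool's):

```rust
/// Baseline skip/cut heuristic from Step 3: cut only when the predicted
/// savings plausibly exceed the analysis overhead.
fn should_cut(stubbable: usize, total_fns: usize) -> bool {
    if stubbable == 0 {
        return false; // nothing to cut
    }
    let stub_ratio = stubbable as f64 / total_fns as f64;
    // Fewer than 5 stub candidates AND under 2% of the crate: overhead wins.
    !(stubbable < 5 && stub_ratio < 0.02)
}

fn main() {
    assert!(!should_cut(0, 1_000));  // skip: no candidates
    assert!(!should_cut(3, 1_000));  // skip: 3 fns at 0.3% — overhead exceeds savings
    assert!(should_cut(4, 100));     // cut: only 4 fns, but 4% of the crate
    assert!(should_cut(900, 3_000)); // cut: large gap
}
```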
Step 4: Cut
For crates marked "cut," we intercept the compiler. Operating as a RUSTC_WRAPPER, we hook into rustc after type checking and replace unreachable function bodies with MIR-level abort stubs. The function signature remains (so downstream crates can still reference it), but the body is replaced with a single abort() instruction.
When rustc's monomorphization collector encounters a stubbed function, it finds no callees — no downstream functions, no generic instantiations, nothing to compile. The entire subtree of the mono graph is pruned. LLVM never sees it.
No source code is modified. No Cargo.toml changes. No feature flags. The compiler simply does less work.
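For intuition, here is what a stubbed function conceptually looks like if you imagine it projected back into source code — purely illustrative, since the real replacement happens at the MIR level and no source file is touched:

```rust
use std::process;

// Conceptual view of a stub (hypothetical function; the actual rewrite is
// done in MIR by the wrapper, never in source). The signature is preserved
// so downstream crates still resolve the symbol; only the body is replaced.
pub fn unreachable_api(input: &str) -> usize {
    let _ = input;
    process::abort() // single abort instruction; no callees, nothing to monomorphize
}

fn main() {
    // The stub is never called — that is what the reachability analysis
    // guarantees — but it remains referenceable, e.g. as a fn pointer.
    let f: fn(&str) -> usize = unreachable_api;
    println!("stub compiled at {:p}, never invoked", f);
}
```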
# The full command
cargo-slicer pre-analyze
CARGO_SLICER_VIRTUAL=1 CARGO_SLICER_CODEGEN_FILTER=1 \
RUSTC_WRAPPER=$(which cargo_slicer_dispatch) \
cargo +nightly build --release
Or, even simpler:
cargo-slicer.sh /path/to/your/project
The Results
So how much faster is Zed?
| Project | Baseline | With PRECC | Wall-clock | Instructions | Peak Memory |
|---|---|---|---|---|---|
| zed | 1,012s | 719s | -29% | -37% | -45% |
| rustc | 135.8s | 112.4s | -17% | -26% | -7% |
| zeroclaw | 192.9s | 170.4s | -12% | -13% | -11% |
| helix | 71.2s | 66.6s | -6% | -11% | -18% |
| ripgrep | 11.1s | 10.7s | -4% | -5% | -6% |
| nushell | 106.5s | 108.9s | +2.3% | -0.4% | — |
| bevy | 81.8s | 85.4s | +4.4% | -0.4% | — |
Zed's build drops from 17 minutes to 12 minutes. That's 5 minutes saved on every clean build. 45% less memory. 37% fewer CPU instructions.
The Rust compiler itself builds 17% faster. Helix, 6%. Ripgrep, 4%.
And look at the last two rows. Nushell and bevy have tiny gaps (0.4%), so the prediction step correctly identifies them as not worth cutting. Without prediction, bevy would be 4.4% slower — the overhead exceeds the savings. With prediction, we avoid that regression entirely.
The Honest Part
Let me be upfront about what this tool doesn't do:
- Incremental builds: cargo-slicer targets fresh/clean builds. For incremental cargo check and small changes, rustc's built-in incremental compilation is already fast. We're solving the CI/fresh-build problem.
- Small projects: if your project is 5,000 lines with 3 dependencies, the gap is tiny and the overhead isn't worth it. This tool shines on larger codebases (50K+ LOC).
- Correctness guarantee: we replace function bodies with abort stubs. If our reachability analysis is wrong and a "stubbed" function gets called at runtime, the program will abort. In practice, our 9 safety exclusion categories prevent this — we've tested on all benchmark projects — but it's worth knowing.
- Nightly only: the MIR-level hooks require unstable rustc APIs, so a nightly toolchain is required.
Try It
Install with one command:
curl -fsSL https://raw.githubusercontent.com/yijunyu/cargo-slicer/main/install.sh | bash
Then build any Rust project:
cargo-slicer.sh /path/to/your/project
That's it. No config files, no source changes, no Cargo.toml edits. Point it at any Rust project with a Cargo.toml and see what happens.
What We'd Love to Hear
We're researchers, not fortune tellers. The benchmark numbers above are from our test machine (48-core, 128 GB RAM, Linux). Your mileage will vary depending on your project's dependency structure, your hardware, and the phase of the moon.
We genuinely want to know how this works on your project. Does it speed things up? Does it break something? Is the gap large or small? Every data point helps us improve.
Reach out:
- GitHub Issues: github.com/yijunyu/cargo-slicer/issues — bug reports, benchmark results, feature requests
- Email: yijun.yu@open.ac.uk — for detailed results, collaboration, or just to say hello
We're particularly interested in projects with 10+ workspace crates and heavy dependency usage — that's where the gap tends to be largest.
The Bigger Picture
The separate compilation gap isn't unique to Rust. We've also applied the same principle to C projects — splitting SQLite's monolithic 256K-line sqlite3.c into 2,503 independent compilation units, achieving a 5.8x speedup via parallelism.
The gap is a property of separate compilation itself, not of any particular language or compiler. Wherever a compiler processes code in isolation without knowing what's actually needed, there's potential waste.
And wherever there's waste, there's opportunity.
This concludes our three-part series on speeding up Rust builds with cargo-slicer.
Thanks for reading. Now go build something — a little faster.
Links: