BOOM Critical-Path Timing Optimization
A CHIA case study that uses an LLM-in-the-loop to raise the clock frequency of a
BOOM core by reshaping its Chisel to shorten the
critical path — without sacrificing instructions-per-cycle (IPC). The full flow
lives in chia/examples/timing_opt.
Overview
The maximum frequency of a synthesized core is set by its critical path: the longest register-to-register logic delay. Shortening it is tedious expert work: read a synthesis timing report, find the gates on the worst path, and restructure the RTL (re-pipeline, rebalance muxing, precompute, …) to cut delay, all while holding cycle behavior fixed, since a faster clock that costs IPC may be a net loss. That trade-off is the Iron Law of processor performance (time/program = instructions/program × cycles/instruction × time/cycle): the win that matters is the product of frequency and IPC, not frequency alone.
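To make the Iron Law trade-off concrete, the arithmetic can be sketched as below. The function name and the sample numbers are illustrative assumptions, not part of timing_opt.

```python
def iron_law_speedup(freq_ratio: float, ipc_ratio: float) -> float:
    """Net speedup for a fixed instruction count.

    time/program = instructions/program * cycles/instruction * time/cycle,
    so performance scales with frequency * IPC.
    """
    if freq_ratio <= 0 or ipc_ratio <= 0:
        raise ValueError("ratios must be positive")
    return freq_ratio * ipc_ratio


# Illustrative numbers: a 2.0x clock with a 3.3% IPC loss.
net = iron_law_speedup(freq_ratio=2.0, ipc_ratio=1 - 0.033)
print(f"net speedup: {net:.3f}x")  # prints "net speedup: 1.934x"
```

By the same arithmetic, a 1.3x clock bought with a 30% IPC loss nets 0.91x, which is a slowdown despite the faster clock.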
timing_opt automates this as a search loop. Each iteration:
loads a parent design variant (its generated Verilog and Genus timing report) from a SQLite store,
asks an LLM to grep the timing report and edit the BOOM Chisel to shorten the BoomTile critical path,
rebuilds the design,
synthesizes BoomTile and runs the Verilator suite in parallel — synthesis measures the new worst-case slack, Verilator gates correctness and estimates IPC impact, and
records the result as a new child branch in a tree of variants.
Results accumulate in a SQLite database (timing.db), a multi-branch tree:
each branch stores its diff against the base RTL, generated Verilog, the Genus
timing report, synthesized area, per-benchmark top-down (TMA) counters, and logs.
You pick which parent to optimize each turn, and the flow produces one child per
invocation — so the search keeps building on its strongest candidates.
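One way to picture the variant tree is a table of branches that each point at their parent. The sketch below uses a hypothetical, much-reduced schema (the real timing.db layout is not shown in this document and may differ), and the slack values are made up for illustration.

```python
import sqlite3

# Hypothetical schema: the real timing.db stores more (Verilog, reports, logs).
SCHEMA = """
CREATE TABLE branches (
    name        TEXT PRIMARY KEY,
    parent      TEXT REFERENCES branches(name),  -- NULL for the root
    diff        TEXT NOT NULL,                   -- diff against the base RTL
    worst_slack REAL                             -- ns, from the timing report
);
"""


def lineage(conn, name):
    """Return the chain of branch names from the root down to `name`."""
    chain = []
    while name is not None:
        row = conn.execute(
            "SELECT parent FROM branches WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        chain.append(name)
        name = row[0]
    return chain[::-1]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO branches VALUES (?, ?, ?, ?)", [
    ("baseline", None, "", -1.20),                          # made-up slack
    ("baseline_timing_v1", "baseline", "<diff>", -0.80),
    ("baseline_timing_v1_timing_v1", "baseline_timing_v1", "<diff>", -0.55),
])
print(lineage(conn, "baseline_timing_v1_timing_v1"))
```

Because every branch stores its diff against the base RTL rather than against its parent, any node can be restaged without replaying its ancestors; the parent pointer is only needed to reconstruct the tree.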
Note
Result
Over 15 iterations of the flow, we achieved more than a 2x increase in frequency at only a 3.3% IPC loss, for a net Iron-Law performance improvement of 1.97x in the SkyWater 130nm process. See Section 5 of our arXiv paper for more on the results.
How it works
The per-iteration pipeline
run_improve_timing_loop() runs one parent → child pass. It stages the parent
exactly (so the tree reproduces it), lets the LLM edit, then validates the edit
with a build, a parallel synth + Verilator step, and a debug-retry loop on
failure:
DRIVER (one parent -> one child branch)
  |
  +-- load parent diff / generated Verilog / timing report from DB
  +-- acquire chipyard placement group + chipyard_bash tool   [chipyard node]
  +-- stage + reset: write parent Verilog, reset chipyard, re-apply parent diff
  |
  +-- /improve_timing                                         [llm worker]
  |     (grep the staged timing report, edit Chisel, A/B-test
  |      candidate edits with the timing_experiment tool)
  |
  +-- build all thread variants (build-debug retry loop)      [chisel node]
  |
  +-- PARALLEL:
  |     dispatch BoomTile synthesis (async)                   [vlsi worker]
  |     run Verilator suite                                   [verilator nodes]
  |     on Verilator failure: cancel synth, debug_failure, rebuild, retry
  |
  +-- collect synthesis result once Verilator is clean
  +-- persist child branch: reports, area, syn_obj tarball, TMA counters,
        produced timing report (worst-slack columns), parent-vs-child summary
On any failure a finally block reverts chipyard to the parent diff and records
the status; on success the edits are left in place (already persisted).
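The revert-on-failure behavior can be sketched with a plain try/finally. The three callables are hypothetical stand-ins for the flow's real steps, and recording the status on success as well as on failure is an assumption of this sketch.

```python
def run_iteration(edit_and_validate, revert_to_parent, record_status):
    """Run one parent -> child pass, reverting the workspace on any failure.

    All three callables are hypothetical placeholders, not the flow's API.
    """
    status = "failed"
    try:
        edit_and_validate()      # LLM edit, build, synth + Verilator, persist
        status = "ok"
    finally:
        if status != "ok":
            revert_to_parent()   # restore chipyard to the parent diff
        record_status(status)
    return status
```

Because the cleanup sits in `finally`, it also runs when the failure is an exception, which then propagates to the caller after the workspace has been restored.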
The timing_experiment A/B tool
A full BoomTile synthesis is expensive, so the LLM does not pay for one to test
each idea. TimingExperimentTool is an MCP tool that runs a cheap sub-block
Genus synthesis on the edited vs. unmodified RTL so the agent can A/B-test an edit
before committing to it. Even a small sub-block synth can outlast an MCP HTTP
round-trip, so the tool uses a start / poll split:
rebuild_verilog() re-elaborates Chisel to Verilog (make verilog, no C++ build) and caches it;
list_modules() / list_modules_parent() enumerate valid vlsi_top values on the edited / unmodified RTL;
start_synth_child(vlsi_top, …) / start_synth_parent(vlsi_top, …) dispatch the two sub-block synths and return handles in under a second (issue both for a parallel A/B comparison);
synth_status(handle, max_wait_seconds) polls an in-flight synth, returning running or the full area + worst-slack summary on completion.
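The start / poll split described above can be illustrated with the standard library. This is a minimal sketch of the pattern only: it uses a thread pool rather than the tool's actual dispatch mechanism, and `start_synth`, `fake_synth`, the result fields, and the module name are hypothetical.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor, TimeoutError
from uuid import uuid4

_pool = ThreadPoolExecutor(max_workers=2)
_jobs: dict[str, Future] = {}


def start_synth(vlsi_top: str, synth_fn) -> str:
    """Dispatch a synth in the background and return a handle immediately."""
    handle = uuid4().hex
    _jobs[handle] = _pool.submit(synth_fn, vlsi_top)
    return handle


def synth_status(handle: str, max_wait_seconds: float) -> dict:
    """Wait up to max_wait_seconds; report 'running' or the finished result."""
    try:
        result = _jobs[handle].result(timeout=max_wait_seconds)
    except TimeoutError:
        return {"state": "running"}
    return {"state": "done", "result": result}


def fake_synth(vlsi_top: str) -> dict:
    time.sleep(0.2)  # stand-in for a sub-block synthesis run
    return {"top": vlsi_top, "worst_slack_ns": -0.42}  # made-up number


# Issue both sides of the A/B comparison, then poll.
child = start_synth("SomeModule", fake_synth)
parent = start_synth("SomeModule", fake_synth)
print(synth_status(child, max_wait_seconds=0.01))  # still running
print(synth_status(child, max_wait_seconds=5))     # finished result
```

Bounding each poll with `max_wait_seconds` keeps every individual call shorter than the transport timeout while still letting the caller block efficiently instead of spinning.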
Each experiment is logged to the llm_experiments table by a head-pinned
ExperimentLogger actor, since the tool’s worker cannot see the head’s disk.
Distributing work across the cluster
The build/Verilator host (Chipyard) is held for one iteration via a placement
group, and the chipyard_bash tool the LLM drives is pinned to the same bundle:
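import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# Reserve the Chipyard host for this iteration.
pg = placement_group([{"CPU": 1, "chipyard": 1}], strategy="STRICT_PACK")
ray.get(pg.ready())
pg_opts = {"scheduling_strategy": PlacementGroupSchedulingStrategy(
    placement_group=pg, placement_group_bundle_index=0)}
# Pin the LLM's bash tool to the same bundle.
chipyard_bash = BashTool(
    name=bash_name, work_dir=CHIPYARD_PATH, task_options=pg_opts,
    timeout_seconds=600,
)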
Synthesis runs on a separate vlsi worker (requesting the VLSI / Syn
resources) and the Verilator suite on the verilator_run nodes, at the same
time. Synthesis measures worst-case slack; Verilator is primarily a correctness
gate (and an IPC-degradation estimate). If Verilator fails, the in-flight synth is
cancelled, a shared-session debug_failure agent repairs the build, and the
step retries up to --max-debug-retries times.
The loop ships this repo’s chia (plus the example packages) to every
worker via Ray py_modules, so workers import the head’s checkout regardless of
what their Docker image baked in:
RUNTIME_ENV = {
    "working_dir": str(_REPO_ROOT),
    "py_modules": [
        str(_REPO_ROOT / "chia"),
        str(_REPO_ROOT / "examples" / "common"),
        str(_REPO_ROOT / "examples" / "sky130_vlsi"),
        str(_REPO_ROOT / "examples" / "timing_opt"),
    ],
    "excludes": ["/DB/", "verilatorbins/ubench/", "__pycache__", ...],
}
The LLM in the loop
The optimizing agent runs on an llm worker via
ClaudeCodeLLM. It is handed the staged timing report
(too large to inline in the prompt) and two MCP tools: chipyard_bash to grep
the report and edit Chisel, and the timing_experiment A/B tool above:
llm = ClaudeCodeLLM(
    model=model,  # default: claude-opus-4-8
    timeout_seconds=timeout_seconds,
    log_dir="/tmp/ray/llm_logs",
    logging_name="improve_timing",
    extra_cli_args=["--effort", "max"],
)
return llm.prompt(prompt_text, tools)
Three prompt variants ship in prompts/ to steer the Iron-Law trade-off:
improve_timing.md (default): IPC-neutral edits only, reshaping logic to cut the critical path without changing cycle behavior;
improve_timing_ironlaw.md: allows IPC-trading moves when they win the Iron-Law product (frequency × IPC) (pass it via --prompt-file);
improve_timing_ironlaw_noab.md: the Iron-Law variant for runs without the timing_experiment A/B tool (pair it with --no-experiment-tool).
The seed flow
With an empty DB there is no parent to optimize, so main() first runs
seed_flow(): it resets chipyard to the unmodified base RTL (empty diff),
builds, synthesizes BoomTile, runs Verilator for the baseline TMA counters, and
stores the result as the baseline branch — the first DB entry and the root of
the variant tree. No LLM editing step.
Setup
Note that running this flow requires a logical worker environment with Genus. We do not publish setup details for the commercial tool, but if you have Genus licenses, feel free to reach out to us for help setting this up.
These steps mirror the example’s README.md. Run them from <repo>/chia
unless noted. Because the synthesis tool (we used Cadence Genus on the open-source
Sky130 PDK) and its collateral are commercial, you must supply your own vlsi
synthesis worker.
1. Head conda env — only the head needs it; workers get chia via Ray
py_modules and the cluster’s Docker images. The env is named timing_loop;
fill in its - -e /path/to/chia line with your checkout first:
conda env create -f examples/timing_opt/env.yml
conda activate timing_loop
2. Fill in the stubs — the example ships obvious placeholders
(/path/to/…, CHANGE_ME_…, ${VAR}) you must replace before the flow
runs end-to-end. See the Paths to fill in checklist in the README; the key
ones are the synthesis-tool binary and PDK collateral in
sky130_vlsi/tools-chia.yml, the timing-report relpaths and collateral paths in
constants.py (or the matching TIMING_OPT_* env vars), and the head IP, EC2
key, and vlsi worker in timing_cluster.yaml.
3. Benchmarks — fetch the Verilator test binaries (the suite reads
asmtests/ and embench/; dramsim_ini/ ships alongside):
git submodule update --init examples/timing_opt/verilatorbins
4. Bring up the cluster — chia up expands ${HEAD_IP}, ${USER}
from your shell environment. The reference topology is seven workers
(2 verilator_run + 2 chisel_build + 1 llm + 2 vlsi):
export HEAD_IP=10.0.0.10
chia up examples/timing_opt/timing_cluster.yaml
The first bring-up is slow — it pulls the large Chipyard / Verilator images. See Cluster Configuration Reference for the full schema.
5. Seed the baseline — with an empty DB the flow builds + synthesizes the
unmodified RTL and stores it as the baseline branch. TIMING_OPT_DB_DIR is
required and must be a stable absolute head path; the chia job submit
entrypoint runs under Ray’s job manager and does not inherit your shell, so
pass it (and any non-default collateral paths) via --runtime-env-json rather
than export. Run from the example dir so --working-dir . uploads it:
cd examples/timing_opt
chia job submit --working-dir . \
--runtime-env-json '{"env_vars": {"TIMING_OPT_DB_DIR": "/abs/path/on/head/timing_opt_DB"}}' \
-- python improve_timing.py --seed-only
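Hand-quoting the JSON inside a shell string is easy to get wrong, so one option is to generate the command line. The helper below is a hypothetical convenience, not part of chia; it only reassembles the flags already shown in the command above.

```python
import json
import shlex


def submit_command(db_dir: str, script_args: list[str]) -> str:
    """Build a `chia job submit` command line with a safely quoted JSON blob."""
    if not db_dir.startswith("/"):
        raise ValueError("TIMING_OPT_DB_DIR must be an absolute path")
    runtime_env = json.dumps({"env_vars": {"TIMING_OPT_DB_DIR": db_dir}})
    argv = [
        "chia", "job", "submit", "--working-dir", ".",
        "--runtime-env-json", runtime_env,
        "--", "python", "improve_timing.py", *script_args,
    ]
    return shlex.join(argv)  # quotes the JSON so the shell passes it intact


print(submit_command("/abs/path/on/head/timing_opt_DB", ["--seed-only"]))
```

The same helper covers later iterations by passing `["--branch", "baseline"]`, which keeps the DB path identical across jobs.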
6. Run a timing-optimization iteration — pick a parent branch to optimize;
each invocation produces one child (<parent>_timing_v<N>, auto-incremented).
Reuse the same --runtime-env-json block so the DB path stays consistent:
chia job submit --working-dir . \
--runtime-env-json '{"env_vars": {"TIMING_OPT_DB_DIR": "/abs/path/on/head/timing_opt_DB"}}' \
-- python improve_timing.py --branch baseline
(With an empty DB you can skip step 5 — the loop auto-seeds the baseline first,
then optimizes it in the same job. Re-run with the same --branch to grow the
tree wider with another sibling.)
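The child naming scheme can be sketched as below. The document states only that names take the form <parent>_timing_v<N> and auto-increment; numbering from 1 and picking one past the highest existing sibling are assumptions of this sketch, and `next_child_name` is a hypothetical helper.

```python
import re


def next_child_name(parent: str, existing: list[str]) -> str:
    """Return `<parent>_timing_v<N>` with N one past the highest sibling."""
    pattern = re.compile(re.escape(parent) + r"_timing_v(\d+)")
    # fullmatch skips grandchildren such as <parent>_timing_v1_timing_v1.
    versions = [int(m.group(1)) for name in existing
                if (m := pattern.fullmatch(name))]
    return f"{parent}_timing_v{max(versions, default=0) + 1}"


branches = ["baseline", "baseline_timing_v1", "baseline_timing_v1_timing_v1"]
print(next_child_name("baseline", branches))            # baseline_timing_v2
print(next_child_name("baseline_timing_v1", branches))  # baseline_timing_v1_timing_v2
```

Re-running with the same parent widens the tree (a new sibling), while passing a child's name as the parent deepens it.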
7. Inspect results — the reporting script prints worst slack / achievable frequency per branch against a target period:
python examples/timing_opt/scripts/perf_table.py \
--db /abs/path/on/head/timing_opt_DB/timing.db --target <ns>
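For reference, the usual conversion from worst slack to achievable frequency is sketched below. This is the standard static-timing relationship, assumed here rather than taken from perf_table.py, whose exact computation is not shown in this document.

```python
def achievable_mhz(target_period_ns: float, worst_slack_ns: float) -> float:
    """Frequency implied by worst slack against a target clock period.

    Negative slack means the worst path misses the target, so the achievable
    period is longer than the target: period = target - slack.
    """
    period_ns = target_period_ns - worst_slack_ns
    if period_ns <= 0:
        raise ValueError("slack exceeds the target period")
    return 1000.0 / period_ns  # 1000 / ns gives MHz


print(achievable_mhz(10.0, -2.5))  # 12.5 ns period -> 80.0 MHz
print(achievable_mhz(10.0, 2.0))   # 8.0 ns period -> 125.0 MHz
```

Using the same target period for every branch keeps the comparison fair, since slack is only meaningful relative to the constraint it was measured against.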
8. Tear down — when the run is done:
chia down examples/timing_opt/timing_cluster.yaml