BOOM Critical-Path Timing Optimization

A CHIA case study that puts an LLM in the loop to raise the clock frequency of a BOOM core: the LLM reshapes the core's Chisel to shorten the critical path without sacrificing instructions per cycle (IPC). The full flow lives in chia/examples/timing_opt.

Overview

The maximum frequency of a synthesized core is set by its critical path — the longest register-to-register logic delay. Shortening it is expert, tedious work: read a synthesis timing report, find the gates on the worst path, and restructure the RTL (re-pipeline, rebalance muxing, precompute, …) to cut delay — all while holding cycle behavior fixed, since a faster clock that costs you IPC may be a net loss. That trade-off is the Iron Law of processor performance (time/program = instructions × cycles/instruction × time/cycle): the win that matters is the product of frequency and IPC, not frequency alone.
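As a rough illustration of the Iron-Law trade-off above, the net speedup for a fixed instruction count is the product of the frequency ratio and the IPC ratio. The function name and the numbers below are illustrative assumptions, not part of timing_opt:

```python
def iron_law_speedup(freq_ratio: float, ipc_ratio: float) -> float:
    """Net speedup for a fixed instruction count.

    time/program = instructions * (1 / IPC) * (1 / frequency), so
    old_time / new_time = (f_new / f_old) * (IPC_new / IPC_old).
    """
    return freq_ratio * ipc_ratio


# Illustrative numbers only: a 1.5x clock that costs 10% IPC is still a net win...
faster_clock = iron_law_speedup(1.5, 0.90)
# ...while a 1.1x clock that costs 10% IPC is a slight net loss.
marginal_clock = iron_law_speedup(1.1, 0.90)

print(f"{faster_clock:.2f}x, {marginal_clock:.2f}x")
```

A ratio below 1.0 means the edit made programs slower overall, even though the clock got faster.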

timing_opt automates this as a search loop. Each iteration:

loads a parent design variant (its generated Verilog and Genus timing report) from a SQLite store,
asks an LLM to grep the timing report and edit the BOOM Chisel to shorten the BoomTile critical path,
rebuilds the design,
synthesizes BoomTile and runs the Verilator suite in parallel — synthesis measures the new worst-case slack, Verilator gates correctness and estimates IPC impact, and
records the result as a new child branch in a tree of variants.

Results accumulate in a SQLite database (timing.db), a multi-branch tree: each branch stores its diff against the base RTL, generated Verilog, the Genus timing report, synthesized area, per-benchmark top-down (TMA) counters, and logs. You pick which parent to optimize each turn, and the flow produces one child per invocation — so the search keeps building on its strongest candidates.
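To make the tree-of-variants idea concrete, here is a minimal sketch of a parent-linked branch table in SQLite. The table name, columns, and slack values are hypothetical and much simpler than whatever timing.db actually stores; the sketch only shows how a parent pointer per row yields a multi-branch tree:

```python
import sqlite3

# Hypothetical, simplified schema: one row per branch, linked to its parent.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE branches (
        name           TEXT PRIMARY KEY,
        parent         TEXT REFERENCES branches(name),
        diff           TEXT NOT NULL,   -- diff against the base RTL
        worst_slack_ns REAL
    )
""")
# Made-up slack values, for illustration only.
conn.executemany("INSERT INTO branches VALUES (?, ?, ?, ?)", [
    ("baseline", None, "", -2.0),
    ("baseline_timing_v1", "baseline", "<diff>", -1.2),
    ("baseline_timing_v2", "baseline", "<diff>", -1.5),
    ("baseline_timing_v1_timing_v1", "baseline_timing_v1", "<diff>", -0.7),
])


def lineage(conn, name):
    """Walk parent pointers from a branch back to the root, returned root-first."""
    chain = []
    while name is not None:
        chain.append(name)
        (name,) = conn.execute(
            "SELECT parent FROM branches WHERE name = ?", (name,)
        ).fetchone()
    return chain[::-1]


# The branch with the largest (least negative) worst slack is the strongest candidate.
best = conn.execute(
    "SELECT name FROM branches ORDER BY worst_slack_ns DESC LIMIT 1"
).fetchone()[0]
print(best, lineage(conn, best))
```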

Result

Over 15 iterations of the flow, we achieved more than a 2x increase in frequency at only a 3.3% IPC loss, for a net Iron-Law performance improvement of 1.97x in the SkyWater 130 nm process. See Section 5 of our arXiv paper for the full results.

How it works

The per-iteration pipeline

run_improve_timing_loop() runs one parent → child pass. It stages the parent exactly (so the tree reproduces it), lets the LLM edit, then validates the edit with a build, a parallel synth + Verilator step, and a debug-retry loop on failure:

DRIVER (one parent -> one child branch)
  |
  +-- load parent diff / generated Verilog / timing report from DB
  +-- acquire chipyard placement group + chipyard_bash tool   [chipyard node]
  +-- stage + reset: write parent Verilog, reset chipyard, re-apply parent diff
  |
  +-- /improve_timing                                          [llm worker]
  |     (grep the staged timing report, edit Chisel, A/B-test
  |      candidate edits with the timing_experiment tool)
  |
  +-- build all thread variants (build-debug retry loop)       [chisel node]
  |
  +-- PARALLEL:
  |     dispatch BoomTile synthesis (async)                    [vlsi worker]
  |     run Verilator suite                                    [verilator nodes]
  |     on Verilator failure: cancel synth, debug_failure, rebuild, retry
  |
  +-- collect synthesis result once Verilator is clean
  +-- persist child branch: reports, area, syn_obj tarball, TMA counters,
        produced timing report (worst-slack columns), parent-vs-child summary

On any failure a finally block reverts chipyard to the parent diff and records the status; on success the edits are left in place (already persisted).
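The revert-on-failure behavior can be sketched with a plain try/finally. The callables here are hypothetical stand-ins for the real staging, validation, and DB steps, and recording the status on success as well as on failure is an assumption of this sketch:

```python
def run_iteration(apply_edit, validate, revert_to_parent, record_status):
    """Run one parent -> child pass; revert to the parent state unless it succeeds."""
    status = "failed"
    try:
        apply_edit()
        validate()          # build, synth, Verilator in the real flow
        status = "ok"
    finally:
        if status != "ok":
            revert_to_parent()   # leave the workspace as the parent had it
        record_status(status)
    return status


# Demo with a validation step that always fails.
events = []

def failing_validate():
    raise RuntimeError("verilator failed")

try:
    run_iteration(
        apply_edit=lambda: events.append("edit"),
        validate=failing_validate,
        revert_to_parent=lambda: events.append("revert"),
        record_status=events.append,
    )
except RuntimeError:
    events.append("raised")

print(events)
```

Because the cleanup sits in finally, it runs whether the failure is a raised exception or an early return, and the original exception still propagates to the caller.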

The `timing_experiment` A/B tool

A full BoomTile synthesis is expensive, so the LLM does not pay for one to test each idea. TimingExperimentTool is an MCP tool that runs a cheap sub-block Genus synthesis on the edited vs. unmodified RTL so the agent can A/B-test an edit before committing to it. Even a small sub-block synth can outlast an MCP HTTP round-trip, so the tool uses a start / poll split:

rebuild_verilog() re-elaborates Chisel to Verilog (make verilog, no C++ build) and caches it;
list_modules() / list_modules_parent() enumerate valid vlsi_top values on the edited / unmodified RTL;
start_synth_child(vlsi_top, …) / start_synth_parent(vlsi_top, …) dispatch the two sub-block synths and return handles in under a second; issue both for a parallel A/B comparison;
synth_status(handle, max_wait_seconds) polls an in-flight synth, returning running or the full area + worst-slack summary on completion.
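The start / poll split can be illustrated with a thread pool standing in for the remote synthesis workers. This is only a sketch of the pattern: start_synth, the fake job, the module name, and the returned fields are hypothetical, and the real tool dispatches Genus runs rather than sleeping:

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=2)
_jobs = {}  # handle -> Future


def _fake_synth(vlsi_top, slack_ns):
    """Stand-in for a long-running sub-block synthesis."""
    time.sleep(0.2)
    return {"vlsi_top": vlsi_top, "worst_slack_ns": slack_ns}


def start_synth(vlsi_top, slack_ns):
    """Dispatch the job and return a handle immediately."""
    handle = uuid.uuid4().hex
    _jobs[handle] = _pool.submit(_fake_synth, vlsi_top, slack_ns)
    return handle


def synth_status(handle, max_wait_seconds):
    """Wait at most max_wait_seconds; report 'running' or the finished summary."""
    try:
        result = _jobs[handle].result(timeout=max_wait_seconds)
    except FutureTimeout:
        return {"status": "running"}
    return {"status": "done", **result}


# Issue both sides of the A/B comparison, then poll.
child = start_synth("SomeModule", -0.4)    # edited RTL (made-up slack)
parent = start_synth("SomeModule", -0.9)   # unmodified RTL (made-up slack)
first_poll = synth_status(child, max_wait_seconds=0)
child_result = synth_status(child, max_wait_seconds=5)
parent_result = synth_status(parent, max_wait_seconds=5)
_pool.shutdown()
print(first_poll, child_result, parent_result)
```

Each call returns within its own time budget, so no single request has to stay open for the full duration of a synthesis run.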

Each experiment is logged to the llm_experiments table by a head-pinned ExperimentLogger actor, since the tool’s worker cannot see the head’s disk.

Distributing work across the cluster

The build/Verilator host (Chipyard) is held for one iteration via a placement group, and the chipyard_bash tool the LLM drives is pinned to the same bundle:

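import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# Reserve one CPU plus the custom "chipyard" resource on a single node.
pg = placement_group([{"CPU": 1, "chipyard": 1}], strategy="STRICT_PACK")
ray.get(pg.ready())  # block until the bundle is reserved
# Pin anything launched with these options to bundle 0 of that group.
pg_opts = {"scheduling_strategy": PlacementGroupSchedulingStrategy(
    placement_group=pg, placement_group_bundle_index=0)}
chipyard_bash = BashTool(
    name=bash_name, work_dir=CHIPYARD_PATH, task_options=pg_opts,
    timeout_seconds=600,
)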

Synthesis runs on a separate vlsi worker (requesting the VLSI / Syn resources) and the Verilator suite on the verilator_run nodes, at the same time. Synthesis measures worst-case slack; Verilator is primarily a correctness gate (and an IPC-degradation estimate). If Verilator fails, the in-flight synth is cancelled, a shared-session debug_failure agent repairs the build, and the step retries up to --max-debug-retries times.
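The cancel-and-retry control flow described above might look roughly like the sketch below. It uses a local thread and a cooperative cancel flag as stand-ins for the Ray-dispatched synthesis, and every function name is hypothetical; only the ordering (dispatch synth, run Verilator, cancel and debug on failure, retry a bounded number of times) follows the text:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def validate_with_retries(run_synth, run_verilator, debug_failure, max_debug_retries):
    """Run synth and Verilator together; on a Verilator failure cancel, debug, retry."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        for attempt in range(max_debug_retries + 1):
            cancel = threading.Event()
            synth = pool.submit(run_synth, cancel)   # async dispatch
            if run_verilator():                      # correctness gate
                return synth.result()                # collect once Verilator is clean
            cancel.set()                             # stop the in-flight synth
            synth.result()                           # wait for it to wind down
            if attempt < max_debug_retries:
                debug_failure()                      # repair, then loop again
    raise RuntimeError("Verilator still failing after debug retries")


def fake_synth(cancel):
    """Pretend synthesis: returns None if cancelled before it 'finishes'."""
    if cancel.wait(timeout=0.05):
        return None
    return {"worst_slack_ns": -0.3}   # made-up value


# Demo: Verilator fails once, the debug step runs, and the retry passes.
outcomes = iter([False, True])
debug_calls = []
result = validate_with_retries(
    fake_synth,
    run_verilator=lambda: next(outcomes),
    debug_failure=lambda: debug_calls.append("debug"),
    max_debug_retries=2,
)
print(result, debug_calls)
```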

The loop ships this repo’s chia (plus the example packages) to every worker via Ray py_modules, so workers import the head’s checkout regardless of what their Docker image baked in:

RUNTIME_ENV = {
    "working_dir": str(_REPO_ROOT),
    "py_modules": [
        str(_REPO_ROOT / "chia"),
        str(_REPO_ROOT / "examples" / "common"),
        str(_REPO_ROOT / "examples" / "sky130_vlsi"),
        str(_REPO_ROOT / "examples" / "timing_opt"),
    ],
    "excludes": ["/DB/", "verilatorbins/ubench/", "__pycache__", ...],
}

The LLM in the loop

The optimizing agent runs on an llm worker via ClaudeCodeLLM. It is handed the staged timing report (too large to inline in the prompt) and two MCP tools: chipyard_bash to grep the report and edit Chisel, and the timing_experiment A/B tool above:

llm = ClaudeCodeLLM(
    model=model,                       # default: claude-opus-4-8
    timeout_seconds=timeout_seconds,
    log_dir="/tmp/ray/llm_logs",
    logging_name="improve_timing",
    extra_cli_args=["--effort", "max"],
)
return llm.prompt(prompt_text, tools)

Three prompt variants ship in prompts/ to steer the Iron-Law trade-off:

improve_timing.md (default) — IPC-neutral edits only: reshape logic to cut the critical path without changing cycle behavior;
improve_timing_ironlaw.md — allow IPC-trading moves when they win the iron-law product (frequency × IPC); pass via --prompt-file;
improve_timing_ironlaw_noab.md — the iron-law variant for runs without the timing_experiment A/B tool; pair it with --no-experiment-tool.

The seed flow

With an empty DB there is no parent to optimize, so main() first runs seed_flow(): it resets chipyard to the unmodified base RTL (empty diff), builds, synthesizes BoomTile, runs Verilator for the baseline TMA counters, and stores the result as the baseline branch, which is the first DB entry and the root of the variant tree. The seed flow has no LLM editing step.

Setup

Note that running this flow requires a logical worker environment with Genus. We do not publish setup details for the commercial tool, but if you have Genus licenses, feel free to reach out to us for help setting it up.

These steps mirror the example’s README.md. Run them from <repo>/chia unless noted. Because the synthesis tool and its collateral are commercial (we used Cadence Genus with the open-source Sky130 PDK), you must supply your own vlsi synthesis worker.

1. Head conda env — only the head needs it; workers get chia via Ray py_modules and the cluster’s Docker images. The env is named timing_loop; fill in its - -e /path/to/chia line with your checkout first:

conda env create -f examples/timing_opt/env.yml
conda activate timing_loop

2. Fill in the stubs — the example ships obvious placeholders (/path/to/…, CHANGE_ME_…, ${VAR}) you must replace before the flow runs end-to-end. See the Paths to fill in checklist in the README; the key ones are the synthesis-tool binary and PDK collateral in sky130_vlsi/tools-chia.yml, the timing-report relpaths and collateral paths in constants.py (or the matching TIMING_OPT_* env vars), and the head IP, EC2 key, and vlsi worker in timing_cluster.yaml.

3. Benchmarks — fetch the Verilator test binaries (the suite reads asmtests/ and embench/; dramsim_ini/ ships alongside):

git submodule update --init examples/timing_opt/verilatorbins

4. Bring up the cluster: chia up expands variables such as ${HEAD_IP} and ${USER}. The reference topology is seven workers (2 verilator_run + 2 chisel_build + 1 llm + 2 vlsi):

export HEAD_IP=10.0.0.10
chia up examples/timing_opt/timing_cluster.yaml

The first bring-up is slow — it pulls the large Chipyard / Verilator images. See Cluster Configuration Reference for the full schema.

5. Seed the baseline — with an empty DB the flow builds + synthesizes the unmodified RTL and stores it as the baseline branch. TIMING_OPT_DB_DIR is required and must be a stable absolute head path; the chia job submit entrypoint runs under Ray’s job manager and does not inherit your shell, so pass it (and any non-default collateral paths) via --runtime-env-json rather than export. Run from the example dir so --working-dir . uploads it:

cd examples/timing_opt
chia job submit --working-dir . \
  --runtime-env-json '{"env_vars": {"TIMING_OPT_DB_DIR": "/abs/path/on/head/timing_opt_DB"}}' \
  -- python improve_timing.py --seed-only
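If you prefer to generate the --runtime-env-json argument rather than hand-write the quoting, a small helper like the following can build it. The helper is a hypothetical convenience, not part of the example; it only uses the standard library to serialize and shell-quote the same JSON shown above:

```python
import json
import shlex
from pathlib import PurePosixPath


def runtime_env_arg(db_dir: str) -> str:
    """Return a shell-quoted JSON string for --runtime-env-json."""
    if not PurePosixPath(db_dir).is_absolute():
        raise ValueError("TIMING_OPT_DB_DIR must be an absolute path on the head")
    env = {"env_vars": {"TIMING_OPT_DB_DIR": db_dir}}
    return shlex.quote(json.dumps(env))


arg = runtime_env_arg("/abs/path/on/head/timing_opt_DB")
print(f"chia job submit --working-dir . --runtime-env-json {arg} "
      "-- python improve_timing.py --seed-only")
```

Reusing one helper for both the seed and the optimization jobs keeps the DB path identical across submissions.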

6. Run a timing-optimization iteration — pick a parent branch to optimize; each invocation produces one child (<parent>_timing_v<N>, auto-incremented). Reuse the same --runtime-env-json block so the DB path stays consistent:

chia job submit --working-dir . \
  --runtime-env-json '{"env_vars": {"TIMING_OPT_DB_DIR": "/abs/path/on/head/timing_opt_DB"}}' \
  -- python improve_timing.py --branch baseline

(With an empty DB you can skip step 5 — the loop auto-seeds the baseline first, then optimizes it in the same job. Re-run with the same --branch to grow the tree wider with another sibling.)
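The <parent>_timing_v<N> naming can be sketched as a small helper that picks the next free suffix among a parent's direct children. The function is hypothetical, and starting the numbering at 1 is an assumption; the real flow may allocate names differently:

```python
import re


def next_child_name(parent, existing):
    """Return '<parent>_timing_v<N>' with N one past the highest direct child."""
    pattern = re.compile(rf"^{re.escape(parent)}_timing_v(\d+)$")
    versions = [int(m.group(1)) for name in existing if (m := pattern.match(name))]
    return f"{parent}_timing_v{max(versions, default=0) + 1}"


branches = ["baseline", "baseline_timing_v1", "baseline_timing_v1_timing_v1"]
sibling = next_child_name("baseline", branches)               # widen the tree
grandchild = next_child_name("baseline_timing_v1", branches)  # deepen it
print(sibling, grandchild)
```

Anchoring the pattern at both ends keeps grandchildren such as baseline_timing_v1_timing_v1 from being counted as direct children of baseline.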

7. Inspect results — the reporting script prints worst slack / achievable frequency per branch against a target period:

python examples/timing_opt/scripts/perf_table.py \
  --db /abs/path/on/head/timing_opt_DB/timing.db --target <ns>
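For reference, the usual relationship between worst slack and achievable frequency is that the minimum workable period equals the target period minus the worst slack (negative slack lengthens it). The sketch below applies that standard formula; it is an assumption that perf_table.py computes the same quantity, and the function name and numbers are illustrative:

```python
def achievable_mhz(target_period_ns: float, worst_slack_ns: float) -> float:
    """Achievable clock in MHz: min period = target period - worst slack."""
    min_period_ns = target_period_ns - worst_slack_ns
    if min_period_ns <= 0:
        raise ValueError("slack cannot exceed the target period")
    return 1000.0 / min_period_ns   # 1 / ns = GHz, so scale by 1000 for MHz


# Made-up numbers: a 10 ns target missed by 2.5 ns, then met with 2 ns to spare.
failing = achievable_mhz(10.0, -2.5)
passing = achievable_mhz(10.0, 2.0)
print(failing, passing)
```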

8. Tear down — when the run is done:

chia down examples/timing_opt/timing_cluster.yaml