gem5 ↔ BOOM Microarchitecture Alignment ======================================= A CHIA case study that uses an LLM-in-the-loop to tune a `gem5 `_ performance model until it matches a target `BOOM `_ core. The full flow lives in ``chia/examples/gem5_align``. Overview -------- Architects routinely keep a fast, high-level microarchitectural simulator (gem5) and a cycle-exact RTL implementation (a BOOM core in Chipyard) of the *same* microarchitecture. The two drift apart: the gem5 model mispredicts cycle counts because its configuration — and sometimes its C++ source — no longer reflects the RTL. **Alignment** is the work of closing that gap, and it's a huge lift, even for large engineering teams. It is nearly impossible for small teams and academics, and usually the simulators do not stay aligned, or are never aligned in the first place. ``gem5_align`` automates it as a search loop. Each iteration: #. restores a *parent* gem5 source + config state sampled from the best results so far, #. asks an LLM to edit the gem5 configuration and/or ``src/`` to better match BOOM, #. rebuilds gem5, #. runs a benchmark suite, and #. compares gem5 cycle counts against cached **Verilator golden** counts to get a per-benchmark ``%diff``. Results are written to a SQLite database (``alignment.db``); the next iteration samples its parent from the top entries, so the search keeps building on its strongest candidates. ``N`` iterations run concurrently — one per physical gem5 node — via a Ray placement group. .. note:: **Result** We ran alignment on medium BOOM for 10.5 days, and achieved a gem5 core and configuration whose cycle counts were accurate to under 3% on average across our benchmarks! Read section 5 of our `arXiv paper `_ to hear more about our results. How it works ------------ The per-iteration flow ~~~~~~~~~~~~~~~~~~~~~~~~ The head runs ``N`` iterations concurrently — one per physical gem5 node. A single placement group with ``N`` ``STRICT_SPREAD`` bundles pins each bundle to a distinct node, and a thread pool on the head drives one iteration per bundle. When an iteration finishes, the bundle is freed and the next iteration is dispatched onto it with a freshly sampled parent:: HEAD THREAD (one per bundle) | +-- sample parent uniformly from DB.top_k_entries(2) +-- restore_gem5_state(parent.config, parent.diff) [gem5 bundle] +-- rebuild_gem5() [gem5 bundle] | +-- align_node(...) [llm worker] | (analyze parent's results, read BOOM source, edit config/source) | +-- rebuild_gem5() [gem5 bundle] +-- run_gem5_comparison() [gem5 bundle] | (debug_node loop on failures) | +-- persist IterationResult to DB + logs, dispatch next iteration The Verilator golden cache ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Alignment needs a ground truth to score against. ``ensure_verilator_cache`` builds the target Chipyard config once on a ``chisel_build`` node, runs every microbenchmark on ``verilator_run`` nodes, and caches the result (``.log`` + ``.out``) on the head. Subsequent runs reuse the cache, so the expensive RTL simulation happens only once: .. code-block:: python def ensure_verilator_cache() -> bool: """If verilator results are not cached, build BUILD_CONFIG and run all benchmarks.""" benchmarks = sorted(p.stem.replace(".verilator", "") for p in UBENCH_BUILD.glob("*.verilator.riscv")) missing = [b for b in benchmarks if not (VERILATOR_CACHE / f"{b}.log").exists() or not (VERILATOR_CACHE / f"{b}.out").exists()] if not missing: print(f"[verilator cache] All {len(benchmarks)} benchmarks cached. Skipping build.") return True # else: build BUILD_CONFIG on a chisel node, dispatch every missing # benchmark to verilator nodes, and cache .log / .out ... Distributing gem5 work across the cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each gem5 node is one ``STRICT_SPREAD`` bundle of a shared placement group, with a canonical :class:`~chia.simulators.gem5.Gem5Node` pinned to it: .. code-block:: python gem5_pg = placement_group([{"CPU": 1, "gem5": 1}] * N, strategy="STRICT_SPREAD") ray.get(gem5_pg.ready()) bundle_nodes = [Gem5Node(placement_group=gem5_pg, bundle_index=i) for i in range(N)] The gem5 build / run / restore steps are ordinary CHIA functions tagged with a *fractional* ``gem5`` resource, so a gem5 op and its co-located tool actors share a single ``{CPU: 1, gem5: 1}`` bundle rather than each grabbing a whole node: .. code-block:: python @ChiaFunction(resources={"gem5": 0.5}) def rebuild_gem5() -> tuple[bool, str, float, str, str, str]: ... The loop ships **this repo's** ``chia`` to every worker via Ray ``py_modules``, so workers import the head's checkout regardless of what their Docker image baked in: .. code-block:: python _RUNTIME_ENV = { "py_modules": [str(_CHIA_PKG)], # ship THIS repo's chia to every worker "excludes": ["**/__pycache__", "**/*.pyc"], } ... ray.init(address=os.environ.get("RAY_ADDRESS", "auto"), runtime_env=_RUNTIME_ENV) The LLM in the loop ~~~~~~~~~~~~~~~~~~~~ The aligning agent runs on an ``llm`` worker via :class:`~chia.models.claude.ClaudeCodeLLM`. It is given the per-benchmark cycle-count comparison table (gem5 vs Verilator ``%diff``), the benchmark descriptions, and the run history. To diagnose *why* a benchmark is off, it also gets two key diagnostic signals: - **O3PipeView pipeline traces** — gem5 runs under ``--debug-flags=O3PipeView``, and the parent iteration's traces are staged on the worker so the agent can diff gem5 retire cycles against Verilator commit cycles at matching PCs; - **performance counters** — BOOM's top-down (TMA) counters from the Verilator run (``divider_active``, ``stq_full``, ``dcache_miss``, ``br_mispredict``, …), which it cross-checks against the corresponding gem5 ``stats.txt`` counters to localize the mis-modeled mechanism. Alongside these it has MCP tools to read the BOOM Chisel source, edit and rebuild the gem5 tree, and quick-run gem5 on a few benchmarks to test a hypothesis. It then proposes edits to the gem5 config and/or C++ source: .. code-block:: python llm = ClaudeCodeLLM( model="claude-opus-4-6", timeout_seconds=3600, logging_name="align", resume_session=session_id is not None, extra_cli_args=["--effort", "max"], ) A separate ``debug_node`` resumes the *same* CLI session if a rebuild or run fails, so the agent can iteratively fix its own changes within one iteration. Targeting a different configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The target is the one thing you change to retarget the whole flow. In ``config.py``, ``BUILD_CONFIG`` is the full Chipyard config class (built on the chisel node and surfaced to the LLM) and ``CONFIG_SLUG`` suffixes on-disk artifacts so multiple targets can coexist: .. code-block:: python BUILD_CONFIG = "SmallBoomV3HumanCommitLogTMAConfig" # full Chipyard config class CONFIG_SLUG = "smallboom" # suffixes on-disk artifacts Setup ----- These steps mirror the example's ``README.md``. Run them from ``/chia`` unless noted. **1. Head conda env** — run from the example directory: .. code-block:: bash cd examples/gem5_align conda env create -f env.yml conda activate gem5-align Only the head needs this environment; workers get chia via Ray ``py_modules`` and run the Docker images named in ``cluster.yaml``. If you rename the env, update ``cluster.yaml`` to match. **2. Cluster** — in ``cluster.yaml``, set ``provider.head_ip`` (the host running ``chia up`` / the Ray head) and each node type's ``compatible_ips``. The required worker counts (``max_workers``): .. list-table:: :header-rows: 1 :widths: 20 12 68 * - Node type - Workers - Role * - ``llm`` - 6 - aligning/debugging LLM (light) * - ``chisel_build`` - 1 - builds the verilator golden cache (heavy — Chipyard) * - ``verilator_run`` - 4 - runs verilator goldens * - ``gem5_worker`` - 6 - builds + runs gem5 ``compatible_ips`` is the **pool** of hosts a type's workers may run on, not a 1:1 list — multiple workers (and multiple node types) co-locate on one host as separate containers, so you need fewer hosts than the worker total. ``head_ip`` is a single string; ``compatible_ips`` is a list (``${ENV_VAR}`` is expanded): .. code-block:: yaml provider: head_ip: 10.0.0.10 available_node_types: gem5_worker: compatible_ips: ["10.0.0.11", "10.0.0.12"] # inline, block, or ${VAR} Every listed host must be reachable via the SSH credentials in the top-level ``auth`` section (``ssh_user``, plus optional ``ssh_private_key`` path), have Docker, and run an SSH agent. The ``llm`` hosts bind-mount your Claude Code config via the ``-v :/home/ray/.claude`` run option. All four node types are required. See :doc:`/user_guides/cluster_config_reference` for the full schema. **3. Benchmarks** — fetch the examples/benchmarks submodule, then compile the ``ubench`` suite the loop runs (with the step-1 ``gem5-align`` env active — it provides the ``riscv64-unknown-elf-`` toolchain): .. code-block:: bash git submodule update --init examples/benchmarks cd examples/benchmarks/ubench && ./compile.sh # -> build/.{gem5.elf,verilator.riscv} Without ``build/`` populated, the verilator cache step finds zero benchmarks and silently skips (see the note above). **4. Config** — in ``config.py``, set ``BUILD_CONFIG`` + ``CONFIG_SLUG`` together to your target. Optional env overrides: ``GEM5_ALIGN_BENCH_ROOT``, ``GEM5_ALIGN_LOG_DIR``. **5. Bring up the cluster** — from ``/chia`` with the env active: .. code-block:: bash chia up examples/gem5_align/cluster.yaml This can take a while on the first run, since the Chipyard Docker image is large to pull. On slow links the pull may time out; raise the pull timeout and retry (progress is saved if you restart quickly). **6. Launch the alignment job** — ``GEM5_ALIGN_VERILATOR_CACHE`` is **required** and names the verilator golden cache on the head (an existing dir to reuse, or a fresh writable dir to generate on the first run). The entrypoint runs under Ray's job manager and does **not** inherit your shell, so pass it via ``--runtime-env-json`` rather than ``export``. Run from the example dir so ``--working-dir .`` uploads its files: .. code-block:: bash cd examples/gem5_align chia job submit --working-dir . \ --runtime-env-json '{"env_vars": {"GEM5_ALIGN_VERILATOR_CACHE": "/abs/path/on/head"}}' \ -- python gem5_align_loop.py Add ``GEM5_ALIGN_BENCH_ROOT`` / ``GEM5_ALIGN_LOG_DIR`` to ``env_vars`` only if overriding their defaults. **7. Tear down** — when the run is done: .. code-block:: bash chia down examples/gem5_align/cluster.yaml Outputs land in ``GEM5_ALIGN_LOG_DIR`` (default ``examples/gem5_align/align_loop_logs-/``): per-iteration ``iter_N/`` artifacts, ``alignment.db``, and TensorBoard ``metrics/``. Re-running resumes from a non-empty ``alignment.db``.