Example: Agentic RoCC Accelerator (MemCpy)
A worked, end-to-end example of an agentic hardware-design loop: an LLM (Claude Code) designs a RISC-V RoCC accelerator in Chisel, CHIA builds it into a MegaBoom SoC, runs it against a bare-metal test, and, on any failure, feeds the error back to the LLM to debug and retry until the test passes.
The example lives at examples/memcpy/ and is essentially self-contained: it
depends only on the installed chia package and the shared DRAMSim2 ini files
in examples/common/dramsim_ini. It is a good starting point for building your
own generate-build-simulate-debug loops on top of CHIA’s chipyard nodes.
What it does
The accelerator is a hardware memcpy: it copies an array of 64-bit elements
from a source region to a destination region, driven by two custom
instructions on opcode custom1. The target design is
MegaBoomV3HumanCommitLogConfig (MegaBoom V3 with the human-readable
commit-log harness, so a failing run yields a readable instruction trace for
the debugger).
The loop runs two nodes in parallel, then builds, runs, and debugs:
    test build (parallel)              implement (parallel)
    copy memcpy.c into                 Claude writes memcpy.scala
    $chipyard/tests, run cmake         (the RoCC accelerator) and
    -> build/memcpy.riscv              wires it into the target
       build/memcpy.dump               config, via chipyard_bash
    └──────────────┬──────────────┘
                   ▼
    chisel build (ChiselBuildNode)     target: MegaBoomV3HumanCommitLogConfig
                   ▼
    verilator run (VerilatorRunNode)   memcpy.riscv, +loadmem +verbose
                   ▼
    build failed / sim failed / incorrect?
       │ yes (≤ NUM_DEBUG_ATTEMPTS)         │ no
       ▼                                    ▼
    debug (Claude) ── rebuild + rerun     DONE (passed)
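The control flow in the diagram can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code in memcpy_loop.py: the `build`, `run`, and `debug` callables are hypothetical stand-ins for the chisel build, verilator run, and Claude debug nodes, and the retry bound follows the "debug rounds after the first failure" reading of NUM_DEBUG_ATTEMPTS.

```python
from typing import Callable, Tuple

NUM_DEBUG_ATTEMPTS = 3  # max debug-and-retry rounds after the first failure


def build_run_debug(
    build: Callable[[], Tuple[bool, str]],  # hypothetical: returns (ok, log)
    run: Callable[[], Tuple[bool, str]],    # hypothetical: returns (passed, log)
    debug: Callable[[str], None],           # hypothetical: hands the log to the LLM
    max_debug: int = NUM_DEBUG_ATTEMPTS,
) -> Tuple[bool, int]:
    """Return (passed, attempts_used). Attempt 0 builds the implement node's design."""
    for attempt in range(max_debug + 1):
        ok, log = build()
        if ok:
            ok, log = run()  # only simulate a design that built
        if ok:
            return True, attempt + 1
        if attempt < max_debug:
            debug(log)  # the LLM edits the design; the next iteration rebuilds
    return False, max_debug + 1
```

With the default bound, a design gets at most four build attempts: the initial one plus three debug rounds.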
Components
| File | Role |
|---|---|
| | Main orchestration: parallel test-build + implement, then the build → run → debug loop. Dumps all collateral to |
| | |
| | The implement + debug LLM nodes ( |
| | Run-outcome classification ( |
| | The implement ( |
| | Every tunable knob (loop counts, configs, paths, timeouts, resources). |
| | Minimal cluster: one chisel-build, one verilator, one claude ( |
| | The bare-metal test: issues the two RoCC instructions and checks the copy. |
The accelerator contract
memcpy.c drives the accelerator at opcode custom1 with two
instructions; the implement prompt specifies exactly this so the generated
design matches the test:
- funct == 0: rs1 = source base address, rs2 = destination base address. Latch both; write rd = 1.
- funct == 1: rs1 = array length (number of 64-bit elements). Copy the whole array source → destination via the RoCC memory port; write rd = 1 when done.
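The two-instruction contract can be modelled in software to make the expected behaviour concrete. The following is a minimal behavioural sketch, not the generated Chisel design: the `MemcpyModel` class and its dictionary-backed memory are hypothetical, and the model assumes byte addresses with 8-byte element strides.

```python
class MemcpyModel:
    """Software model of the two-instruction contract (not the Chisel design)."""

    def __init__(self, memory: dict):
        self.memory = memory  # maps byte address -> 64-bit value
        self.src = 0
        self.dst = 0

    def execute(self, funct: int, rs1: int, rs2: int = 0) -> int:
        """Apply one custom1 instruction and return the value written to rd."""
        if funct == 0:  # latch source and destination base addresses
            self.src, self.dst = rs1, rs2
            return 1
        if funct == 1:  # rs1 = number of 64-bit elements to copy
            for i in range(rs1):
                self.memory[self.dst + 8 * i] = self.memory[self.src + 8 * i]
            return 1  # rd = 1 once the copy is complete
        raise ValueError(f"unsupported funct {funct}")
```

A model like this is useful as a golden reference when reading a commit log: after the funct == 1 instruction retires, the destination region should equal the source region element for element.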
Correctness is judged from the MEMCPY Num Correct: N line the program
prints: the run passes iff N == DATA_SIZE (constants.DATA_SIZE, kept in
sync with memcpy.c).
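The pass criterion amounts to a small parsing step over the simulator output. A minimal sketch, assuming the line appears verbatim in stdout; the function names are hypothetical, the `DATA_SIZE` value here is a placeholder rather than the value in constants.py, and taking the last matching line is an assumption.

```python
import re
from typing import Optional

DATA_SIZE = 8  # placeholder; the real value lives in constants.py

_NUM_CORRECT = re.compile(r"MEMCPY Num Correct:\s*(\d+)")


def num_correct(stdout: str) -> Optional[int]:
    """Return N from the last 'MEMCPY Num Correct: N' line, or None if absent."""
    matches = _NUM_CORRECT.findall(stdout)
    return int(matches[-1]) if matches else None


def passed(stdout: str, data_size: int = DATA_SIZE) -> bool:
    """A run passes only if the line is present and N equals data_size."""
    return num_correct(stdout) == data_size
```

A run that hangs or crashes before printing the line yields `None`, which never equals `data_size`, so it is counted as a failure rather than raising.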
Running it
The bundled cluster.yaml brings up three node types — one chisel-build
(chipyard), one verilator (verilator_run), and one claude (llm)
node. The llm node mounts your Claude Code credentials; see the note at the
top of cluster.yaml.
export THIS_MACHINE="your_machine_ip"
chia up examples/memcpy/cluster.yaml
chia job submit -- python $PWD/examples/memcpy/memcpy_loop.py # run from the repo root
chia down examples/memcpy/cluster.yaml
Pass the absolute path to memcpy_loop.py. The driver runs on the cluster head,
where the repo lives, so out/ is written into the real
examples/memcpy/out.
The LLM calls run on the dedicated llm node:
chia.models.claude.ClaudeCodeLLM.prompt() is itself a ChiaFunction, so
the loop dispatches it with llm.prompt.options(resources={"llm": 1.0}).
The loop also threads the session transcript from each call into the next, so
the debugger resumes the implement conversation. (Session persistence for
other backends is in development.)
Tunable parameters
All knobs live in constants.py; container paths and cluster knobs are
MEMCPY_* environment-overridable, so nothing is hardcoded into the loop.
| Constant | Meaning |
|---|---|
| | Max debug-and-retry rounds after the first failure (default 3). |
| | Reserved for a future post-correctness performance-optimization phase (defined, not yet used). |
| | Chisel config to build ( |
| | Element count; must match |
| | Simulation caps so a hung design fails fast. |
| | Ray scheduling tokens, sized so the persistent |
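Environment-overridable knobs of this kind are commonly written as a default plus an `os.environ` lookup. The sketch below illustrates the pattern only: the variable names `MEMCPY_SIM_TIMEOUT_S` and `MEMCPY_OUT_DIR`, the helper `env_int`, and the default values are all hypothetical; only the MEMCPY_ prefix comes from the text above.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer knob from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw not in (None, "") else default


# Hypothetical names and placeholder defaults, for illustration only.
SIM_TIMEOUT_S = env_int("MEMCPY_SIM_TIMEOUT_S", 1800)
OUT_DIR = os.environ.get("MEMCPY_OUT_DIR", "out")
```

Resolving the overrides once at import time keeps the loop itself free of environment lookups.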
Debug feedback
On a failure the debug node (the same Claude session, resumed) receives:
- Build failure: build stderr tail plus stdout windowed on the first error.
- Simulation failure (runtime / timeout / incorrect): the simulator stdout (the commit log) and spike-dasm output tails, plus the last COMMIT_LOG_TAIL_LINES lines of the commit log and the last DUMP_TAIL_LINES lines of memcpy.dump (the test disassembly).
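Both kinds of trimming, keeping a tail and windowing on the first error, are simple line operations. A minimal sketch, assuming logs are plain text; the helper names, the window sizes, and the case-insensitive "error" marker are hypothetical and may differ from what helpers.py actually does.

```python
def tail(text: str, n: int) -> str:
    """Return the last n lines of text (empty string if n <= 0)."""
    if n <= 0:
        return ""
    return "\n".join(text.splitlines()[-n:])


def window_on_first_error(text: str, before: int = 5, after: int = 20,
                          marker: str = "error") -> str:
    """Return the lines surrounding the first line that contains the marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if marker in line.lower():
            return "\n".join(lines[max(0, i - before): i + after + 1])
    # No marker found: fall back to the tail so the prompt is never empty.
    return tail(text, before + after + 1)
```

Windowing on the first error rather than the last keeps the root cause in view, since later compiler errors are often consequences of the first.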
Output and per-iteration diffs
Every node result and piece of collateral is written to out/, each filename
prefixed with the timestamp at the moment it is written (so files sort by when
they were produced):
20260626_144501_implement.md
20260626_144502_test_build_memcpy.riscv
20260626_144502_test_build_memcpy.dump
20260626_144503_chisel_diff_attempt0.diff # chipyard diff for this iteration
20260626_144503_chisel_diff_attempt0.json # per-repo diff dict
20260626_145012_chisel_build_attempt0.stdout.txt
20260626_145230_verilator_run_attempt0.log
20260626_145231_feedback_attempt1.md
20260626_145950_debug_attempt1.md
20260626_153044_summary.json
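The naming scheme in the listing above corresponds to a `%Y%m%d_%H%M%S` timestamp prefix. A minimal sketch of producing such names; the `stamped_path` helper is hypothetical, and the optional `now` argument exists only to make the example deterministic.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def stamped_path(out_dir: Path, name: str, now: Optional[datetime] = None) -> Path:
    """Prefix name with the current timestamp so files sort by creation time."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return out_dir / f"{stamp}_{name}"
```

Because the fields run from year down to second and are zero-padded, a plain lexicographic sort of the directory matches chronological order.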
Each iteration’s Chisel diff is captured with collect_diff (in
helpers.py) just before that attempt’s build, so it reflects the exact
source that was built: the implement node’s work on attempt 0, and the
cumulative implement + debug edits on later attempts.