Autonomous CIRCT Issue Solving
A CHIA case study that uses an LLM-in-the-loop to triage the open-issue backlog
of CIRCT — the MLIR-based hardware compiler
infrastructure — and, for each candidate, drive a sequence of agents through
assess → reproduce → fix → verify → (regression repair) → writeup inside a
real CIRCT checkout. The full flow lives in
chia/examples/circt_issue_solver.
Overview
Large open-source projects are facing a new burden: AI tools have enabled a significant increase in outside contributions to large open source GitHub repositories, including new issue reports and pull requests, but has lowered the overall quality of the contributions. As noted in the LLVM project’s AI tool use policy, when these contributions are done entirely by AI in an unprincipled way and without human review, this “extracts work from [maintainers] in the form of design and code review”. CHIA cannot make a contributor review their AI’s work, but it can force the AI to make improvements in a principled, structured way, that leads naturally to high quality bug fixes and PRs.
circt_issue_solver shows this. We take a multi-step approach to fixing an issue which ensures that fixes are only proposed for real bugs, and only for issues where discussion has converged on a clear solution. Furthermore, we run validation programmatically, including the entire regression test suite, and a reproduction script that confirms that the bug has gone from unfixed to fixed based on the LLMs changes. The pipeline works as follows:
First, the head node triages open issues down to a sample of plausible candidates, then runs one independent pipeline per candidate across a pool of CIRCT containers. Each pipeline:
assesses the issue — is this actually a bug, and are both the bug and the correct behavior clear enough to act on autonomously? If not, it logs the reason and skips;
reproduces it — writes
.circtissues/repro.shwith the contract exit 0 iff the bug is fixed, and skips the issue if it doesn’t actually reproduce on the pinned tree;fixes it — edits
/workspace/circt, rebuilds, reruns the repro, and adds a lit regression test;verifies deterministically — rebuilds, reruns the repro, and runs the full lit gate;
on a regression, gets one repair turn to fix the broken tests without un-fixing the bug; and
writes the PR description it would submit.
The output is local: a candidate diff and the PR writeup, persisted to
issue_logs/issue_<N>/ and a row in a SQLite database (issues.db). A
second flow reads PR review feedback (reviewer comments and failing CI) and
produces an updated diff plus the replies it would post. Crucially, neither
flow writes to GitHub — both only read. A human reviews and submits.
Note
Result
We worked with the maintainers of CIRCT to target our flow at a manageable number of issues. It correctly assessed 16 issues, 5 of which were reproducible bugs with clear solutions. It solved all 5. Of those 5, 2 were not eligible for AI assisted PRs (because they were labelled as “good first issues”). We submitted PRs for the other 3 and they have all been upstreamed. Read section 5 of our arXiv paper to hear more about our results.
How it works
The per-issue pipeline
Triage runs on the head; each surviving candidate is fanned out as one
run_issue_remote task pinned to a single CIRCT container. Within that
container the phases run in sequence, and any phase can short-circuit the
pipeline (not_a_bug / unclear / no_repro) so a fix attempt is only
ever spent on an issue that has cleared the gates before it:
HEAD (triage, read-only GithubIssuesNode)
|
+-- list the open backlog, drop: already attempted, feature requests,
| repro-less issues, and issues with an open PR attached
+-- sample max_issues candidates at random
|
+-- fan out one run_issue_remote per candidate across the circt slots
|
v
CIRCT WORKER (one issue per container, pinned)
|
+-- git reset --hard <tag> + warm incremental build
+-- ASSESS [llm] bug? AND is the correct behavior clear? -> skip: not_a_bug / unclear
+-- REPRODUCE [llm] write .circtissues/repro.sh (exit 0 iff fixed) -> skip: no_repro
+-- FIX [llm] edit /workspace/circt, rebuild, rerun repro, add a lit test
+-- VERIFY [deterministic, no LLM] rebuild, rerun repro, run the full lit gate
+-- REGRESSION [llm] only if repro is green but the suite went red — one repair turn
+-- WRITEUP [llm] the PR description it WOULD submit
|
+-- return result dict -> HEAD persists issue_logs/issue_<N>/ + a row in issues.db
Each LLM phase is a fresh, stateless claude --print session: the context a
later phase needs is inlined into its prompt (the repro.sh into the fix
prompt; the diff and verdict into the writeup), rather than carried via
--resume.
Triage on the head
Triage is a cheap heuristic filter over the open backlog using the read-only
GithubIssuesNode. It lists the most
recent open issues without fetching per-issue comments (the filter only needs
title/body/labels), keeps issues that carry a fenced code block plus either a
tool command or a failure signal (crash/assert/miscompile), and drops feature
requests, already-attempted issues, and anything with an open PR attached. There
is deliberately no label gate — whether an issue is really a bug is left to
the per-issue assess phase, which reads the issue and the source:
def has_repro(issue) -> bool:
"""A self-contained repro: a fenced code block + a tool command or a
failure signal (crash/assert/miscompile)."""
body = issue.body or ""
return "```" in body and (bool(_CMD_RE.search(body)) or bool(_SIGNAL_RE.search(body)))
The qualifying set is shuffled and sampled, and only the chosen survivors are re-fetched with their comments attached — so the expensive per-issue requests are paid only for the handful that will actually run.
The deterministic verify gate
To keep fixes principled, the fix is judged by a verify step that uses no
LLM — programmatic orchestration confirms the change works as well as the agent
claims, so the agent cannot talk its way past it or mark its own work fixed.
Verify captures the diff, rebuilds the tool targets, reruns the
repro (which exits 0 only when the bug is fixed), and runs the whole lit suite
minus the categories that are red on the unmodified image (test/CAPI and
Tools/circt-tblgen need binaries this SDK-only build doesn’t ship). An issue
is only marked fixed when the repro is green and the full lit gate is
green:
def circt_lit_gate_paths() -> list[str]:
"""Every top-level test/<category> except the baseline-red ones, so a
gate failure means a real regression, not a missing-binary artifact."""
...
# status = "fixed" iff repro green AND full lit gate green; else "attempted"
If the repro flips green but the suite goes red, the pipeline fires a single regression-repair turn — the failing tests and their output inlined — to mend the regression without un-fixing the bug, then re-verifies. Because the gate runs the whole suite (not just the touched dialect), it catches collateral damage anywhere in the tree and never vacuously passes.
Distributing work across the cluster
Each per-issue task is a CHIA function that holds one whole circt slot, so
exactly one issue runs per CIRCT container. Within the task, every LLM phase is
dispatched onto an llm worker (1.0 per call, so the cluster’s llm
slots cap prompt concurrency) while the bash / build / lit MCP servers stay on
the CIRCT worker and are reached over HTTP — the heavy CIRCT checkout never moves:
@ChiaFunction(resources={"circt": 1})
def run_issue_remote(issue_md: str, number: int, cfg: dict, ...) -> dict:
...
cli = get(llm.prompt.options(resources={"llm": 1.0})
.chia_remote(llm, prompt, tools))
The flow ships this repo’s chia — plus the head-side circt_util.py
and issue_task.py — to every worker via Ray py_modules, so edits to the
flow reach workers on the next submit with no image rebuild:
_CHIA_PKG = FLOW_DIR.parent.parent / "chia"
_PY_MODULES = [str(FLOW_DIR / "circt_util.py"),
str(FLOW_DIR / "issue_task.py"),
str(_CHIA_PKG)]
...
ray.init(address="auto",
runtime_env={"py_modules": _PY_MODULES, "excludes": _RUNTIME_ENV_EXCLUDES})
The general CIRCT build/test primitives and the
BuildTool / LitTool
MCP wrappers live in the chia package (chia.chipyard.circt), so they ride
along with it.
The LLM in the loop
Each phase runs on an llm worker via
ClaudeCodeLLM, under a shared system prompt that
casts it as a senior MLIR/CIRCT engineer and pins the rules that keep a fix
honest — make the smallest root-cause change, never disable a test or
special-case the repro, and treat LLVM/MLIR (the prebuilt SDK) as out of scope
rather than hacking around it. A fresh session is built per phase:
llm = ClaudeCodeLLM(
model=cfg["model"], system_message=cfg["system_prompt"],
timeout_seconds=cfg["timeouts"][phase],
extra_cli_args=["--effort", "max"],
resume_session=True, projects_cwd=None,
)
Alongside the prompt the agent gets MCP tools that execute inside the CIRCT
container: a BashTool to read, edit, and run
shell commands in /workspace/circt; an async BuildTool that starts a
ninja rebuild and is polled to completion; and an async LitTool that runs a
lit regression set. The async build/lit tools return immediately and are polled,
so a long build can’t stall the transport.
The assess phase is the one that most shapes quality, and it is where the
“principled” commitment starts. It separates “is this a bug?” from “is the
correct behavior clear?” — a crash is an obvious defect, but what the tool
should do instead is often a design decision (e.g. reject the input vs. extend the
code to handle it). The phase is asked to enumerate the materially different
ways a maintainer could reasonably resolve the issue; if more than one is
defensible and nothing in the issue, spec, or docs singles one out, it returns
UNCLEAR and the pipeline skips the issue rather than producing a
confidently-wrong patch. That guard is exactly what caught issue #8508 in the
study (a transform run on an extmodule DUT where “diagnose and refuse” and
“extend to support it” were both defensible — see issue_logs/issue_8508/):
the bug was crystal clear, but the fix was a maintainer’s design call, so the
flow correctly declined to guess.
The review flow
A companion flow (review_loop.py → review_task.py) closes the human loop.
Given a PR:ISSUE pair, it reconstructs the PR’s state on the pinned tree
(re-applying the PR’s current diff fetched from GitHub via
GithubPullsNode), then runs
triage → (if actionable) fix → verify → replies over the reviewer comments
and any failing CI checks. A PR that is simply red in CI — with no human
comments — is enough to trigger a round. It produces an updated diff and the
author replies it would post; as with the issue flow, nothing is written back to
GitHub.
Principled fixes and maintainer friendliness
The two commitments from the overview are not afterthoughts — they are wired into the pipeline and into how the flow was actually run, and together they are what let every submitted PR be upstreamed.
Fixes are principled by construction. The shared system prompt holds the agent to the smallest root-cause change and forbids the shortcuts that make an automated “fix” worthless — or worse, harmful — to a reviewer: disabling or weakening a test, special-casing the repro, or hacking around an LLVM/MLIR root cause it cannot legitimately build. Two gates enforce this rather than trusting the agent’s word:
the assess phase (above) refuses to act unless the bug and its correct resolution are unambiguous, by enumerating the reasonable resolutions and bailing to
UNCLEARwhen more than one stands — so a fix is only ever attempted where there is a single clear answer;the verify phase (above) runs with no LLM in the loop at all: programmatic orchestration rebuilds, reruns the repro, and runs the full regression suite, so a change is only ever called
fixedwhen it provably is.
The human — and the maintainers — stay in charge. Neither flow writes to GitHub; every diff and PR writeup is a proposal. In the study a human reviewed each change in detail and hand-wrote the pull request before anything reached upstream CIRCT. The load on the maintainers was bounded just as deliberately: the flow was run on a small, randomly chosen set of issues — in close coordination with the CIRCT maintainers on both the quantity and the quality of contributions — rather than firing patches at the whole backlog. Issues labelled good first issue were deliberately left for human newcomers, in keeping with the project’s norms and the broader LLVM concern that unprincipled AI contributions mostly extract design- and code-review effort from maintainers. When the submitted PRs drew review feedback, that feedback was addressed with the review flow above — the same principled, human-in-the-loop machinery — before resubmitting. See our arXiv paper for the full study.
Targeting a different repository
The repo is the one thing you change to retarget both flows. In config.py:
GITHUB_REPO = "llvm/circt" # read by both flows
It defaults to llvm/circt; the only supported change is pointing it at a
CIRCT fork (e.g. to review PRs on your own fork) — the rest of the flow still
assumes CIRCT’s build and lit conventions.
Note
The chia-circt image is pinned at firtool-1.148.0. Issues fixed
upstream after that tag won’t reproduce, so the reproduce gate correctly marks
them no_repro and the pipeline moves on. Likewise, a root cause that lives
in LLVM/MLIR (the prebuilt SDK / llvm submodule) is out of scope — only
CIRCT’s own tree is buildable here, so the agent reports such cases instead of
working around them.
Setup
These steps mirror the example’s README.md. Run them from the example
directory (<repo>/chia/examples/circt_issue_solver) unless noted.
1. Head conda env — only the head needs this; workers get chia via Ray
py_modules and run the Docker images named in cluster.yaml:
conda env create -f env.yml
conda activate circtissues
2. Cluster — cluster.yaml is single-machine by default: the head plus
four containers on one host, read from CHIA_HEAD. The required worker
counts (min_workers / max_workers):
Node type |
Workers |
Role |
|---|---|---|
|
2 |
runs the assessing / fixing / reviewing LLM (light); image
|
|
2 |
owns a |
Each circt_llm container bind-mounts your Claude Code config into the
container. If your credentials live somewhere other than ~/.claude, edit the
mount in cluster.yaml to match:
run_options:
- "-v ~/.claude:/home/ray/.claude" # mount your Claude config into the container
Scale up by raising the per-type and cluster-wide min/max_workers and adding
IPs to compatible_ips. The default ports (GCS 6379, dashboard 8265) mean only
one CHIA cluster should be up per host at a time. See
Cluster Configuration Reference for the full schema.
3. GitHub token — both flows read issues/PRs through an authenticated client. Set a token with read access to the repo (if using public CIRCT this step can be skipped):
export GITHUB_TOKEN=... # read access to GITHUB_REPO
export CHIA_HEAD=$(hostname) # host to bring the cluster up on
4. Config — in config.py, leave GITHUB_REPO at llvm/circt or
point it at a CIRCT fork.
5. Bring up the cluster — from the example directory with the env active:
chia up cluster.yaml
The first run pulls the chia-circt image, which is large; on slow links the
pull may time out, so raise the pull timeout and retry.
6. Run the issue flow — fix_issues_submit.sh wraps chia job submit so
the driver’s logs land in the Ray dashboard (http://localhost:8265) and in
chia job logs <id>:
./fix_issues_submit.sh --max-issues 2 # triage + attempt 2 candidates
./fix_issues_submit.sh --issue 10568 # one specific issue, skip triage
NO_WAIT=1 ./fix_issues_submit.sh --max-issues 5 # detach; watch the dashboard
7. (Optional) Review flow — feed it a PR number paired with its issue number; the PR’s current diff and feedback are fetched from GitHub:
./review_submit.sh --pr 10648:7388
8. Tear down — when the run is done:
chia down cluster.yaml
Outputs land in issue_logs/issue_<N>/: the candidate fix.diff, the
pr_writeup.md it would submit, a verdict.json (status, repro/build/lit
results, diff counts), the saved repro, and per-phase LLM transcripts
(llm_<phase>.md / .jsonl) — plus one row per attempt in issues.db.
Review-flow outputs land under review_logs/issue_<N>_pr_<M>/ (the updated
diff and the replies it would post). Re-running skips issues already recorded in
issues.db.