Autonomous CIRCT Issue Solving
==============================

A CHIA case study that uses an LLM-in-the-loop to triage the open-issue backlog of `CIRCT `_ (the MLIR-based hardware compiler infrastructure) and, for each candidate, drive a sequence of agents through **assess → reproduce → fix → verify → (regression repair) → writeup** inside a real CIRCT checkout. The full flow lives in ``chia/examples/circt_issue_solver``.

Overview
--------

Large open-source projects are facing a new burden: AI tools have enabled a significant increase in outside contributions to large open-source GitHub repositories, including new issue reports and pull requests, but have lowered the overall quality of those contributions. As noted in the LLVM project's AI tool use policy, when these contributions are produced entirely by AI in an unprincipled way and without human review, this "extracts work from [maintainers] in the form of design and code review".

CHIA cannot make a contributor review their AI's work, but it can force the AI to make improvements in a principled, structured way that leads naturally to high-quality bug fixes and PRs. ``circt_issue_solver`` shows this. We take a multi-step approach to fixing an issue, which ensures that fixes are proposed only for real bugs, and only for issues where discussion has converged on a clear solution. Furthermore, we run validation programmatically, including the entire regression test suite and a reproduction script that confirms the bug has gone from unfixed to fixed as a result of the LLM's changes.

The pipeline works as follows. First, the head node **triages** open issues down to a sample of plausible candidates, then runs one independent pipeline per candidate across a pool of CIRCT containers. Each pipeline:

#. **assesses** the issue: is this actually a bug, and are *both* the bug and the correct behavior clear enough to act on autonomously? If not, it logs the reason and skips;
#. **reproduces** it: writes ``.circtissues/repro.sh`` with the contract *exit 0 iff the bug is fixed*, and skips the issue if it doesn't actually reproduce on the pinned tree;
#. **fixes** it: edits ``/workspace/circt``, rebuilds, reruns the repro, and adds a lit regression test;
#. **verifies** deterministically: rebuilds, reruns the repro, and runs the full lit gate;
#. on a regression, gets **one repair turn** to fix the broken tests without un-fixing the bug; and
#. writes the **PR description it would submit**.

The output is local: a candidate diff and the PR writeup, persisted to ``issue_logs/issue_/`` and a row in a SQLite database (``issues.db``). A second flow reads **PR review feedback** (reviewer comments *and* failing CI) and produces an updated diff plus the replies it would post. Crucially, **neither flow writes to GitHub; both only read.** A human reviews and submits.

.. note::

   **Result**

   We worked with the maintainers of CIRCT to target our flow at a manageable number of issues. It correctly assessed 16 issues, 5 of which were reproducible bugs with clear solutions. It solved all 5. Of those 5, 2 were not eligible for AI-assisted PRs (because they were labelled "good first issue"). We submitted PRs for the other 3, and all of them have been upstreamed. Read section 5 of our `arXiv paper `_ for more on our results.

How it works
------------

The per-issue pipeline
~~~~~~~~~~~~~~~~~~~~~~~~

Triage runs on the head; each surviving candidate is fanned out as one ``run_issue_remote`` task pinned to a single CIRCT container.
Within that container the phases run in sequence, and any phase can short-circuit the pipeline (``not_a_bug`` / ``unclear`` / ``no_repro``) so a fix attempt is only ever spent on an issue that has cleared the gates before it::

   HEAD (triage, read-only GithubIssuesNode)
     |
     +-- list the open backlog, drop: already attempted, feature requests,
     |   repro-less issues, and issues with an open PR attached
     +-- sample max_issues candidates at random
     |
     +-- fan out one run_issue_remote per candidate across the circt slots
     |
     v
   CIRCT WORKER (one issue per container, pinned)
     |
     +-- git reset --hard + warm incremental build
     +-- ASSESS      [llm] bug? AND is the correct behavior clear?        -> skip: not_a_bug / unclear
     +-- REPRODUCE   [llm] write .circtissues/repro.sh (exit 0 iff fixed) -> skip: no_repro
     +-- FIX         [llm] edit /workspace/circt, rebuild, rerun repro, add a lit test
     +-- VERIFY      [deterministic, no LLM] rebuild, rerun repro, run the full lit gate
     +-- REGRESSION  [llm] only if repro is green but the suite went red: one repair turn
     +-- WRITEUP     [llm] the PR description it WOULD submit
     |
     +-- return result dict -> HEAD persists issue_logs/issue_/ + a row in issues.db

Each LLM phase is a fresh, stateless ``claude --print`` session: the context a later phase needs is inlined into its prompt (the ``repro.sh`` into the fix prompt; the diff and verdict into the writeup), rather than carried via ``--resume``.

Triage on the head
~~~~~~~~~~~~~~~~~~~

Triage is a cheap heuristic filter over the open backlog using the read-only :class:`~chia.github.github_issues_node.GithubIssuesNode`. It lists the most recent open issues *without* fetching per-issue comments (the filter only needs title/body/labels), keeps issues that carry a fenced code block plus either a tool command or a failure signal (crash/assert/miscompile), and drops feature requests, already-attempted issues, and anything with an open PR attached.
There is deliberately **no label gate**: whether an issue is really a bug is left to the per-issue assess phase, which reads the issue *and* the source:

.. code-block:: python

   # _CMD_RE and _SIGNAL_RE are module-level compiled regexes defined elsewhere in the flow.
   def has_repro(issue) -> bool:
       """A self-contained repro: a fenced code block + a tool command
       or a failure signal (crash/assert/miscompile)."""
       body = issue.body or ""
       return "```" in body and (bool(_CMD_RE.search(body))
                                 or bool(_SIGNAL_RE.search(body)))

The qualifying set is shuffled and sampled, and only the chosen survivors are re-fetched with their comments attached, so the expensive per-issue requests are paid only for the handful that will actually run.

The deterministic verify gate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To keep fixes principled, the fix is judged by a **verify step that uses no LLM**: programmatic orchestration confirms the change works as well as the agent claims, so the agent cannot talk its way past it or mark its own work ``fixed``. Verify captures the diff, rebuilds the tool targets, reruns the repro (which exits 0 only when the bug is fixed), and runs the whole lit suite minus the categories that are red on the unmodified image (``test/CAPI`` and ``Tools/circt-tblgen`` need binaries this SDK-only build doesn't ship). An issue is only marked ``fixed`` when the repro is green **and** the full lit gate is green:

.. code-block:: python

   def circt_lit_gate_paths() -> list[str]:
       """Every top-level test/ except the baseline-red ones, so a gate
       failure means a real regression, not a missing-binary artifact."""
       ...

   # status = "fixed" iff repro green AND full lit gate green; else "attempted"

If the repro flips green but the suite goes red, the pipeline fires a single **regression-repair** turn, with the failing tests and their output inlined, to mend the regression without un-fixing the bug, then re-verifies. Because the gate runs the whole suite (not just the touched dialect), it catches collateral damage anywhere in the tree and never vacuously passes.
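
The decision rule at the end of verify can be sketched as follows. This is a minimal illustration, not the flow's actual code: ``VerifyResult``, ``classify``, and ``needs_regression_repair`` are hypothetical names, and only the ``fixed`` / ``attempted`` statuses and the "repro green and full lit gate green" rule come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VerifyResult:
    """Outcome of one deterministic verify pass (hypothetical container)."""
    build_ok: bool                 # the tool targets rebuilt cleanly
    repro_exit: int                # exit code of repro.sh; 0 means the bug is fixed
    lit_failures: Tuple[str, ...]  # failing tests reported by the full lit gate


def classify(result: VerifyResult) -> str:
    """Return "fixed" only when the repro is green AND the lit gate is green."""
    repro_green = result.build_ok and result.repro_exit == 0
    lit_green = result.build_ok and not result.lit_failures
    return "fixed" if repro_green and lit_green else "attempted"


def needs_regression_repair(result: VerifyResult) -> bool:
    """The single repair turn applies only when the repro is green but the suite is red."""
    return result.build_ok and result.repro_exit == 0 and bool(result.lit_failures)
```

Keeping the rule a pure function of recorded outcomes means the status depends only on exit codes and test results, never on what the agent reports about its own work.
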

Distributing work across the cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each per-issue task is a CHIA function that holds one whole ``circt`` slot, so exactly one issue runs per CIRCT container. Within the task, every LLM phase is dispatched onto an ``llm`` worker (``1.0`` per call, so the cluster's ``llm`` slots cap prompt concurrency) while the bash / build / lit MCP servers stay on the CIRCT worker and are reached over HTTP, so the heavy CIRCT checkout never moves:

.. code-block:: python

   @ChiaFunction(resources={"circt": 1})
   def run_issue_remote(issue_md: str, number: int, cfg: dict, ...) -> dict:
       ...
       cli = get(llm.prompt.options(resources={"llm": 1.0})
                 .chia_remote(llm, prompt, tools))

The flow ships **this repo's** ``chia``, plus the head-side ``circt_util.py`` and ``issue_task.py``, to every worker via Ray ``py_modules``, so edits to the flow reach workers on the next submit with no image rebuild:

.. code-block:: python

   _CHIA_PKG = FLOW_DIR.parent.parent / "chia"
   _PY_MODULES = [str(FLOW_DIR / "circt_util.py"),
                  str(FLOW_DIR / "issue_task.py"),
                  str(_CHIA_PKG)]
   ...
   ray.init(address="auto",
            runtime_env={"py_modules": _PY_MODULES,
                         "excludes": _RUNTIME_ENV_EXCLUDES})

The general CIRCT build/test primitives and the :class:`~chia.chipyard.circt.BuildTool` / :class:`~chia.chipyard.circt.LitTool` MCP wrappers live in the ``chia`` package (``chia.chipyard.circt``), so they ride along with it.

The LLM in the loop
~~~~~~~~~~~~~~~~~~~~

Each phase runs on an ``llm`` worker via :class:`~chia.models.claude.ClaudeCodeLLM`, under a shared system prompt that casts it as a senior MLIR/CIRCT engineer and pins the rules that keep a fix honest: make the smallest root-cause change, never disable a test or special-case the repro, and treat LLVM/MLIR (the prebuilt SDK) as out of scope rather than hacking around it. A fresh session is built per phase:

.. code-block:: python

   llm = ClaudeCodeLLM(
       model=cfg["model"],
       system_message=cfg["system_prompt"],
       timeout_seconds=cfg["timeouts"][phase],
       extra_cli_args=["--effort", "max"],
       resume_session=False,  # fresh, stateless session per phase (no --resume)
       projects_cwd=None,
   )

Alongside the prompt the agent gets MCP tools that execute inside the CIRCT container: a :class:`~chia.base.tools.BashTool.BashTool` to read, edit, and run shell commands in ``/workspace/circt``; an async ``BuildTool`` that starts a ninja rebuild and is polled to completion; and an async ``LitTool`` that runs a lit regression set. The async build/lit tools return immediately and are polled, so a long build can't stall the transport.

The assess phase is the one that most shapes quality, and it is where the "principled" commitment starts. It separates "is this a bug?" from "is the correct behavior clear?": a crash is an obvious defect, but what the tool *should* do instead is often a design decision (e.g. reject the input vs. extend the code to handle it). The phase is asked to **enumerate the materially different ways a maintainer could reasonably resolve the issue**; if more than one is defensible and nothing in the issue, spec, or docs singles one out, it returns ``UNCLEAR`` and the pipeline skips the issue rather than producing a confidently wrong patch. That guard is exactly what caught issue #8508 in the study (a transform run on an ``extmodule`` DUT where "diagnose and refuse" and "extend to support it" were both defensible; see ``issue_logs/issue_8508/``): the bug was crystal clear, but the *fix* was a maintainer's design call, so the flow correctly declined to guess.

The review flow
~~~~~~~~~~~~~~~

A companion flow (``review_loop.py`` → ``review_task.py``) closes the human loop.
Given a ``PR:ISSUE`` pair, it reconstructs the PR's state on the pinned tree (re-applying the PR's current diff fetched from GitHub via :class:`~chia.github.github_pulls_node.GithubPullsNode`), then runs **triage → (if actionable) fix → verify → replies** over the reviewer comments *and* any failing CI checks. A PR that is simply red in CI, with no human comments, is enough to trigger a round. It produces an updated diff and the author replies it would post; as with the issue flow, nothing is written back to GitHub.

Principled fixes and maintainer friendliness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The two commitments from the overview are not afterthoughts: they are wired into the pipeline and into how the flow was actually run, and together they are what let every submitted PR be upstreamed.

**Fixes are principled by construction.** The shared system prompt holds the agent to the smallest root-cause change and forbids the shortcuts that make an automated "fix" worthless (or worse, harmful) to a reviewer: disabling or weakening a test, special-casing the repro, or hacking around an LLVM/MLIR root cause it cannot legitimately build. Two gates enforce this rather than trusting the agent's word:

- the **assess phase** (above) refuses to act unless the bug *and* its correct resolution are unambiguous, by enumerating the reasonable resolutions and bailing to ``UNCLEAR`` when more than one stands, so a fix is only ever attempted where there is a single clear answer;
- the **verify phase** (above) runs with no LLM in the loop at all: programmatic orchestration rebuilds, reruns the repro, and runs the full regression suite, so a change is only ever called ``fixed`` when it provably is.

**The human, and the maintainers, stay in charge.** Neither flow writes to GitHub; every diff and PR writeup is a *proposal*. In the study a human reviewed each change in detail and hand-wrote the pull request before anything reached upstream CIRCT.
The load on the maintainers was bounded just as deliberately: the flow was run on a small, randomly chosen set of issues, in close coordination with the CIRCT maintainers on both the *quantity* and the *quality* of contributions, rather than firing patches at the whole backlog. Issues labelled *good first issue* were deliberately left for human newcomers, in keeping with the project's norms and the broader LLVM concern that unprincipled AI contributions mostly extract design- and code-review effort from maintainers. When the submitted PRs drew review feedback, that feedback was addressed with the **review flow** above (the same principled, human-in-the-loop machinery) before resubmitting. See our `arXiv paper `__ for the full study.

Targeting a different repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The repo is the one thing you change to retarget both flows. In ``config.py``:

.. code-block:: python

   GITHUB_REPO = "llvm/circt"  # read by both flows

It defaults to ``llvm/circt``; the only supported change is pointing it at a CIRCT **fork** (e.g. to review PRs on your own fork). The rest of the flow still assumes CIRCT's build and lit conventions.

.. note::

   The ``chia-circt`` image is pinned at **firtool-1.148.0**. Issues fixed upstream after that tag won't reproduce, so the reproduce gate correctly marks them ``no_repro`` and the pipeline moves on. Likewise, a root cause that lives in LLVM/MLIR (the prebuilt SDK / ``llvm`` submodule) is out of scope: only CIRCT's own tree is buildable here, so the agent reports such cases instead of working around them.

Setup
-----

These steps mirror the example's ``README.md``. Run them from the example directory (``/chia/examples/circt_issue_solver``) unless noted.

**1. Head conda env**: only the head needs this; workers get ``chia`` via Ray ``py_modules`` and run the Docker images named in ``cluster.yaml``:

.. code-block:: bash

   conda env create -f env.yml
   conda activate circtissues

**2. Cluster**: ``cluster.yaml`` is single-machine by default: the head plus four containers on one host, read from ``CHIA_HEAD``. The required worker counts (``min_workers`` / ``max_workers``):

.. list-table::
   :header-rows: 1
   :widths: 22 12 66

   * - Node type
     - Workers
     - Role
   * - ``circt_llm``
     - 2
     - runs the assessing / fixing / reviewing LLM (light); image ``chia-claude-code``
   * - ``circt_worker``
     - 2
     - owns a ``/workspace/circt`` checkout, builds + runs lit (heavy); image ``chia-circt``

Each ``circt_llm`` container bind-mounts your Claude Code config into the container. If your credentials live somewhere other than ``~/.claude``, edit the mount in ``cluster.yaml`` to match:

.. code-block:: yaml

   run_options:
     - "-v ~/.claude:/home/ray/.claude"  # mount your Claude config into the container

Scale up by raising the per-type and cluster-wide ``min/max_workers`` and adding IPs to ``compatible_ips``. The default ports (GCS 6379, dashboard 8265) mean only one CHIA cluster should be up per host at a time. See :doc:`/user_guides/cluster_config_reference` for the full schema.

**3. GitHub token**: both flows read issues/PRs through an authenticated client. Set a token with read access to the repo (if using public CIRCT this step can be skipped):

.. code-block:: bash

   export GITHUB_TOKEN=...        # read access to GITHUB_REPO
   export CHIA_HEAD=$(hostname)   # host to bring the cluster up on

**4. Config**: in ``config.py``, leave ``GITHUB_REPO`` at ``llvm/circt`` or point it at a CIRCT fork.

**5. Bring up the cluster**: from the example directory with the env active:

.. code-block:: bash

   chia up cluster.yaml

The first run pulls the ``chia-circt`` image, which is large; on slow links the pull may time out, so raise the pull timeout and retry.

**6. Run the issue flow**: ``fix_issues_submit.sh`` wraps ``chia job submit`` so the driver's logs land in the Ray dashboard (``http://localhost:8265``) and in ``chia job logs ``:

.. code-block:: bash

   ./fix_issues_submit.sh --max-issues 2            # triage + attempt 2 candidates
   ./fix_issues_submit.sh --issue 10568             # one specific issue, skip triage
   NO_WAIT=1 ./fix_issues_submit.sh --max-issues 5  # detach; watch the dashboard

**7. (Optional) Review flow**: feed it a PR number paired with its issue number; the PR's current diff and feedback are fetched from GitHub:

.. code-block:: bash

   ./review_submit.sh --pr 10648:7388

**8. Tear down**: when the run is done:

.. code-block:: bash

   chia down cluster.yaml

Outputs land in ``issue_logs/issue_/``: the candidate ``fix.diff``, the ``pr_writeup.md`` it would submit, a ``verdict.json`` (status, repro/build/lit results, diff counts), the saved repro, and per-phase LLM transcripts (``llm_.md`` / ``.jsonl``), plus one row per attempt in ``issues.db``. Review-flow outputs land under ``review_logs/issue__pr_/`` (the updated diff and the replies it would post). Re-running skips issues already recorded in ``issues.db``.
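
The skip-on-rerun behavior can be pictured with a small ``sqlite3`` sketch. The table layout used here (an ``issues`` table with ``number`` and ``status`` columns) and the helper names ``already_attempted`` and ``filter_new`` are assumptions made for illustration; the real schema of ``issues.db`` is not shown in this document.

```python
import sqlite3


def already_attempted(conn: sqlite3.Connection) -> set:
    """Issue numbers that already have a row (hypothetical `issues` table)."""
    return {row[0] for row in conn.execute("SELECT number FROM issues")}


def filter_new(candidates, conn: sqlite3.Connection) -> list:
    """Keep only candidates with no recorded attempt, preserving their order."""
    seen = already_attempted(conn)
    return [number for number in candidates if number not in seen]


# Demo against an in-memory database standing in for issues.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (number INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO issues VALUES (?, ?)", (8508, "unclear"))
conn.commit()

remaining = filter_new([8508, 10568], conn)
```

Under these assumptions, an issue recorded on an earlier run is dropped before any container time is spent on it, whatever status that earlier attempt ended with.
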