Architecture Overview

A CHIA project is a composition of two pieces: a flow and a cluster.

A flow is an orchestration script that defines a pipeline of tasks: what to do.
A cluster is the collection of computers those tasks are scheduled onto: where and how they run.

CHIA flows

A flow can be described as a graph: the nodes are tasks (functions) and the edges are the data and control flow between them. CHIA treats both programmatic and agentic edges as first-class primitives, and any function can be driven either way.

Nodes

A node is a Python function, and any Python function can be a node. Each node is tagged with the resources it needs from the worker it runs on. A node is held until (a) all of its inputs are ready and (b) a logical worker is available whose resources meet or exceed the node’s demands. Once it can be scheduled, its arguments are serialized and sent to that worker to execute.
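The two scheduling conditions can be sketched in plain Python. This is a minimal illustration, not CHIA's scheduler: the Worker class, fits, and pick_worker are hypothetical names, and the sketch assumes a worker qualifies when it has at least the requested amount of every resource.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a logical worker and its free resources."""
    name: str
    free: dict[str, float] = field(default_factory=dict)


def fits(demands: dict[str, float], worker: Worker) -> bool:
    """True when the worker has at least the demanded amount of every resource."""
    return all(worker.free.get(res, 0) >= amount for res, amount in demands.items())


def pick_worker(demands, inputs_ready, workers):
    """Return the first worker that can take the node, or None if it must wait."""
    if not inputs_ready:        # condition (a): all inputs are ready
        return None
    for worker in workers:      # condition (b): a worker with enough resources
        if fits(demands, worker):
            return worker
    return None


pool_of_workers = [
    Worker("cpu-0", {"CPU": 4}),
    Worker("fpga-0", {"CPU": 8, "FPGA": 1}),
]

# A node that needs an FPGA only fits the second worker.
print(pick_worker({"CPU": 2, "FPGA": 1}, True, pool_of_workers).name)  # fpga-0
# The same node is held while its inputs are not ready.
print(pick_worker({"CPU": 2, "FPGA": 1}, False, pool_of_workers))      # None
```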

Programmatic edges

A node meant to run programmatically is annotated with @ChiaFunction(resources=...) and dispatched with fn.chia_remote(args). The call is asynchronous: it returns control to the caller immediately and hands back a reference that you can later wait on and collect with get(), or pass directly as an argument to another node (an explicit edge). Because the caller never blocks on dispatch, many nodes can be in flight at once. A @ChiaFunction called directly, without chia_remote, just runs in the caller’s own process and isn’t a node.
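The dispatch-then-collect pattern can be modeled with Python's standard concurrent.futures module. This is only an analogy under stated assumptions: the remote helper below is hypothetical and runs work on local threads, whereas chia_remote sends work to cluster workers; Future.result() plays the role that get() plays in CHIA.

```python
from concurrent.futures import Future, ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)


def remote(fn, *args):
    """Submit fn without blocking the caller (hypothetical helper).

    Any argument that is itself a Future is resolved before fn runs,
    which models passing one node's reference into another node.
    """
    def run():
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return fn(*resolved)
    return pool.submit(run)


def simulate(n):
    return n * n


def total(a, b):
    return a + b


x = remote(simulate, 3)   # returns immediately with a reference
y = remote(simulate, 4)   # runs concurrently with x
z = remote(total, x, y)   # explicit edges: z depends on x and y

print(z.result())  # 25
```

Calling simulate(3) directly, without the helper, would simply run in the caller's thread, which mirrors the distinction drawn above between a dispatched node and an ordinary function call.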

Agentic edges with tools

A node can also be exposed to an agent as an MCP tool, registered on a ChiaTool object via ChiaTool.mcp.add_tool(...). The ChiaTool stands up an MCP server that hosts its registered tools; the server itself runs on a worker chosen by the task_options given when it is created. By default a registered function runs on the worker hosting the tool server, but registering ChiaTool.mcp.add_tool(fn.chia_remote_blocking) lets the tool function execute remotely according to its own resources — so a tool server can live on one worker while its tools run on another. Every LLM/agent node exposes a query method that takes a list of ChiaTools to offer the model, and each tool’s docstring becomes the description the model sees.
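To make the docstring-as-description idea concrete, here is a small stand-in written with the standard library only. ToolRegistry is a hypothetical class for illustration; it is not the ChiaTool or FastMCP API, and it skips the MCP server entirely. It only shows how a registered function's docstring can become the text a model sees.

```python
import inspect


class ToolRegistry:
    """Hypothetical stand-in for a tool server's registry."""

    def __init__(self):
        self.tools = {}

    def add_tool(self, fn):
        # The docstring is what the model would read to decide when to call the tool.
        self.tools[fn.__name__] = {"fn": fn, "description": inspect.getdoc(fn) or ""}

    def describe(self):
        """Return the name-to-description mapping that would be offered to a model."""
        return {name: tool["description"] for name, tool in self.tools.items()}

    def call(self, name, **kwargs):
        """Invoke a registered tool by name, as an agent's tool call would."""
        return self.tools[name]["fn"](**kwargs)


def count_lines(text: str) -> int:
    """Count the number of lines in a block of text."""
    return len(text.splitlines())


registry = ToolRegistry()
registry.add_tool(count_lines)

print(registry.describe())
print(registry.call("count_lines", text="a\nb\nc"))  # 3
```

A clear, specific docstring matters here because it is the only guidance the model receives about what the tool does.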

CHIA clusters

A flow needs diverse infrastructure working in tandem: different nodes need different physical hardware (FireSim needs FPGAs) and different software environments (dependencies, credentials, isolation). The cluster provides this.

Physical machines and logical workers

You start with a set of physical machines — these can be heterogeneous, with different operating systems, memory capacity, accelerators, etc. On top of them you define logical workers that expose virtualized hardware resources (CPU cores, GPUs, accelerators, FPGAs) and software resources (dependencies, credentials, isolation). An LLM worker, for instance, carries the dependencies for an agentic query (e.g. a Claude Code CLI and the provider’s credentials); a FireSim-runner worker would expose an FPGA resource and only map onto machines that have one. Workers are mapped onto machines when the cluster comes online — multiple workers can share a machine, and CHIA supports several allocation strategies for load balancing.
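As a rough illustration of mapping workers onto machines, the sketch below uses a simple first-fit rule. This is an assumption made for the example, not one of CHIA's documented allocation strategies, and the machine names, worker names, and resource counts are invented.

```python
def assign_workers(machines, workers):
    """Place each worker on the first machine with enough remaining resources.

    machines: {machine_name: {resource: capacity}}
    workers:  {worker_name: {resource: amount_needed}}
    Returns {worker_name: machine_name or None}.
    """
    remaining = {name: dict(res) for name, res in machines.items()}
    placement = {}
    for worker, needs in workers.items():
        placement[worker] = None
        for machine, free in remaining.items():
            if all(free.get(r, 0) >= n for r, n in needs.items()):
                for r, n in needs.items():
                    free[r] -= n          # reserve the resources on that machine
                placement[worker] = machine
                break
    return placement


machines = {"desktop": {"CPU": 8}, "fpga-box": {"CPU": 16, "FPGA": 1}}
logical_workers = {
    "llm": {"CPU": 2},
    "firesim-runner": {"CPU": 4, "FPGA": 1},   # only fits a machine with an FPGA
    "build": {"CPU": 8},
}

print(assign_workers(machines, logical_workers))
# {'llm': 'desktop', 'firesim-runner': 'fpga-box', 'build': 'fpga-box'}
```

In this run two workers end up sharing fpga-box, and the FPGA-bound worker is never placed on the machine without one.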

Containerization

Logical workers can run inside containers. This gives isolation between workers, front-loads environment setup, and makes clusters portable.

Cloud integration

Public-cloud machines can be folded directly into a cluster to add compute on demand. Spanning owned (on-prem) and borrowed (cloud) resources makes for cost-effective, efficient clusters. In firewalled environments, CHIA uses SSH reverse tunneling to connect local and cloud machines, with some small limitations on orchestration.

The cluster config and CLI

A cluster is specified in a YAML file: available physical machines under a provider section, logical-worker types (with their exposed resources and container images) under available_node_types, and any cloud instances under a provider section such as aws_nodes. Bring it up with chia up <config>.yaml — which spawns any cloud nodes, assigns workers to machines, and launches the workers — and tear it down with chia down <config>.yaml. See the CLI Reference.

Fault tolerance & reproducibility

CHIA is built for long, expensive runs:

- Worker / machine failure is detected; no new work is scheduled there, and any tasks that were running are automatically re-queued onto another worker with the required resources. Clusters can grow at runtime: chia up --add adds resources to a live cluster (and reintegrates recovered nodes) with no downtime.
- Process-leak prevention: CHIA tracks the processes a node spawns and stops them when the node or flow is stopped or cancelled.
- Caching & bypass: a flow can reuse a node’s cached result instead of recomputing it, so it can restart quickly from anywhere after a crash and bypass nondeterministic nodes.
- Profiling & visualization record what ran where and for how long.
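The caching idea can be sketched with a small on-disk cache. This is a generic illustration under stated assumptions, not CHIA's caching mechanism: the cached_node decorator is a hypothetical name, and results are keyed by a hash of the function name and its arguments.

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path


def cached_node(cache_dir):
    """Reuse a stored result when the same function is called with the same arguments."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            key_bytes = pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
            path = cache_dir / (hashlib.sha256(key_bytes).hexdigest() + ".pkl")
            if path.exists():                      # cache hit: skip the computation
                return pickle.loads(path.read_bytes())
            result = fn(*args, **kwargs)           # cache miss: compute and store
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


calls = []


@cached_node(tempfile.mkdtemp())
def expensive(n):
    calls.append(n)          # records each real execution
    return n * 2


first = expensive(21)
second = expensive(21)       # served from disk, so the body does not run again
print(first, second, len(calls))  # 42 42 1
```

Because the result lives on disk, a restarted flow could skip work it has already done, and a nondeterministic step could be pinned to the value it produced the first time.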

Infrastructure

CHIA curates a set of existing tools into a single fabric. The Ray distributed-computing platform is the substrate beneath CHIA’s flows and clusters — providing scheduling, distributed execution, fault tolerance, and data collection. Ray was chosen over alternatives like LangGraph, the Microsoft Agent Framework, and Apache Airflow for its expressive control-flow semantics, fine-grained flexible scheduling, and distributed execution and fault tolerance. Around it, CHIA uses Docker for containerization, FastMCP for agent tools, Boto3 and the Google Cloud client libraries for cloud integration, and TensorBoard, Weights & Biases, and GraphViz for profiling and visualization.

A particular strength is how cleanly CHIA fits diverse components together. CHIA nodes already exist for hardware-design tools (Chipyard, FireSim, Hammer, CIRCT, gem5, ChampSim, and Verilator); for a wide range of LLM providers and local model serving (AWS Bedrock, Google Vertex, OpenAI and Anthropic APIs, Fireworks, Groq, OpenRouter, Ollama, vLLM); for agent CLIs (Claude Code, OpenAI Codex, Opencode, Google Antigravity); and for supporting tasks like maintaining relational databases, compiling software, and collating GitHub issues.