Cluster Configuration Reference
A CHIA cluster is described by a single YAML file that you pass to chia up
and chia down. This page provides a complete reference for every key CHIA reads,
a worked example that mixes on-premise and cloud machines, and a
walk-through of the exact order in which CHIA runs your commands when it brings
a cluster up and tears it down.
Note
CHIA’s YAML deliberately resembles the Ray cluster launcher
config so that existing Ray configs feel familiar, and it adds support for
heterogeneous on-premise setups and for clusters split across on-premise and
cloud providers. CHIA does not currently support the following Ray autoscaler
keys: min_workers / max_workers, upscaling_speed,
idle_timeout_minutes, cluster_synced_files,
file_mounts_sync_continuously, provider.type,
provider.external_head_ip, and provider.coordinator_address.
Top-level structure
At the top level a config is organized into a handful of sections:
cluster_name: MyCluster # identifier for this cluster
provider: # head machine (required)
  head_ip: ...
auth: # how to SSH into the machines
  ssh_user: ...
  ssh_private_key: ...
available_node_types: # logical worker types + their resources
  my_worker:
    ...
aws_nodes: # optional: provision EC2 instances
  ...
gcp_nodes: # optional: provision GCP instances
  ...
tunnel_defaults: # optional: tunnel/port tuning for cloud nodes
  ...
# lifecycle command hooks (see "Command execution order" below)
initialization_commands: [...]
head_env_commands: [...]
setup_commands: [...]
head_setup_commands: [...]
head_teardown_commands: [...]
head_start_ray_commands: [...]
worker_start_ray_commands: [...]
# file syncing
file_mounts: {...}
rsync_exclude: [...]
rsync_filter: [...]
Any ${VAR} reference in a string value is expanded from your environment
when the config is loaded (e.g. ${USER}). A bare $VAR is left as-is so
it can be evaluated later on the remote shell.
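The expansion rule above can be sketched in a few lines of Python. This is an illustration rather than CHIA's actual loader: the function name expand_braced is hypothetical, and leaving an unset variable untouched is an assumption, since the behavior for unset variables is not specified here.

```python
import os
import re

# Matches ${NAME} only; a bare $NAME has no braces and is never matched.
_BRACED = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_braced(value, env=None):
    """Expand ${VAR} references from env, leaving bare $VAR untouched."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        # Assumption: an unset variable is left as written instead of raising.
        return env.get(name, match.group(0))

    return _BRACED.sub(_substitute, value)


example_env = {"USER": "alice", "RAY_HEAD_IP": "10.0.0.1"}
print(expand_braced("/home/${USER}/.ssh/${USER}", example_env))
print(expand_braced("ray start --address=$RAY_HEAD_IP:6379", example_env))
```

The second call returns its input unchanged, which is what lets $RAY_HEAD_IP survive until the remote shell evaluates it.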
Top-level keys
| Key | Default | Meaning |
|---|---|---|
| cluster_name | | Identifier for the head and workers of this cluster. |
| provider | required | Cluster head node. See provider. |
| auth | | SSH credentials for reaching the machines. See auth. |
| available_node_types | | Logical worker types, their Ray resources, and container images. See available_node_types. |
| initialization_commands | | Commands run first, each in its own SSH session, on the host (outside any container). They do not share an environment with each other or with later steps. |
| setup_commands | | Global setup commands run inside the main script session on every node (head and workers). For containerized workers they run inside the container. |
| head_env_commands | | Environment activation prepended to the head’s main script on both chia up and chia down. |
| head_setup_commands | | Head-only one-time setup, run during chia up. |
| head_teardown_commands | | Head-only commands run during chia down. |
| head_start_ray_commands | | Commands that start Ray on the head (typically ray stop followed by ray start --head). |
| worker_start_ray_commands | | Commands that start Ray on each worker. CHIA injects --resources. |
| file_mounts | | |
| rsync_exclude | | Patterns passed to rsync |
| rsync_filter | | Filter files (e.g. |
| | | A cluster-wide default container config (see Container config), which individual node types can override. Specify at most one of the two. |
| aws_nodes | | Provision EC2 instances and tunnel them into the cluster. See Cloud nodes. |
| gcp_nodes | | Provision GCP Compute Engine instances. See Cloud nodes. |
| tunnel_defaults | | Tunnel/port-pinning defaults applied to every auto-tunneled cloud node. See Cloud nodes. |
provider
The provider section declares the head machine.
provider:
  head_ip: ${HEAD_IP}
| Key | Default | Meaning |
|---|---|---|
| head_ip | required | Hostname or IP of the machine that manages the cluster (runs the Ray head). |
auth
The auth section gives the SSH credentials CHIA uses to reach every machine,
with optional per-host overrides.
auth:
  ssh_user: ${USER}
  ssh_private_key: /home/${USER}/.ssh/${USER} # omit if your key is in ssh-agent
  overrides:
    some-host:
      ssh_user: ubuntu
      ssh_private_key: ~/.ssh/other_key
| Key | Default | Meaning |
|---|---|---|
| ssh_user | | Default SSH username for all machines. |
| ssh_private_key | | Default private key path. Omit it if the relevant keys are already loaded into your SSH agent. |
| overrides | | Per-IP overrides, keyed by hostname/IP (or a |
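As a rough illustration of how defaults and per-host overrides combine, the sketch below resolves the credentials for one host. The function name ssh_credentials is hypothetical, and the field-by-field merge (an override that sets only ssh_user still inherits the default key) is an assumption about CHIA's behavior, not something this page states.

```python
def ssh_credentials(auth, host):
    """Return the effective SSH settings for one host."""
    # Start from the cluster-wide defaults (everything except `overrides`).
    creds = {key: value for key, value in auth.items() if key != "overrides"}
    # Assumption: an override replaces only the fields it names.
    creds.update(auth.get("overrides", {}).get(host, {}))
    return creds


auth = {
    "ssh_user": "alice",
    "ssh_private_key": "/home/alice/.ssh/alice",
    "overrides": {
        "some-host": {"ssh_user": "ubuntu", "ssh_private_key": "~/.ssh/other_key"},
    },
}
print(ssh_credentials(auth, "some-host"))
print(ssh_credentials(auth, "machine1"))
```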
available_node_types
Each entry under available_node_types defines a logical worker type: the
Ray resources it advertises, how many of them to run, where they may run, and
the container (if any) they run in.
available_node_types:
  verilator_run:
    resources: {"verilator_run": 8}
    num_workers: 4
    compatible_ips: [machine9, machine10, machine11, machine12]
    worker_env_commands: ["source ~/.bashrc && conda activate chia_env"]
    docker:
      image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
      container_name: "chia-verilator-run-${USER}"
      run_options:
        - --ulimit nofile=65536:65536
        - --shm-size=10.24gb
| Key | Default | Meaning |
|---|---|---|
| resources | | Custom Ray resources advertised by each worker of this type, e.g. |
| num_workers | | How many workers of this type to launch. For Ray-config familiarity, if |
| | optional | Ray-config alias for |
| | optional | Ray-config alias for |
| compatible_ips | required if | The machines this type’s workers may run on. Accepts |
| worker_env_commands | | Per-type environment activation prepended to the worker’s main script on both chia up and chia down. |
| worker_setup_commands | | Per-type one-time setup run during chia up. |
| | | Container config for this type, overriding any cluster-wide default. Specify at most one. See Container config. |
| | | How this type spreads across its eligible IPs: |
Container config
A docker: block may appear cluster-wide at the top level or
inside any node type; the node-type block overrides the cluster-wide one.
docker:
  image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
  container_name: "chia-verilator-run-${USER}"
  pull_before_run: False
  pull_timeout: 3600
  run_options:
    - --ulimit nofile=65536:65536
    - --shm-size=10.24gb
    - "-v $SSH_AUTH_SOCK:/ssh-agent"
  run_setup_commands:
    - cd /home/ray/ && git pull
| Key | Default | Meaning |
|---|---|---|
| image | required | Container image URI |
| container_name | | Base container name. CHIA appends the worker index ( |
| pull_before_run | | Pull the image before running. Set |
| pull_timeout | | Seconds to allow for the pull. Raise it for large images. |
| run_options | | Extra flags passed to |
| run_setup_commands | | Commands run inside the container after it starts, before the worker’s main script (e.g. clone/pull a repo, fix up |
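The override rule for container config (a node-type block wins over the cluster-wide one) can be sketched as a simple lookup. The function name effective_docker is hypothetical, and the sketch assumes the node-type block replaces the cluster-wide block as a whole rather than merging key by key; this page does not say which of the two CHIA does.

```python
def effective_docker(config, type_name):
    """Return the container config that applies to one node type, or None."""
    node_type = config["available_node_types"][type_name]
    # Assumption: a node-type docker block replaces the cluster-wide block
    # entirely instead of being merged into it.
    if "docker" in node_type:
        return node_type["docker"]
    return config.get("docker")


config = {
    "docker": {"image": "ghcr.io/ucb-bar/chia-verilator-run:latest"},
    "available_node_types": {
        "verilator_run": {},
        "custom": {"docker": {"image": "example/other-image:latest"}},
    },
}
print(effective_docker(config, "verilator_run")["image"])
print(effective_docker(config, "custom")["image"])
```

The image name example/other-image:latest is a placeholder for illustration only.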
Note
Scripts run over SSH as a non-interactive login shell (bash --login):
/etc/profile and ~/.bash_profile are sourced, but ~/.bashrc is
not (and many ~/.bashrc files bail out early for non-interactive shells).
If you rely on conda/venv set up in ~/.bashrc, source it explicitly in
head_env_commands / worker_env_commands, e.g.
source ~/.bashrc && conda activate chia_env.
Cloud nodes
CHIA can provision public-cloud machines and reverse-tunnel them to the head. Declare them under aws_nodes (EC2) and/or
gcp_nodes (Compute Engine). Everything downstream of provisioning
(placeholder expansion, tunnel injection, and the tunnels themselves) is
provider-agnostic.
aws_nodes
aws_nodes:
  region: us-east-1
  verilator_run_aws:
    KeyName: my-keypair # an EC2 key pair in your account
    InstanceType: c5.9xlarge
    count: 3
    ImageId: ami-0ec10929233384c7f
    ssh_user: ubuntu
    ssh_private_key: /home/${USER}/my-keypair.pem
    setup_commands:
      - "echo ${GITHUB_TOKEN} | docker login ghcr.io -u myuser --password-stdin"
    BlockDeviceMappings: # passed through to EC2 RunInstances
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
          VolumeType: gp3
region is a section-level key (default us-west-2). Every other key lives
under a named node type:
| Key | Default | Meaning |
|---|---|---|
| KeyName | required | EC2 key pair name (must already exist in the account). |
| InstanceType | required | EC2 instance type (e.g. c5.9xlarge). |
| count | required | Number of instances to launch for this type. |
| ImageId | Ubuntu 22.04 AMI | AMI to launch. |
| ssh_user | | SSH user for the AMI (e.g. ubuntu). |
| ssh_private_key | | Private key path matching KeyName. |
| | | Skip CHIA’s default setup (git/conda/docker install) and run only your setup_commands. |
| setup_commands | | Commands run on the EC2 host before it joins the cluster (appended to the defaults unless skipped). |
| | | Seconds allowed for setup. |
| | | Seconds to wait for SSH to come up. |
| (anything else) | | Unknown keys (e.g. BlockDeviceMappings) are passed through to EC2 RunInstances. |
AWS API access (always required). CHIA picks up credentials by default from ~/.aws/credentials
/ ~/.aws/config. Override these paths with the environment variables AWS_CONFIG_FILE=/path/to/aws/config and AWS_SHARED_CREDENTIALS_FILE=/path/to/aws/credentials before calling chia up. Instances launch into the account’s default VPC.
gcp_nodes
gcp_nodes is the Compute Engine analog of aws_nodes.
gcp_nodes:
  project: your-gcp-project # required
  zone: us-central1-a # default
  gcp_worker:
    machine_type: n1-standard-1
    count: 2
    ssh_user: chia # local user created on the VM
    ssh_public_key: ${HOME}/.ssh/id_ed25519.pub
    ssh_private_key: ${HOME}/.ssh/id_ed25519
    spot: true
    disk_size_gb: 100
project (required), zone (default us-central1-a), and network /
subnetwork (default to the project’s default VPC) are section-level keys.
Every other key lives under a named node type:
| Key | Default | Meaning |
|---|---|---|
| machine_type | required | GCE machine type (e.g. n1-standard-1). |
| count | required | Number of instances to launch for this type. |
| | Ubuntu image | Boot image (family or full image URL). |
| zone | section | Per-type zone override. |
| disk_size_gb | image default | Boot disk size in GB. |
| spot | | Launch as Spot/preemptible VMs (cheaper, can be reclaimed). |
| ssh_user | | Login user CHIA connects as (see Authentication below). |
| ssh_private_key | | Private key path CHIA’s SSH client uses. Recommended (otherwise it falls back to your ssh-agent / |
| ssh_public_key | | Public key injected into the VM (metadata method). |
| use_os_login | | Use OS Login instead of metadata SSH keys (see Authentication below). |
| | | Skip CHIA’s default host setup (git/conda/docker) and run only your |
| | | Commands run on the host before it joins the cluster (appended to the defaults unless skipped). |
| | | Seconds allowed for setup. |
| | | Seconds to wait for SSH to come up. |
| (anything else) | | Merged into the instance definition sent to the Compute API. |
Authentication. A GCP bring-up uses two distinct credentials at two layers:
1. GCP API access (always required). Set it up once with gcloud auth application-default login (or point GOOGLE_APPLICATION_CREDENTIALS at a service-account JSON). A default VPC network must already exist (or set network).
2. SSH into the instance, chosen per node type by use_os_login:
   - Metadata SSH keys (default). CHIA connects to the GCP instance as ssh_user, using the private key that matches ssh_public_key. This is an ordinary keypair (the GCP analog of an AWS KeyName), not tied to any Google identity, and it is silently ignored if the project or org enforces OS Login.
   - OS Login (use_os_login: true). Ties access to a GCP identity via IAM. You must register your SSH key manually (gcloud compute os-login ssh-keys add) and set ssh_user to the derived POSIX username. Use this when your org enforces OS Login.
Referencing cloud nodes (@ placeholders)
Because cloud IPs aren’t known until provisioning, you refer to them by a
placeholder of the form @<node_type>:<index>, where <node_type> is a key
under aws_nodes / gcp_nodes and <index> is 0-based. Placeholders are
valid anywhere an IP is accepted, namely in a node type’s compatible_ips and in
auth.overrides keys:
available_node_types:
  verilator_run_aws:
    resources: {"verilator_run": 32}
    num_workers: 2
    compatible_ips: ["@verilator_run_aws:0", "@verilator_run_aws:1"]
    docker: {...}
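To make the placeholder syntax concrete, here is a minimal sketch of resolving @<node_type>:<index> once provisioning has produced real addresses. The function name resolve_ip and the provisioned mapping (node type to list of IPs) are hypothetical; CHIA's internal data structures may look different. The addresses come from the reserved documentation range.

```python
import re

# @<node_type>:<index>, with a 0-based integer index.
_PLACEHOLDER = re.compile(r"^@([^:]+):(\d+)$")


def resolve_ip(entry, provisioned):
    """Turn a placeholder into an IP; pass ordinary hosts through unchanged."""
    match = _PLACEHOLDER.match(entry)
    if match is None:
        return entry  # already a hostname or IP
    node_type, index = match.group(1), int(match.group(2))
    if node_type not in provisioned:
        raise KeyError(f"no provisioned node type named {node_type!r}")
    ips = provisioned[node_type]
    if index >= len(ips):
        raise IndexError(f"{entry} is out of range: only {len(ips)} instance(s)")
    return ips[index]


# Hypothetical result of provisioning two instances.
provisioned = {"verilator_run_aws": ["203.0.113.10", "203.0.113.11"]}
print(resolve_ip("@verilator_run_aws:1", provisioned))
print(resolve_ip("machine9", provisioned))
```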
tunnel_defaults
For every cloud IP, CHIA automatically adds an auth.overrides entry with a
tunnel (a per-IP auth.overrides[ip].tunnel you set yourself still wins).
tunnel_defaults overrides the default
ports/behavior for all auto-tunneled nodes. It accepts any tunnel field except
tunnel_ip (which CHIA assigns per-worker). Common ones:
tunnel_defaults:
  ray_worker_port_min: 20000
  ray_worker_port_max: 20001
  head_worker_port_min: 21000
  head_worker_port_max: 21001
Other tunnel fields (with defaults) include gcs_tunnel_port (16379),
ray_node_manager_port (16800), ray_object_manager_port (16801),
tool_port_min/max (18000/18010), head_tool_port_min/max
(8000/8010), head_node_manager_port (29800), head_object_manager_port
(29801), kill_orphaned_tunnels (true), and pre_tunnel_commands (sshd
GatewayPorts + file-limit setup, run once per physical cloud IP). A typo in
any field name fails loudly at load time.
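The load-time behavior described above (documented defaults, user overrides, loud failure on typos, and no tunnel_ip) might look roughly like the following. This is a sketch, not CHIA's validator: the function name merge_tunnel_defaults is hypothetical, the split of tool_port_min/max into two field names is inferred from the naming pattern of the other fields, and fields whose defaults are not listed on this page are only checked by name.

```python
# Defaults as listed on this page.
DOCUMENTED_DEFAULTS = {
    "gcs_tunnel_port": 16379,
    "ray_node_manager_port": 16800,
    "ray_object_manager_port": 16801,
    "tool_port_min": 18000,
    "tool_port_max": 18010,
    "head_tool_port_min": 8000,
    "head_tool_port_max": 8010,
    "head_node_manager_port": 29800,
    "head_object_manager_port": 29801,
    "kill_orphaned_tunnels": True,
}
# Fields named on this page whose default values are not spelled out.
OTHER_FIELDS = {
    "ray_worker_port_min",
    "ray_worker_port_max",
    "head_worker_port_min",
    "head_worker_port_max",
    "pre_tunnel_commands",
}


def merge_tunnel_defaults(user_defaults):
    """Validate tunnel_defaults and overlay them on the documented defaults."""
    if "tunnel_ip" in user_defaults:
        raise ValueError("tunnel_ip is assigned per worker and cannot be set here")
    known = set(DOCUMENTED_DEFAULTS) | OTHER_FIELDS
    unknown = sorted(set(user_defaults) - known)
    if unknown:
        raise ValueError("unknown tunnel field(s): " + ", ".join(unknown))
    return {**DOCUMENTED_DEFAULTS, **user_defaults}


merged = merge_tunnel_defaults({"ray_worker_port_min": 20000, "gcs_tunnel_port": 17000})
print(merged["gcs_tunnel_port"], merged["ray_node_manager_port"])
```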
A mixed on-prem + cloud example
This config keeps the head and several worker types on owned machines
while bursting Verilator simulation onto AWS. To run purely on-prem,
delete the aws_nodes section and the @verilator_run_aws:* placeholders;
to add more cloud capacity, raise count and add matching placeholders.
cluster_name: ChiaClusterExample
available_node_types:
  # On-prem Verilator workers, pinned to specific machines.
  verilator_run:
    resources: {"verilator_run": 8}
    num_workers: 4
    compatible_ips: [machine0, machine1, machine2, machine2]
    worker_env_commands: ["source ~/.bashrc && conda activate chia_env"]
    docker:
      image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
      container_name: "chia-verilator-run-${USER}"
      run_options:
        - --ulimit nofile=65536:65536
        - --shm-size=10.24gb
  # Cloud Verilator workers, provisioned by the aws_nodes block below.
  verilator_run_aws:
    resources: {"verilator_run": 32}
    num_workers: 3
    compatible_ips:
      - "@verilator_run_aws:0"
      - "@verilator_run_aws:1"
      - "@verilator_run_aws:2"
    docker:
      image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
      container_name: "chia-verilator-run-${USER}"
      pull_before_run: True
      run_options:
        - --ulimit nofile=65536:65536
        - --shm-size=10.24gb
  # On-prem VLSI workers (no container; uses the host conda env).
  vlsi:
    resources: {"VLSI": 1, "syn": 1, "cacti": 4}
    num_workers: 6
    compatible_ips: [machine1, machine2, machine3, machine4, machine5, machine6]
    worker_env_commands:
      - "source ~/.bashrc && source /ecad/tools/vlsi.bashrc && conda activate chia_env"
# Provision the cloud half of the cluster.
aws_nodes:
  region: us-east-1
  verilator_run_aws:
    KeyName: my-keypair
    InstanceType: c5.9xlarge
    count: 3
    ImageId: ami-0ec10929233384c7f
    ssh_user: ubuntu
    ssh_private_key: /home/${USER}/my-keypair.pem
    setup_commands:
      - "echo ${GITHUB_TOKEN} | docker login ghcr.io -u myuser --password-stdin"
    BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
          VolumeType: gp3
# Pin the tunnel ports for the cloud nodes.
tunnel_defaults:
  ray_worker_port_min: 20000
  ray_worker_port_max: 20001
  head_worker_port_min: 21000
  head_worker_port_max: 21001
provider:
  head_ip: machine7
  # No worker_ips: the worker pool is the union of every node type's
  # compatible_ips above (on-prem hosts + the cloud @-placeholders).
auth:
  ssh_user: ${USER}
  ssh_private_key: /home/${USER}/.ssh/${USER}
head_env_commands: ["source ~/.bashrc && conda activate chia_env"]
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --include-dashboard=True --dashboard-agent-listen-port=0
worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --dashboard-agent-listen-port=0
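When raising count or adding placeholders in a config like this one, it is easy to let the two drift apart. The following sketch checks that every @ placeholder points at a declared cloud node type and stays below its count. The function name check_placeholders is hypothetical and the check is not part of CHIA as far as this page says; it operates on the config after it has been parsed into a plain dictionary.

```python
import re

_PLACEHOLDER = re.compile(r"^@([^:]+):(\d+)$")


def check_placeholders(config):
    """Return a list of human-readable problems; an empty list means consistent."""
    # Collect count per cloud node type, skipping section-level keys like region.
    counts = {}
    for section in ("aws_nodes", "gcp_nodes"):
        for name, spec in config.get(section, {}).items():
            if isinstance(spec, dict) and "count" in spec:
                counts[name] = spec["count"]

    problems = []
    for type_name, node_type in config.get("available_node_types", {}).items():
        for entry in node_type.get("compatible_ips", []):
            match = _PLACEHOLDER.match(str(entry))
            if match is None:
                continue  # ordinary hostname or IP
            cloud_type, index = match.group(1), int(match.group(2))
            if cloud_type not in counts:
                problems.append(f"{type_name}: {entry} names an undeclared cloud type")
            elif index >= counts[cloud_type]:
                problems.append(
                    f"{type_name}: {entry} exceeds count={counts[cloud_type]}"
                )
    return problems


config = {
    "available_node_types": {
        "verilator_run_aws": {
            "compatible_ips": ["@verilator_run_aws:0", "@verilator_run_aws:3"],
        },
    },
    "aws_nodes": {"region": "us-east-1", "verilator_run_aws": {"count": 3}},
}
for problem in check_placeholders(config):
    print(problem)
```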
Command execution order
When you run chia up, CHIA sets up the head node, assigns each declared
worker to a machine (constrained compatible_ips types first, then
unconstrained — see assign_nodes in chia/cluster/config.py), establishes
SSH tunnels for any cloud nodes, and then sets up the workers (in parallel
across machines, sequentially within a machine).
chia up — head node
All head commands run on the host over SSH (the head node is never containerized):
1. initialization_commands      ← each in its own SSH session
2. file_mounts rsync            ← separate rsync processes
3. ┌─── single SSH session (env persists) ───┐
   │ head_env_commands                       │  e.g. conda activate
   │ setup_commands                          │  global
   │ head_setup_commands                     │
   │ head_start_ray_commands                 │  ray stop; ray start --head
   └─────────────────────────────────────────┘
chia up — worker node
Without a container, everything runs on the host:
1. initialization_commands      ← each in its own SSH session
2. file_mounts rsync            ← separate rsync processes
3. ┌─── single SSH session (env persists) ───────────────┐
   │ <type>.worker_env_commands                          │  per node type
   │ setup_commands                                      │  global
   │ <type>.worker_setup_commands                        │  per node type
   │ export RAY_HEAD_IP=...                              │
   │ worker_start_ray_commands (--resources injected)    │
   └─────────────────────────────────────────────────────┘
With a container, the host pulls/starts the container first, then the main script runs inside it:
1. initialization_commands      ← HOST, each in its own SSH session
2. file_mounts rsync            ← HOST, separate rsync processes
3. docker setup                 ← HOST (pull, run, run_setup_commands inside)
4. ┌─── single session INSIDE CONTAINER (env persists) ────┐
   │ <type>.worker_env_commands                            │
   │ setup_commands                                        │
   │ <type>.worker_setup_commands                          │
   │ export RAY_HEAD_IP=...                                │
   │ worker_start_ray_commands (--resources injected)      │
   └───────────────────────────────────────────────────────┘
For cloud workers, CHIA additionally runs pre_tunnel_commands once per
physical cloud IP and brings up the reverse SSH tunnel before the worker’s main
script, and pins the Ray ports in worker_start_ray_commands.
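The ordering of the worker's main script can be summarized as a small function that assembles the command list. This is a sketch under assumptions, not CHIA's implementation: the name worker_script is hypothetical, and appending --resources (a real ray start flag that takes a JSON string) to every command beginning with "ray start" is a guess at how the injection works.

```python
import json


def worker_script(config, type_name, head_ip):
    """Assemble the ordered command list for one worker's main script."""
    node_type = config["available_node_types"][type_name]
    resources = json.dumps(node_type.get("resources", {}))

    lines = []
    lines += node_type.get("worker_env_commands", [])    # per node type
    lines += config.get("setup_commands", [])            # global
    lines += node_type.get("worker_setup_commands", [])  # per node type
    lines.append(f"export RAY_HEAD_IP={head_ip}")
    for command in config.get("worker_start_ray_commands", []):
        # Assumption: resources are appended to each `ray start` command.
        if command.startswith("ray start"):
            command = f"{command} --resources='{resources}'"
        lines.append(command)
    return lines


config = {
    "available_node_types": {
        "verilator_run": {
            "resources": {"verilator_run": 8},
            "worker_env_commands": ["source ~/.bashrc && conda activate chia_env"],
        },
    },
    "worker_start_ray_commands": [
        "ray stop",
        "ray start --address=$RAY_HEAD_IP:6379 --dashboard-agent-listen-port=0",
    ],
}
for line in worker_script(config, "verilator_run", "machine7"):
    print(line)
```

Because all of these commands run in one session, the environment activated by the first line is still in effect when ray start runs.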
chia down
Workers are torn down first (in parallel), then the head:
Workers (with container):
  1. docker exec:
       <type>.worker_env_commands
       ray stop
  2. docker stop <container>
  3. docker rm -f <container>
Workers (no container):
  1. ┌─ single SSH session ───────┐
     │ <type>.worker_env_commands │
     │ ray stop                   │
     └────────────────────────────┘
Head (after all workers):
  ┌─── single SSH session ────┐
  │ head_env_commands         │
  │ head_teardown_commands    │
  │ ray stop                  │
  └───────────────────────────┘
Note
head_env_commands and the per-type worker_env_commands run on both
chia up and chia down — they are for environment activation. Use the
*_setup_commands hooks for one-time setup. When head_ip is also listed
in a node type’s compatible_ips (so the head also hosts a worker) and that
worker isn’t containerized, CHIA skips ray stop on the worker so it doesn’t
kill the head’s Ray process.
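The teardown rules for a single worker, including the special case where a non-containerized worker shares a machine with the head, can be sketched as follows. The function name worker_teardown and its (where, commands) return shape are hypothetical and chosen only for illustration.

```python
def worker_teardown(env_commands, host, head_ip, container_name=None):
    """Return (where, commands) steps for tearing down one worker."""
    if container_name is not None:
        # Containerized worker: stop Ray inside the container, then remove it.
        return [
            ("docker exec", list(env_commands) + ["ray stop"]),
            ("host", [f"docker stop {container_name}"]),
            ("host", [f"docker rm -f {container_name}"]),
        ]
    commands = list(env_commands)
    # A host worker on the head machine shares the head's Ray process,
    # so `ray stop` is left to the head's own teardown.
    if host != head_ip:
        commands.append("ray stop")
    return [("ssh session", commands)]


env = ["source ~/.bashrc && conda activate chia_env"]
print(worker_teardown(env, "machine1", "machine7"))
print(worker_teardown(env, "machine7", "machine7"))
```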