Cluster Configuration Reference

A CHIA cluster is described by a single YAML file that you pass to chia up and chia down. This page is a complete reference for every key CHIA reads, a worked example that mixes on-premise and cloud machines, and a walk-through of the exact order in which CHIA runs your commands when it brings a cluster up and tears it down.

Note

CHIA’s YAML deliberately resembles the Ray cluster launcher config so existing Ray configs feel familiar, but it adds support for heterogeneous on-premise setups and for clusters split across on-premise and cloud providers. CHIA currently does not support the following Ray autoscaler keys: min_workers / max_workers, upscaling_speed, idle_timeout_minutes, cluster_synced_files, file_mounts_sync_continuously, provider.type, provider.external_head_ip, and provider.coordinator_address.

Top-level structure

At the top level a config is organized into a handful of sections:

cluster_name: MyCluster          # identifier for this cluster

provider:                        # head machine (required)
    head_ip: ...

auth:                            # how to SSH into the machines
    ssh_user: ...
    ssh_private_key: ...

available_node_types:            # logical worker types + their resources
    my_worker:
        ...

aws_nodes:                       # optional: provision EC2 instances
    ...
gcp_nodes:                       # optional: provision GCP instances
    ...
tunnel_defaults:                 # optional: tunnel/port tuning for cloud nodes
    ...

# lifecycle command hooks (see "Command execution order" below)
initialization_commands: [...]
head_env_commands: [...]
setup_commands: [...]
head_setup_commands: [...]
head_teardown_commands: [...]
head_start_ray_commands: [...]
worker_start_ray_commands: [...]

# file syncing
file_mounts: {...}
rsync_exclude: [...]
rsync_filter: [...]

Any ${VAR} reference in a string value is expanded from your environment when the config is loaded (e.g. ${USER}). A bare $VAR is left as-is so it can be evaluated later on the remote shell.
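As a small illustration of the two forms, the fragment below (a sketch that reuses only keys and commands shown elsewhere on this page) puts a braced reference next to a bare one. ${USER} should be replaced from your local environment when the config is loaded, while $RAY_HEAD_IP is passed through unchanged and evaluated by the remote shell.

```yaml
auth:
    ssh_user: ${USER}        # braced: expanded locally when the config is loaded

worker_start_ray_commands:
    # bare $RAY_HEAD_IP: left as-is, evaluated later on the remote shell
    - ray start --address=$RAY_HEAD_IP:6379
```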

Top-level keys

| Key | Default | Meaning |
| --- | --- | --- |
| cluster_name | "default" | Identifier for the head and workers of this cluster. |
| provider | required | Cluster head node. See provider. |
| auth | {} | SSH credentials for reaching the machines. See auth. |
| available_node_types | {} | Logical worker types, their Ray resources, and container images. See available_node_types. |
| initialization_commands | [] | Commands run first, each in its own SSH session, on the host (outside any container). They do not share an environment with each other or with later steps. |
| setup_commands | [] | Global setup commands run inside the main script session on every node (head and workers). For containerized workers, they run inside the container. |
| head_env_commands | [] | Environment activation prepended to the head’s main script on both chia up and chia down (e.g. source ~/.bashrc && conda activate chia_env). |
| head_setup_commands | [] | Head-only one-time setup, run during chia up in the head’s main script. |
| head_teardown_commands | [] | Head-only commands run during chia down before ray stop. |
| head_start_ray_commands | [] | Commands that start Ray on the head (typically ray stop then ray start --head ...). |
| worker_start_ray_commands | [] | Commands that start Ray on each worker. CHIA injects --resources (and, for tunneled cloud workers, pinned ports) automatically. |
| file_mounts | {} | {remote_path: local_path} directories rsync’d to each node before the main script. |
| rsync_exclude | [] | Patterns passed to rsync --exclude (e.g. **/.git). |
| rsync_filter | [] | Filter files (e.g. .gitignore) passed to rsync --filter. |
| docker | None | A cluster-wide default container config (see Container config), which individual node types can override. Specify at most one of the two. |
| aws_nodes | None | Provision EC2 instances and tunnel them into the cluster. See Cloud nodes. |
| gcp_nodes | None | Provision GCP Compute Engine instances. See Cloud nodes. |
| tunnel_defaults | None | Tunnel/port-pinning defaults applied to every auto-tunneled cloud node. See Cloud nodes. |
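The file-syncing keys are the only top-level keys without an example on this page, so here is a minimal sketch of how they might be combined. The project path is hypothetical; the mapping direction ({remote_path: local_path}) and the sample patterns come from the table above.

```yaml
file_mounts:
    # remote_path: local_path (hypothetical project directory)
    /home/${USER}/my_project: /home/${USER}/my_project

rsync_exclude:
    - "**/.git"          # quoted: a leading * would otherwise be read as a YAML alias

rsync_filter:
    - ".gitignore"       # filter file handed to rsync --filter
```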

provider

The provider section declares the head machine.

provider:
    head_ip: ${HEAD_IP}

| Key | Default | Meaning |
| --- | --- | --- |
| head_ip | required | Hostname or IP of the machine that manages the cluster (runs the Ray head). |

auth

The auth section gives the SSH credentials CHIA uses to reach every machine, with optional per-host overrides.

auth:
    ssh_user: ${USER}
    ssh_private_key: /home/${USER}/.ssh/${USER}   # omit if your key is in ssh-agent
    overrides:
        some-host:
            ssh_user: ubuntu
            ssh_private_key: ~/.ssh/other_key

| Key | Default | Meaning |
| --- | --- | --- |
| ssh_user | "" | Default SSH username for all machines. |
| ssh_private_key | None | Default private key path. Omit it if the relevant keys are already loaded into your SSH agent. |
| overrides | {} | Per-IP overrides, keyed by hostname/IP (or a @node_type:index placeholder). Each entry may set ssh_user, ssh_private_key, and a tunnel block (see Cloud nodes). CHIA populates tunnel overrides for cloud nodes automatically. |

available_node_types

Each entry under available_node_types defines a logical worker type: the Ray resources it advertises, how many of them to run, where they may run, and the container (if any) they run in.

available_node_types:
    verilator_run:
        resources: {"verilator_run": 8}
        num_workers: 4
        compatible_ips: [machine9, machine10, machine11, machine12]
        worker_env_commands: ["source ~/.bashrc && conda activate chia_env"]
        docker:
            image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
            container_name: "chia-verilator-run-${USER}"
            run_options:
                - --ulimit nofile=65536:65536
                - --shm-size=10.24gb

| Key | Default | Meaning |
| --- | --- | --- |
| resources | {} | Custom Ray resources advertised by each worker of this type, e.g. {"verilator_run": 8}. A @ChiaFunction requesting these resources is scheduled onto a matching worker, consuming the amount it requests. |
| num_workers | 1 | How many workers of this type to launch. For Ray-config familiarity, if num_workers is absent CHIA falls back to max_workers, then min_workers, then 1. |
| max_workers | optional | Ray-config alias for num_workers, used only when num_workers is absent (and it takes precedence over min_workers). CHIA does not autoscale: this sets a single fixed worker count, not an upper bound. |
| min_workers | optional | Ray-config alias for num_workers, used only when both num_workers and max_workers are absent. Not a lower bound; if you set min_workers and max_workers to different values, min_workers is ignored and max_workers wins. |
| compatible_ips | required if num_workers > 0 | The machines this type’s workers may run on. Accepts @node_type:index placeholders. |
| worker_env_commands | [] | Per-type environment activation prepended to the worker’s main script on both chia up and chia down. Runs inside the container for containerized types. |
| worker_setup_commands | [] | Per-type one-time setup run during chia up in the worker’s main script. |
| docker | None | Container config for this type, overriding any cluster-wide default. Specify at most one. See Container config. |
| balance_level | "cluster" | How this type spreads across its eligible IPs: cluster packs around whatever other types already placed (fewest nodes globally); worker distributes this type’s own workers as evenly as possible across its IP pool, regardless of whether an IP is shared with another type. |
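To make the worker-count fallback and balance_level concrete, the sketch below shows a node type written Ray-style. The type name, resource name, and host names are hypothetical. Because num_workers is absent, the rules in the table imply that CHIA launches a fixed 4 workers (max_workers wins and min_workers is ignored), and balance_level: worker asks for them to be spread as evenly as possible over the two listed hosts.

```yaml
available_node_types:
    sim_worker:                          # hypothetical type name
        resources: {"sim": 2}            # hypothetical resource
        min_workers: 1                   # ignored because max_workers is set
        max_workers: 4                   # used as the fixed worker count
        compatible_ips: [hostA, hostB]   # hypothetical hosts
        balance_level: worker            # spread this type evenly over its own IP pool
```

Setting num_workers: 4 directly expresses the same count more plainly and is the clearer choice for configs that are not being ported from Ray.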

Container config

A docker: block may appear cluster-wide at the top level or inside any node type; the node-type block overrides the cluster-wide one.

docker:
    image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
    container_name: "chia-verilator-run-${USER}"
    pull_before_run: False
    pull_timeout: 3600
    run_options:
        - --ulimit nofile=65536:65536
        - --shm-size=10.24gb
        - "-v $SSH_AUTH_SOCK:/ssh-agent"
    run_setup_commands:
        - cd /home/ray/ && git pull

| Key | Default | Meaning |
| --- | --- | --- |
| image | required | Container image URI. |
| container_name | "chia_container" | Base container name. CHIA appends the worker index (-0, -1, …) so multiple workers of a type don’t collide. Include ${USER} on shared machines. |
| pull_before_run | True | Pull the image before running. Set False to use a cached image. |
| pull_timeout | 600 | Seconds to allow for the pull. Raise it for large images. |
| run_options | [] | Extra flags passed to docker run (ulimits, shm size, volume mounts, --user, env vars, …). |
| run_setup_commands | [] | Commands run inside the container after it starts, before the worker’s main script (e.g. clone/pull a repo, fix up /etc/passwd). |
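The override relationship between the two placements can be sketched as follows. The second type name, the container names, and the host assignments are hypothetical. The per-type block is written out in full rather than relying on any field-by-field merge with the cluster-wide default, since this page does not say whether the two blocks are merged.

```yaml
docker:                                  # cluster-wide default
    image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
    container_name: "chia-default-${USER}"        # hypothetical name
    pull_before_run: False

available_node_types:
    verilator_run:                       # no docker block: uses the default above
        resources: {"verilator_run": 8}
        num_workers: 2
        compatible_ips: [machine9, machine10]

    verilator_big:                       # hypothetical type with its own container config
        resources: {"verilator_run": 32}
        num_workers: 1
        compatible_ips: [machine11]
        docker:
            image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
            container_name: "chia-verilator-big-${USER}"
            run_options:
                - --shm-size=10.24gb
```

Given the index suffix described for container_name, the two verilator_run workers would be expected to get container names ending in -0 and -1.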

Note

Scripts run over SSH as a non-interactive login shell (bash --login): /etc/profile and ~/.bash_profile are sourced, but ~/.bashrc is not (and many ~/.bashrc files bail out early for non-interactive shells). If you rely on conda/venv set up in ~/.bashrc, source it explicitly in head_env_commands / worker_env_commands, e.g. source ~/.bashrc && conda activate chia_env.

Cloud nodes

CHIA can provision public-cloud machines and reverse-tunnel them to the head. Declare them under aws_nodes (EC2) and/or gcp_nodes (Compute Engine). Everything downstream of provisioning (placeholder expansion, tunnel injection, and the tunnels themselves) is provider-agnostic.

aws_nodes

aws_nodes:
    region: us-east-1
    verilator_run_aws:
        KeyName: my-keypair          # an EC2 key pair in your account
        InstanceType: c5.9xlarge
        count: 3
        ImageId: ami-0ec10929233384c7f
        ssh_user: ubuntu
        ssh_private_key: /home/${USER}/my-keypair.pem
        setup_commands:
            - "echo ${GITHUB_TOKEN} | docker login ghcr.io -u myuser --password-stdin"
        BlockDeviceMappings:         # passed through to EC2 RunInstances
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 500
                  VolumeType: gp3

region is a section-level key (default us-west-2). Every other key lives under a named node type:

| Key | Default | Meaning |
| --- | --- | --- |
| KeyName | required | EC2 key pair name (must already exist in the account). |
| InstanceType | required | EC2 instance type (e.g. c5.9xlarge). |
| count | required | Number of instances to launch for this type. |
| ImageId | Ubuntu 22.04 AMI | AMI to launch. |
| ssh_user | None | SSH user for the AMI (e.g. ubuntu). Injected into the auth override for each provisioned IP. |
| ssh_private_key | None | Private key path matching KeyName. Injected into the auth override. |
| skip_default_setup | False | Skip CHIA’s default setup (git/conda/docker install) and run only your setup_commands. |
| setup_commands | [] | Commands run on the EC2 host before it joins the cluster (appended to the defaults unless skipped). |
| setup_timeout | 1800 | Seconds allowed for setup. |
| ssh_timeout | 120 | Seconds to wait for SSH to come up. |
| (anything else) | | Unknown keys (e.g. BlockDeviceMappings, UserData) are passed straight through to the EC2 RunInstances call. |

AWS API access (always required). CHIA picks up credentials by default from ~/.aws/credentials / ~/.aws/config. Override these paths with the environment variables AWS_CONFIG_FILE=/path/to/aws/config and AWS_SHARED_CREDENTIALS_FILE=/path/to/aws/credentials before calling chia up. Instances launch into the account’s default VPC.
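If your AMI already has git, conda, and Docker installed, the setup-related keys above might be combined as in this sketch. The node type name is hypothetical, the ImageId is a stand-in for your own pre-baked image, and the timeout value is only an example.

```yaml
aws_nodes:
    region: us-east-1
    prebaked_worker:                     # hypothetical node type name
        KeyName: my-keypair
        InstanceType: c5.9xlarge
        count: 1
        ImageId: ami-0ec10929233384c7f   # stand-in: substitute your pre-baked AMI
        ssh_user: ubuntu
        ssh_private_key: /home/${USER}/my-keypair.pem
        skip_default_setup: True         # skip CHIA's git/conda/docker install
        setup_commands:                  # with the defaults skipped, only these run
            - "docker --version"
        setup_timeout: 600               # example value; the default is 1800
```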

gcp_nodes

gcp_nodes is the Compute Engine analog of aws_nodes.

gcp_nodes:
    project: your-gcp-project        # required
    zone: us-central1-a              # default
    gcp_worker:
        machine_type: n1-standard-1
        count: 2
        ssh_user: chia                          # local user created on the VM
        ssh_public_key: ${HOME}/.ssh/id_ed25519.pub
        ssh_private_key: ${HOME}/.ssh/id_ed25519
        spot: true
        disk_size_gb: 100

project (required), zone (default us-central1-a), and network / subnetwork (default to the project’s default VPC) are section-level keys. Every other key lives under a named node type:

| Key | Default | Meaning |
| --- | --- | --- |
| machine_type | required | GCE machine type (e.g. n1-standard-1). |
| count | required | Number of instances to launch for this type. |
| image | Ubuntu image | Boot image (family or full image URL). |
| zone | section zone | Per-type zone override. |
| disk_size_gb | image default | Boot disk size in GB. |
| spot | False | Launch as Spot/preemptible VMs (cheaper, can be reclaimed). |
| ssh_user | None | Login user CHIA connects as (see Authentication below). |
| ssh_private_key | None | Private key path CHIA’s SSH client uses. Recommended (otherwise it falls back to your ssh-agent / ~/.ssh defaults). |
| ssh_public_key | None | Public key injected into the VM (metadata method). |
| use_os_login | False | Use OS Login instead of metadata SSH keys (see Authentication below). |
| skip_default_setup | False | Skip CHIA’s default host setup (git/conda/docker) and run only your setup_commands. |
| setup_commands | [] | Commands run on the host before it joins the cluster (appended to the defaults unless skipped). |
| setup_timeout | 1800 | Seconds allowed for setup. |
| ssh_timeout | 120 | Seconds to wait for SSH to come up. |
| (anything else) | | Merged into the instance definition sent to the Compute API. |

Authentication. A GCP bring-up uses two distinct credentials at two layers:

  • GCP API access (always required). Set it up once with gcloud auth application-default login (or point GOOGLE_APPLICATION_CREDENTIALS at a service-account JSON). A default VPC network must already exist (or set network).

  • SSH into the instance, chosen per node type by use_os_login:

    • Metadata SSH keys (default). CHIA connects to the instance as ssh_user, using the private key (ssh_private_key) that matches ssh_public_key. This is an ordinary keypair (the GCP analog of an AWS KeyName), not tied to any Google identity, and it is silently ignored if the project or org enforces OS Login.

    • OS Login (use_os_login: true). Ties access to a GCP identity via IAM. You must register your SSH key manually (gcloud compute os-login ssh-keys add) and set ssh_user to the derived POSIX username. Use this when your org enforces OS Login.
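For comparison with the metadata-key example earlier in this section, an OS Login node type might look like the sketch below. The ssh_user value is a placeholder for whatever POSIX username your Google identity maps to, and ssh_public_key is left out on the assumption that it applies only to the metadata method, since the key is registered through gcloud instead.

```yaml
gcp_nodes:
    project: your-gcp-project
    zone: us-central1-a
    gcp_worker:
        machine_type: n1-standard-1
        count: 2
        use_os_login: true
        ssh_user: jane_example_com               # placeholder: your derived POSIX username
        ssh_private_key: ${HOME}/.ssh/id_ed25519 # key previously registered with OS Login
```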

Referencing cloud nodes (@ placeholders)

Because cloud IPs aren’t known until provisioning, you refer to them by a placeholder of the form @<node_type>:<index>, where <node_type> is a key under aws_nodes / gcp_nodes and <index> is 0-based. Placeholders are valid anywhere an IP is, namely in a node type’s compatible_ips and in auth.overrides keys:

available_node_types:
    verilator_run_aws:
        resources: {"verilator_run": 32}
        num_workers: 2
        compatible_ips: ["@verilator_run_aws:0", "@verilator_run_aws:1"]
        docker: {...}
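The example above covers compatible_ips; the other place a placeholder can appear is as an auth.overrides key. The following sketch overrides a single tunnel field for one cloud node. The port value is hypothetical, and the field name is taken from the tunnel field list under tunnel_defaults.

```yaml
auth:
    ssh_user: ${USER}
    overrides:
        "@verilator_run_aws:1":          # quoted: a leading @ is reserved in YAML
            tunnel:
                gcs_tunnel_port: 16380   # hypothetical value; the default is 16379
```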

tunnel_defaults

For every cloud IP, CHIA automatically adds an auth.overrides entry with a tunnel (a per-IP auth.overrides[ip].tunnel you set yourself still wins). tunnel_defaults overrides the default ports/behavior for all auto-tunneled nodes. It accepts any tunnel field except tunnel_ip (which CHIA assigns per-worker). Common ones:

tunnel_defaults:
    ray_worker_port_min: 20000
    ray_worker_port_max: 20001
    head_worker_port_min: 21000
    head_worker_port_max: 21001

Other tunnel fields (with defaults) include gcs_tunnel_port (16379), ray_node_manager_port (16800), ray_object_manager_port (16801), tool_port_min/max (18000/18010), head_tool_port_min/max (8000/8010), head_node_manager_port (29800), head_object_manager_port (29801), kill_orphaned_tunnels (true), and pre_tunnel_commands (sshd GatewayPorts + file-limit setup, run once per physical cloud IP). A typo in any field name fails loudly at load time.
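A tunnel_defaults block that touches some of these less common fields might look like the sketch below. The key names tool_port_min and tool_port_max are an assumption, expanded from the "tool_port_min/max" shorthand by analogy with ray_worker_port_min / ray_worker_port_max, and the widened port range is an arbitrary example.

```yaml
tunnel_defaults:
    gcs_tunnel_port: 16379           # shown at its default
    tool_port_min: 18000             # assumed key name (from "tool_port_min/max")
    tool_port_max: 18020             # assumed key name; example range wider than the default
    kill_orphaned_tunnels: true      # shown at its default
```

If an assumed key name is wrong, the load-time check described above should report it rather than ignore it.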

A mixed on-prem + cloud example

This config keeps the head and several worker types on owned machines while bursting Verilator simulation onto AWS. To run purely on-prem, delete the aws_nodes section and the @verilator_run_aws:* placeholders; to add more cloud capacity, raise count and add matching placeholders.

cluster_name: ChiaClusterExample

available_node_types:

    # On-prem Verilator workers, pinned to specific machines.
    verilator_run:
        resources: {"verilator_run": 8}
        num_workers: 4
        compatible_ips: [machine0, machine1, machine2, machine3]
        worker_env_commands: ["source ~/.bashrc && conda activate chia_env"]
        docker:
            image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
            container_name: "chia-verilator-run-${USER}"
            run_options:
                - --ulimit nofile=65536:65536
                - --shm-size=10.24gb

    # Cloud Verilator workers — provisioned by the aws_nodes block below.
    verilator_run_aws:
        resources: {"verilator_run": 32}
        num_workers: 3
        compatible_ips:
            - "@verilator_run_aws:0"
            - "@verilator_run_aws:1"
            - "@verilator_run_aws:2"
        docker:
            image: "ghcr.io/ucb-bar/chia-verilator-run:latest"
            container_name: "chia-verilator-run-${USER}"
            pull_before_run: True
            run_options:
                - --ulimit nofile=65536:65536
                - --shm-size=10.24gb

    # On-prem VLSI workers (no container; uses the host conda env).
    vlsi:
        resources: {"VLSI": 1, "syn": 1, "cacti": 4}
        num_workers: 6
        compatible_ips: [machine1, machine2, machine3, machine4, machine5, machine6]
        worker_env_commands:
            - "source ~/.bashrc && source /ecad/tools/vlsi.bashrc && conda activate chia_env"

# Provision the cloud half of the cluster.
aws_nodes:
    region: us-east-1
    verilator_run_aws:
        KeyName: my-keypair
        InstanceType: c5.9xlarge
        count: 3
        ImageId: ami-0ec10929233384c7f
        ssh_user: ubuntu
        ssh_private_key: /home/${USER}/my-keypair.pem
        setup_commands:
            - "echo ${GITHUB_TOKEN} | docker login ghcr.io -u myuser --password-stdin"
        BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 500
                  VolumeType: gp3

# Pin the tunnel ports for the cloud nodes.
tunnel_defaults:
    ray_worker_port_min: 20000
    ray_worker_port_max: 20001
    head_worker_port_min: 21000
    head_worker_port_max: 21001

provider:
    type: local
    head_ip: machine7
    # No worker_ips: the worker pool is the union of every node type's
    # compatible_ips above (on-prem hosts + the cloud @-placeholders).

auth:
    ssh_user: ${USER}
    ssh_private_key: /home/${USER}/.ssh/${USER}

head_env_commands: ["source ~/.bashrc && conda activate chia_env"]

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --include-dashboard=True --dashboard-agent-listen-port=0

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --dashboard-agent-listen-port=0

Command execution order

When you run chia up, CHIA sets up the head node, assigns each declared worker to a machine (constrained compatible_ips types first, then unconstrained — see assign_nodes in chia/cluster/config.py), establishes SSH tunnels for any cloud nodes, and then sets up the workers (in parallel across machines, sequentially within a machine).

chia up — head node

All head commands run on the host over SSH (the head node is never containerized):

1. initialization_commands        ← each in its own SSH session
2. file_mounts rsync              ← separate rsync processes
3. ┌─── single SSH session (env persists) ───┐
   │ head_env_commands                       │  e.g. conda activate
   │ setup_commands                          │  global
   │ head_setup_commands                     │
   │ head_start_ray_commands                 │  ray stop; ray start --head
   └─────────────────────────────────────────┘
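One practical consequence of this ordering is that shell state set in initialization_commands is gone by the time step 3 runs, whereas anything done in head_env_commands is still in effect for the hooks that follow it. A minimal sketch, in which the directory, variable, and requirements path are hypothetical:

```yaml
initialization_commands:
    - mkdir -p /tmp/chia_logs        # own SSH session; changes on disk persist
    - export MY_FLAG=1               # lost: this session ends right after the command

head_env_commands:
    - source ~/.bashrc && conda activate chia_env   # stays active for the hooks below

head_setup_commands:
    - pip install -r ~/my_project/requirements.txt  # runs with chia_env active

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379
```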

chia up — worker node

Without a container, everything runs on the host:

1. initialization_commands        ← each in its own SSH session
2. file_mounts rsync              ← separate rsync processes
3. ┌─── single SSH session (env persists) ───────────────┐
   │ <type>.worker_env_commands                          │  per node type
   │ setup_commands                                      │  global
   │ <type>.worker_setup_commands                        │  per node type
   │ export RAY_HEAD_IP=...                              │
   │ worker_start_ray_commands  (--resources injected)   │
   └─────────────────────────────────────────────────────┘

With a container, the host pulls/starts the container first, then the main script runs inside it:

1. initialization_commands        ← HOST, each in its own SSH session
2. file_mounts rsync              ← HOST, separate rsync processes
3. docker setup                   ← HOST (pull, run, run_setup_commands inside)
4. ┌─── single session INSIDE CONTAINER (env persists) ────┐
   │ <type>.worker_env_commands                            │
   │ setup_commands                                        │
   │ <type>.worker_setup_commands                          │
   │ export RAY_HEAD_IP=...                                │
   │ worker_start_ray_commands  (--resources injected)     │
   └───────────────────────────────────────────────────────┘

For cloud workers, CHIA additionally runs pre_tunnel_commands once per physical cloud IP and brings up the reverse SSH tunnel before the worker’s main script, and pins the Ray ports in worker_start_ray_commands.

chia down

Workers are torn down first (in parallel), then the head:

Workers (with container):              Workers (no container):
1. docker exec:                        1. ┌─ single SSH session ───────┐
     <type>.worker_env_commands           │ <type>.worker_env_commands │
     ray stop                             │ ray stop                   │
2. docker stop <container>                └────────────────────────────┘
3. docker rm -f <container>

Head (after all workers):
┌─── single SSH session ────┐
│ head_env_commands         │
│ head_teardown_commands    │
│ ray stop                  │
└───────────────────────────┘

Note

head_env_commands and the per-type worker_env_commands run on both chia up and chia down — they are for environment activation. Use the *_setup_commands hooks for one-time setup. When head_ip is also listed in a node type’s compatible_ips (so the head also hosts a worker) and that worker isn’t containerized, CHIA skips ray stop on the worker so it doesn’t kill the head’s Ray process.
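The split between activation, one-time setup, and teardown might be laid out as in the following sketch. The results directory and the archive command are hypothetical; the comments restate which of chia up and chia down runs each hook, per the sequences above.

```yaml
head_env_commands:
    - source ~/.bashrc && conda activate chia_env    # chia up and chia down

head_setup_commands:
    - mkdir -p ~/chia_results                        # chia up only

head_teardown_commands:
    - tar czf ~/chia_results.tgz -C ~ chia_results   # chia down only, before ray stop
```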