Caching and Bypass
CHIA’s cache mechanism persists a function’s real output to disk, and
bypass replaces a function’s computation with pre-recorded data while
still dispatching the call through Ray. Together they let reruns skip redundant, costly work, make cluster orchestration easy to test, and improve fault tolerance by storing in-progress results. Both are built around ChiaFunction,
keyed by a per-call tag (_chia_tag), and configured in the same YAML
file passed to your loop.
At a high level, the cache is the write path (populate from a real run) and bypass is the read path (replay it). Both use the same tag, so a value written under a tag is replayed for calls that carry that tag.
Bypass
When a function is bypassed it is still dispatched according to its resource requirements, but the real computation is replaced by a provider that returns pre-recorded data. This lets you test cluster orchestration without paying for the underlying work (builds, simulations, LLM calls, …).
Configuration
List functions under a bypass: section. bypass: true opts a function in;
an optional tags: list (regex patterns) restricts it to matching calls; an
optional data: path supplies a file to serve.
bypass:
  # Bypass every call to this function.
  simple_add:
    bypass: true
  # Bypass only calls whose _chia_tag matches one of these patterns.
  run_verilator_test:
    bypass: true
    tags: ["iter0_.*"]
  # Shorthand: bypass: true with no extra options.
  simple_multiply: true
  summarize_perf:
    bypass: true
    data: /path/to/recorded_perf.md  # served to the worker as a string
A function not listed (or bypass: false) runs normally. If no YAML is loaded
at all, bypass is a complete no-op.
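To make the matching rules concrete, here is a minimal sketch in plain Python of how an entry in the bypass: section could be normalized and checked against a call's tag. It is not CHIA's implementation: the helper names normalize_entry and tag_selected are hypothetical, and the use of re.fullmatch (rather than re.search) is an assumption about how the patterns are applied.

```python
import re

def normalize_entry(entry):
    """Expand the `name: true` shorthand into the full mapping form."""
    if isinstance(entry, bool):
        return {"bypass": entry, "tags": None, "data": None}
    return {
        "bypass": bool(entry.get("bypass", False)),
        "tags": entry.get("tags"),
        "data": entry.get("data"),
    }

def tag_selected(config, func_name, tag):
    """Return True if the config opts this call in (flag and tag patterns only)."""
    if func_name not in config:
        return False              # unlisted functions run normally
    entry = normalize_entry(config[func_name])
    if not entry["bypass"]:
        return False
    patterns = entry["tags"]
    if not patterns:
        return True               # no tags: list means every call matches
    # Assumption: a pattern must match the whole tag.
    return any(re.fullmatch(p, tag or "") for p in patterns)

# The parsed contents of a bypass: section like the one above.
config = {
    "simple_add": {"bypass": True},
    "run_verilator_test": {"bypass": True, "tags": ["iter0_.*"]},
    "simple_multiply": True,
}
```

Under these assumptions, a call to run_verilator_test tagged "iter0_a" is selected, while one tagged "iter1_a" runs normally.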
Setup and providers
Create one Bypass instance as part of loop setup and register a provider
for each bypassed function. The provider runs on the worker (with the same
scheduling as the real function) and returns the replacement value:
from chia.base.bypass import Bypass
from chia.base.ChiaFunction import ChiaFunction, get

# If yaml_path is None, bypass does nothing.
bypass = Bypass(yaml_path=args.bypass_config)

# Required provider signature:
#   tag: the function's _chia_tag
#   data_path: the YAML "data:" path, or None
#   *args, **kwargs: the original call's arguments
def provider_42(tag, data_path, *args, **kwargs):
    return 42

# Remote ChiaFunction
@ChiaFunction(resources={"adder": 1})
def simple_add(x, y):
    return x + y

bypass.set_provider("simple_add", provider_42)

# Dispatched to a node with the "adder" resource,
# but returns 42 from the provider instead of running.
result = get(simple_add.chia_remote(1, 2))  # -> 42
A function is bypassed only when it is marked
bypass: true and has either a registered provider or a data: path.
Serving a file instead of a provider
If you give a function a data: path but no provider, CHIA serves that file’s
contents as the result. The read is routed through a Ray actor pinned to the
node that constructed the Bypass (the head), so workers on other nodes can
read it without a shared filesystem:
bypass:
  summarize_perf:
    bypass: true
    data: /path/to/recorded_perf.md  # served to the worker as a string
Gating with a condition
Besides a provider, a function may register a condition — an extra gate that
decides, at dispatch time, whether the bypass actually happens. It runs on the
caller as the last step of the bypass decision (after the bypass: true
flag, the provider/data check, and the tag patterns) and returns a bool:
from chia.base.cache import get_active_cache
from chia.base.ChiaFunction import get

# cond(tag, data_path, *args, **kwargs) -> bool (same args as a provider)
def cache_hit(tag, data_path, *args, **kwargs):
    return get(get_active_cache().has.chia_remote(tag))

bypass.set_cond("run_verilator_test", cache_hit)
A falsy return means the call is not bypassed and runs for real; a truthy
return lets the bypass proceed. When no condition is registered the default is
True (no extra gate). Use it to make the decision depend on runtime
state, most usefully to only replay from the cache when the value is actually
present (see Putting it together: populate, then replay below).
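The order of checks described in this section can be sketched as a single function. This is an illustrative model in plain Python, not CHIA's code: should_bypass and its parameters are hypothetical names, the entry dict stands in for one function's parsed YAML entry, and full-match semantics for tag patterns is an assumption.

```python
import re

def should_bypass(entry, tag, provider=None, cond=None, args=(), kwargs=None):
    """Walk the four checks in the order described above."""
    kwargs = kwargs or {}
    # 1. The bypass: true flag.
    if not entry.get("bypass", False):
        return False
    # 2. A registered provider or a data: path must exist.
    data_path = entry.get("data")
    if provider is None and data_path is None:
        return False
    # 3. Tag patterns, if any (assumption: a pattern must match the whole tag).
    patterns = entry.get("tags")
    if patterns and not any(re.fullmatch(p, tag or "") for p in patterns):
        return False
    # 4. The condition runs last; with none registered the default is True.
    if cond is None:
        return True
    return bool(cond(tag, data_path, *args, **kwargs))

cached_tags = {"iter0_build"}          # stand-in for the cache's contents
entry = {"bypass": True, "tags": ["iter0_.*"]}

def provider(tag, data_path, *args, **kwargs):
    return "replayed"

def cache_hit(tag, data_path, *args, **kwargs):
    return tag in cached_tags
```

With this setup, only the call tagged "iter0_build" is bypassed: "iter0_sim" passes the tag patterns but fails the condition, and "iter1_build" is rejected by the patterns before the condition is consulted.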
Cache
The cache is an LRU key/value store of pickled (tag, data) files on disk that warm-starts by scanning the cache directory, so cached values survive across loop runs.
Since the cache is implemented as a remote Ray actor located on the head node, it must be accessed using chia_remote.
Start it once on the driver after ray.init():
from chia.base.cache import start_cache, get_active_cache
from chia.base.ChiaFunction import get

start_cache(size=4, units="GB",
            cache_dir_path="/data/chia_cache",
            yaml_path="path/to/yaml.yaml")

# Access via chia_remote and get
tag = "iter0"  # example tag
cache_hit = get(get_active_cache().has.chia_remote(tag))
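For intuition about what an LRU store of pickled (tag, data) files with a warm start involves, here is a toy, single-process sketch. It is not CHIA's implementation: in CHIA the cache is a Ray actor reached through chia_remote, whereas TinyDiskCache is a hypothetical local class. The write method, the byte budget, and the file naming scheme are all assumptions; has and read only mirror the method names used in this document.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TinyDiskCache:
    """Toy LRU store of pickled (tag, data) files; not CHIA's implementation."""

    def __init__(self, cache_dir, max_bytes):
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.index = OrderedDict()      # tag -> (path, size), oldest first
        os.makedirs(cache_dir, exist_ok=True)
        # Warm start: rebuild the index from files left by an earlier run.
        # (Recency order is not preserved across runs in this sketch.)
        for name in sorted(os.listdir(cache_dir)):
            path = os.path.join(cache_dir, name)
            with open(path, "rb") as f:
                tag, _ = pickle.load(f)
            self.index[tag] = (path, os.path.getsize(path))

    def has(self, tag):
        return tag in self.index

    def read(self, tag):
        """Return (hit, value), marking the entry as most recently used."""
        if tag not in self.index:
            return False, None
        self.index.move_to_end(tag)
        with open(self.index[tag][0], "rb") as f:
            return True, pickle.load(f)[1]

    def _total_bytes(self):
        return sum(size for _, size in self.index.values())

    def write(self, tag, data):
        # Hypothetical naming: one file per tag, named by the tag's hex encoding.
        path = os.path.join(self.cache_dir, tag.encode().hex() + ".pkl")
        with open(path, "wb") as f:
            pickle.dump((tag, data), f)
        self.index[tag] = (path, os.path.getsize(path))
        self.index.move_to_end(tag)
        # Evict least recently used entries until the total fits the budget,
        # always keeping the entry that was just written.
        while self._total_bytes() > self.max_bytes and len(self.index) > 1:
            _, (old_path, _) = self.index.popitem(last=False)
            os.remove(old_path)

cache_dir = tempfile.mkdtemp()
first = TinyDiskCache(cache_dir, max_bytes=1 << 20)
first.write("iter0", {"cycles": 1234})

# A second instance over the same directory finds the value on disk.
second = TinyDiskCache(cache_dir, max_bytes=1 << 20)
hit, value = second.read("iter0")
```

The second instance plays the role of a later loop run: it never calls write, yet read("iter0") is a hit because the constructor rebuilt the index from the directory.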
Writing is automatic
In the YAML file, mark a function cache: true in the cache: section (same file as
bypass:). Its output is then written to the cache automatically, keyed by the call’s _chia_tag. An optional tags: list
restricts which calls are written.
cache:
  run_verilator_test:
    cache: true
    tags: ["iter.*"]  # only cache calls whose tag matches
  # shorthand (cache, no tag filter):
  build_megaboom: true
# With run_verilator_test marked cache: true, the real result is written under "iter0".
result = get(run_verilator_test.chia_remote(design, _chia_tag="iter0"))
Reading is manual (via bypass)
To replay a cached value, register a bypass provider that reads it back:
from chia.base.cache import get_active_cache
from chia.base.ChiaFunction import get

def cache_provider(tag, data_path, *args, **kwargs):
    hit, value = get(get_active_cache().read.chia_remote(tag))
    if not hit:
        raise KeyError(f"cache miss for tag {tag!r}")
    return value

bypass.set_provider("run_verilator_test", cache_provider)

# "iter0" was written on the real run above; this serves it without rerunning.
served = get(run_verilator_test.chia_remote(design, _chia_tag="iter0"))
Putting it together: populate, then replay
The common workflow is one real run that populates the cache, followed by later runs that replay it, so you can iterate on the loop and its orchestration without recomputing the expensive steps.
See examples/bypass_cache/bypass_cache_loop.py for a runnable version (run
it twice: cold then warm).
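The cold-then-warm behaviour can be modelled in a few lines of plain Python, with no Ray or CHIA involved. Everything here is a hypothetical stand-in: store plays the on-disk cache, the membership test plays the bypass condition, and dispatch and expensive_step are illustrative names rather than CHIA functions.

```python
calls = {"real": 0}
store = {}                       # stand-in for the on-disk cache

def expensive_step(design):
    calls["real"] += 1           # count how often the real work runs
    return f"report for {design}"

def dispatch(design, tag):
    """Replay from the store on a hit, otherwise run for real and record."""
    if tag in store:             # plays the role of the bypass condition
        return store[tag]        # plays the role of the cache-reading provider
    result = expensive_step(design)
    store[tag] = result          # plays the role of cache: true
    return result

cold = dispatch("megaboom", tag="iter0")   # real run, populates the store
warm = dispatch("megaboom", tag="iter0")   # replayed, no recomputation
```

Both calls return the same report, but the real step runs only once. A call with a new tag would miss the store and run for real, which is why gating the bypass on a cache hit keeps partially populated caches safe to replay.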