Circuit Breakers
Wallfacer uses circuit breakers to automatically pause automation when something goes wrong, preventing cascading failures. They self-heal once the problem resolves — no manual intervention required.
There are two independent systems: watcher breakers (per-automation) and the container breaker (runtime-level).
Watcher Breakers
Each automation watcher (Promote, Retry, Test, Submit, Catch Up, Refine) has its own circuit breaker. When a watcher encounters repeated errors, its breaker opens and suppresses that watcher while leaving all others running.
What you see
The header shows a Circuit Breakers section. When all watchers are healthy, it displays a green dot with "All healthy." When a breaker trips, a red indicator appears showing:
- The affected watcher name (e.g. "Promote", "Retry")
- How many consecutive failures occurred
- A countdown until the next retry attempt
- The error reason (hover for details)
What triggers a watcher breaker
A breaker opens when the watcher's internal logic encounters an error during its scan-and-act cycle. For example:
- The auto-promoter fails to launch a container
- The auto-retrier encounters an unexpected store error
- The auto-submitter fails to commit changes
How it recovers
Watcher breakers use exponential backoff:
| Failure # | Cooldown |
|---|---|
| 1st | 30 seconds |
| 2nd | 1 minute |
| 3rd | 2 minutes |
| 4th | 4 minutes |
| 5th+ | 5 minutes (cap) |
After each cooldown, the watcher tries again. A single success resets the breaker completely — the failure counter goes back to zero and the watcher resumes normal operation.
What still works
Watcher breakers only affect automated actions. You can still:
- Manually promote tasks from the board
- Manually trigger refinement or testing
- Use all other UI features normally
Automation toggles (Autopilot, Auto-test, etc.) remain independent of breaker state. The breaker is a transient safety layer, not a configuration change.
Container Breaker
The container breaker protects against the container runtime (Docker or Podman) being unavailable. If the runtime crashes, gets uninstalled, or the daemon stops, wallfacer stops trying to launch containers until the runtime recovers.
What triggers it
The container breaker opens after 5 consecutive runtime errors (configurable). Only true runtime failures count:
- Exit code 125 (container engine failure)
- Connection refused (daemon not running)
- Binary not found
Normal agent failures (exit codes 1–124) do not trip the breaker.
How it recovers
The container breaker uses a three-state model:
- Closed (normal): All container launches proceed.
- Open (tripped): All launches blocked for 30 seconds.
- Half-open (probing): One launch is allowed as a probe.
- If the probe succeeds → circuit closes, operations resume.
- If the probe fails → circuit reopens for another 30 seconds.
What you see
The container breaker is not directly shown in the UI. Its effects are visible as:
- Auto-promote stops picking up new tasks
- Auto-retry stops retrying tasks that failed due to container crashes
- The board appears to "pause" temporarily
Once the container runtime is available again, the next probe succeeds and everything resumes automatically.
The container breaker state is available via GET /api/debug/runtime
(the container_circuit field) for monitoring.
Configuration
Both breakers are configurable via process-level environment variables (set in your shell, systemd unit, or container environment):
| Variable | Default | Description |
|---|---|---|
WALLFACER_CONTAINER_CB_THRESHOLD |
5 | Consecutive runtime failures before the container breaker opens |
WALLFACER_CONTAINER_CB_OPEN_SECONDS |
30 | How long the container breaker stays open before probing |
Watcher breaker timing (30s base, 5min cap) is not currently configurable.
Design Rationale
Circuit breakers exist because wallfacer runs automation continuously in the background. Without them, a transient failure (e.g. Docker daemon restarting) would cause every watcher to spam failed attempts, flood logs, and potentially exhaust resources. The breakers provide automatic backoff and recovery with no user action needed.
Key design choices:
- Per-watcher isolation: A failure in auto-test does not affect auto-promote. Each watcher recovers independently.
- User actions bypass breakers: Manual operations always work, even when automation is paused.
- No manual reset: Breakers self-heal. The exponential backoff ensures recovery within 5 minutes at most. A manual reset could cause re-triggering of the same failure.
- Automatic, not configurable toggles: Breakers are a safety mechanism, not a user preference. They open and close based on runtime conditions.