Chaos engineering toolkit for local processes, Docker, Kubernetes, and SSH. Inject faults, watch monitors, assert recovery, roll back automatically.
```sh
go install github.com/burntcarrot/rattle/cmd/rattle@latest
```

Or from source:

```sh
git clone https://github.com/burntcarrot/rattle
cd rattle
go build -o rattle ./cmd/rattle
```

### postgres-cpu-congestion.yaml
```yaml
apiVersion: rattle/v1
name: postgres-cpu-congestion
env: local

on:
  failure: abort

define:
  envs:
    db:
      type: docker
      container: postgres-local

steps:
  - name: postgres is reachable
    check: tcp
    addr: localhost:5432

  - name: pool is quiet before we start
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'"
    expect: "< 5"

  - name: stress table is seeded
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM cpu_stress_test"
    expect: "> 0"

  - name: simple query responds fast before flood
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT 1"
    timeout: 300ms

  - name: flood with expensive cross-join queries
    inject: exec-loop
    env: db
    cmd: psql
    args: [-U, postgres, -c, "SELECT COUNT(*) FROM cpu_stress_test a JOIN cpu_stress_test b ON a.id != b.id WHERE a.id < 1000 AND b.id < 1000"]
    concurrency: 30
    on_error: log
    for: 60s

  - name: watch simple queries for delays under saturation
    watch: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT 1"
    timeout: 300ms
    every: 4s

  - name: pool drains after flood stops
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'"
    expect: "< 5"
    role: recovery
    retry:
      attempts: 15
      delay: 3s
```

```sh
rattle run scenario.yaml
```

### Output
```
time=... level=INFO msg="hypothesis passed"
time=... level=INFO msg="applying fault" fault="flood with expensive cross-join queries"
time=... level=WARN msg="monitor check failed" monitor="watch simple queries for delays under saturation" err="sql probe: query exceeded timeout 300ms (took 2.107s)"
....
time=... level=WARN msg="monitor check failed" monitor="watch simple queries for delays under saturation" err="sql probe: query exceeded timeout 300ms (took 10.722s)"
....
time=... level=INFO msg="fault complete" fault="flood with expensive cross-join queries" duration=1m0s
time=... level=INFO msg="rolling back" fault="flood with expensive cross-join queries"

Scenario: postgres-cpu-congestion
Status: pass
Duration: 1m9.220312458s

Hypothesis:
  ✓ postgres is reachable (0.0s)
  ✓ pool is quiet before we start (0.3s)
  ✓ stress table is seeded (0.2s)
  ✓ simple query responds fast before flood (0.2s)

Faults:
  ✓ flood with expensive cross-join queries (60.0s)

Monitors:
  ✗ watch simple queries for delays under saturation (60.0s) checks=13 failures=13 failure_rate=1.00
    last error: sql probe: query exceeded timeout 300ms (took 10.722s)

Recovery:
  ✓ pool drains after flood stops (8.5s)
```
### Top-level fields
```yaml
apiVersion: rattle/v1   # optional
name: string
description: string
env: local              # default environment; string shorthand or inline object
on:
  failure: abort        # abort (default) | continue
schedule:
  mode: sequential | parallel | staggered
  delay: 5s             # staggered only
define:
  probes: { ... }       # reusable probe templates
  envs: { ... }         # named environment configs
steps: [ ... ]
```

### Step discriminators
Every step has exactly one of:

| Key | Phase | Description |
|---|---|---|
| `check:` | hypothesis / recovery | Assert a condition passes. Add `role: recovery` to run after faults. |
| `inject:` | fault | Inject a fault. Rolls back automatically when `for:` expires or the scenario ends. |
| `watch:` | monitor | Sample a probe throughout fault injection. Never aborts the scenario. |
| `from: probes.<name>` | any | Instantiate a defined probe template. Override any param inline. |
### Common step fields

```yaml
- name: label
  env: db          # override environment for this step
  retry:           # check: and from: only
    attempts: 5
    delay: 2s
  role: recovery   # check: and from: only
```

`inject:` additionally accepts `for: 60s` (how long to hold the fault) and `env:` for the target environment.
`watch:` additionally requires `every: 5s`.
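Putting the common fields together, a recovery check that retries until the system settles might look like this (the URL and values are illustrative, not from the example above):

```yaml
- name: service is healthy again
  check: http
  url: http://localhost:8080/health
  role: recovery
  retry:
    attempts: 5
    delay: 2s
```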
### http

```yaml
check: http
url: http://localhost:8080/health   # required
method: GET
status: [200, 204]                  # default [200]
body: "ok"                          # assert body contains string
latency: 500ms                      # fail if response time exceeds this
timeout: 10s                        # request deadline, default 10s
headers:
  authorization: "Bearer token"
```

### tcp
```yaml
check: tcp
addr: localhost:5432   # required (host:port)
timeout: 5s
```

### dns
```yaml
check: dns
host: api.example.com
resolver: 8.8.8.8:53    # optional custom resolver
expect_ips: [1.2.3.4]   # optional: assert these IPs appear
```

### exec
```yaml
check: exec
cmd: curl
args: [-sf, http://localhost:8080/health]
expect_exit_code: 0
expect_stdout_contains: "healthy"
```

Runs via `env.Exec` when an environment is set; falls back to the local shell.
### sql: Postgres or MySQL via CLI

```yaml
check: sql
driver: postgres   # postgres | mysql (inferred from dsn prefix if omitted)
dsn: postgres://user:pass@localhost:5432/mydb
query: "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"
expect: "> 0"      # numeric comparison or exact string match
timeout: 300ms     # fail if query exceeds this duration
```

Supported operators: `>`, `<`, `>=`, `<=`, `=`, `!=`. Runs `psql`/`mysql` via the step's environment; works inside containers, pods, and over SSH.
### redis

```yaml
check: redis
addr: localhost:6379
cmd: PING
expect: PONG

# numeric comparisons work too
cmd: SCARD myset
expect: "> 0"
```

Runs `redis-cli` via the step's environment.
### prometheus-metric

```yaml
check: prometheus-metric
url: http://localhost:9090               # required: Prometheus base URL
query: 'rate(http_requests_total[1m])'   # required: PromQL instant query
operator: gt                             # required: gt | lt | eq | gte | lte
threshold: 0                             # value to compare against
```

### exec-loop
```yaml
inject: exec-loop
cmd: pkill
args: [-f, worker]
concurrency: 1     # parallel workers (default 1)
interval: 5s       # delay between iterations (omit for no pause)
on_error: ignore   # ignore | log | abort (default ignore)
for: 60s
```

### process-kill
```yaml
inject: process-kill
target: myapp      # process name or PID
signal: SIGTERM    # default SIGTERM
match_all: false   # kill all matching processes (default: first match)
probability: 1.0   # 0.0–1.0 injection probability
for: 30s
```

### network-latency
```yaml
inject: network-latency
delay: 300ms      # required
jitter: 30ms
interface: eth0   # auto-detected if omitted
for: 60s
```

Uses tc-netem on Linux; falls back to a TCP proxy on other platforms.
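On Linux, the parameters above translate to a tc-netem qdisc roughly like the following (rattle's exact invocation may differ; the interface name is illustrative):

```sh
# approximate tc-netem equivalent of delay: 300ms, jitter: 30ms
tc qdisc add dev eth0 root netem delay 300ms 30ms

# rollback: remove the qdisc
tc qdisc del dev eth0 root
```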
### network-loss

```yaml
inject: network-loss
loss_percent: 10
interface: eth0
for: 30s
```

### network-corruption
```yaml
inject: network-corruption
corrupt_percent: 5
interface: eth0
for: 30s
```

### http-fault: proxy-based HTTP fault injection
```yaml
inject: http-fault
listen_port: 9090
upstream_addr: http://upstream:8080
delay: 200ms
jitter: 50ms
status_code: 503
match_path: ^/api/
match_method: POST
inject_headers:
  x-chaos: "true"
response_headers:
  retry-after: "5"
for: 60s
```

### cpu-stress
```yaml
inject: cpu-stress
load: 80                     # target CPU load percent (0–100)
cores: 4                     # number of cores to stress (default: all)
stress_ng_method: matrixfp   # optional: stress-ng cpu method
affinity: [0, 1]             # optional: pin to these CPU cores
for: 30s
```

### memory-stress
```yaml
inject: memory-stress
bytes: 536870912           # bytes to allocate (536870912 = 512 MiB)
percentage: 0              # alternative: percent of available memory (overrides bytes if > 0)
leak: false                # keep allocated memory without freeing it
allocation_size: 4194304   # chunk size per allocation in bytes (default 4 MiB)
rate: 0                    # bytes/s allocation rate (0 = as fast as possible)
for: 60s
```

### disk-fill
```yaml
inject: disk-fill
path: /tmp
fill_bytes: 10737418240   # bytes to write (10737418240 = 10 GiB)
block_size: 4194304       # write block size in bytes (default 4 MiB)
fsync: false
fill_rate: 0              # bytes/s write rate (0 = unlimited)
file_name: rattle-fill    # filename in path (default: random)
for: 60s
```

### disk-io-stress
```yaml
inject: disk-io-stress
path: /tmp
block_size: 4096
fsync: true
for: 30s
```

### time-skew
```yaml
inject: time-skew
process: myapp   # process name or PID (required)
skew: 3600s      # time offset to inject (requires libfaketime)
for: 30s
```

### http-flood
```yaml
inject: http-flood
url: http://localhost:8080/api
method: GET
concurrency: 50
for: 30s
```

Set a default environment at the top level; override per-step with `env:`.
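For example, a scenario might default to the host but point a single fault at a container (the container name here is illustrative):

```yaml
env: local              # default for all steps

steps:
  - name: slow the database container
    inject: network-latency
    env:
      type: docker
      container: mydb   # hypothetical container name
    delay: 200ms
    for: 30s
```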
### Environment types

`local` runs commands on the host (default when `env:` is omitted):

```yaml
env: local
```

#### docker

```yaml
env:
  type: docker
  container: myapp   # name or ID
```

#### kubernetes
```yaml
env:
  type: kubernetes
  namespace: default
  pod: myapp-abc123
  container: app   # optional
```

#### ssh
```yaml
env:
  type: ssh
  host: 10.0.0.5
  user: ubuntu
  key: ~/.ssh/id_rsa
```

Named environments can be defined once under `define.envs` and referenced by name:

```yaml
define:
  envs:
    db:
      type: docker
      container: postgres-1

steps:
  - inject: network-latency
    env: db
    delay: 300ms
    for: 30s
```

### Probe templates
Define probe templates once under `define.probes` and instantiate them with `from:`. Any param can be overridden inline.

```yaml
define:
  probes:
    api-health:
      check: http
      url: http://localhost:8080/health
      status: [200]
      latency: 500ms

steps:
  - from: probes.api-health   # hypothesis

  - from: probes.api-health   # override a param inline
    url: http://localhost:8080/readyz

  - from: probes.api-health   # recovery with retries
    role: recovery
    retry:
      attempts: 10
      delay: 5s
```

### Schedule modes
```yaml
schedule:
  mode: parallel   # all faults start simultaneously
```

```yaml
schedule:
  mode: staggered
  delay: 10s       # gap between faults
```

### Manual rollback
rattle writes a state file before each fault is applied. If the process is interrupted before rollback completes, inspect pending rollbacks manually:

```sh
rattle cleanup ./.rattle-state.json
```

The output lists each pending rollback with the exact command needed to undo it.
### Run a scenario

```go
import rattle "github.com/burntcarrot/rattle"

// Load from YAML
sc, err := rattle.LoadFile("scenario.yaml")

// Build programmatically
sc := rattle.Build("my-scenario").
	Hypothesis(rattle.ProbeStep{ ... }).
	WithFault(myFault, 30*time.Second, rattle.Params{}).
	OnFailure(rattle.FailureContinue).
	Scenario()

report, err := rattle.Run(ctx, sc,
	rattle.WithLogger(myLogger),
	rattle.WithStateFile("./.rattle-state.json"),
)
```

### Custom fault
```go
type MyFault struct{}

func (f *MyFault) Name() string { return "my-fault" }

func (f *MyFault) Apply(ctx context.Context, env rattle.Environment, p rattle.Params) (rattle.Rollback, error) {
	// inject
	return func(ctx context.Context) error {
		// undo
		return nil
	}, nil
}

func init() {
	rattle.Faults.Register("my-fault", func() rattle.Fault { return &MyFault{} })
}
```

### Custom probe
```go
type MyProbe struct{}

func (p *MyProbe) Name() string { return "my-probe" }

func (p *MyProbe) Check(ctx context.Context, env rattle.Environment, params rattle.Params) error {
	return nil // nil = pass
}

func init() {
	rattle.Probes.Register("my-probe", func() rattle.Probe { return &MyProbe{} })
}
```

### Custom environment
```go
type MyEnv struct{}

func (e *MyEnv) Name() string { return "my-env" }
func (e *MyEnv) Exec(ctx context.Context, cmd string, args ...string) ([]byte, error) { ... }
func (e *MyEnv) Close() error { return nil }
func (e *MyEnv) Init(cfg map[string]any) error { return nil } // optional

func init() {
	rattle.Envs.Register("my-env", func() rattle.Environment { return &MyEnv{} })
}
```

```sh
rattle run <scenario.yaml> [flags]
rattle validate <scenario.yaml>
rattle list <faults|probes|envs>
rattle cleanup [state-file]
```
### Flag reference

#### run

| Flag | Description |
|---|---|
| `--output` | Output format: `console` (default) or `json` |
| `--state-file` | Path to state file for rollback tracking |
| `--dry-run` | Print scenario summary and exit without running |
#### cleanup

Takes an optional positional argument: the path to the state file (default: `.rattle-state.json` in the current directory). Prints pending rollback instructions; it does not execute them automatically.
MIT
rattle is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.
Chaos engineering injects real faults into real systems. Running rattle against a production environment without proper safeguards can cause data loss, outages, or irreversible system changes. You are solely responsible for ensuring that scenarios are safe to run in your environment.
