
rattle

Chaos engineering toolkit for local processes, Docker, Kubernetes, and SSH. Inject faults, watch monitors, assert recovery, roll back automatically.

Install

go install github.com/burntcarrot/rattle/cmd/rattle@latest

Or from source:

git clone https://github.com/burntcarrot/rattle
cd rattle
go build -o rattle ./cmd/rattle

Quickstart

1. Write a scenario

postgres-cpu-congestion.yaml
apiVersion: rattle/v1
name: postgres-cpu-congestion

env: local

on:
  failure: abort

define:
  envs:
    db:
      type: docker
      container: postgres-local

steps:
  - name: postgres is reachable
    check: tcp
    addr: localhost:5432

  - name: pool is quiet before we start
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'"
    expect: "< 5"

  - name: stress table is seeded
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM cpu_stress_test"
    expect: "> 0"

  - name: simple query responds fast before flood
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT 1"
    timeout: 300ms

  - name: flood with expensive cross-join queries
    inject: exec-loop
    env: db
    cmd: psql
    args: [-U, postgres, -c, "SELECT COUNT(*) FROM cpu_stress_test a JOIN cpu_stress_test b ON a.id != b.id WHERE a.id < 1000 AND b.id < 1000"]
    concurrency: 30
    on_error: log
    for: 60s

  - name: watch simple queries for delays under saturation
    watch: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT 1"
    timeout: 300ms
    every: 4s

  - name: pool drains after flood stops
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'"
    expect: "< 5"
    role: recovery
    retry:
      attempts: 15
      delay: 3s

2. Run it

rattle run postgres-cpu-congestion.yaml
Output
time=... level=INFO msg="hypothesis passed"
time=... level=INFO msg="applying fault" fault="flood with expensive cross-join queries"
time=... level=WARN msg="monitor check failed" monitor="watch simple queries for delays under saturation" err="sql probe: query exceeded timeout 300ms (took 2.107s)"
....
time=... level=WARN msg="monitor check failed" monitor="watch simple queries for delays under saturation" err="sql probe: query exceeded timeout 300ms (took 10.722s)"
....
time=... level=INFO msg="fault complete" fault="flood with expensive cross-join queries" duration=1m0s
time=... level=INFO msg="rolling back" fault="flood with expensive cross-join queries"

Scenario: postgres-cpu-congestion
Status:   pass
Duration: 1m9.220312458s

Hypothesis:
  ✓ postgres is reachable (0.0s)
  ✓ pool is quiet before we start (0.3s)
  ✓ stress table is seeded (0.2s)
  ✓ simple query responds fast before flood (0.2s)

Faults:
  ✓ flood with expensive cross-join queries (60.0s)

Monitors:
  ✗ watch simple queries for delays under saturation (60.0s) checks=13 failures=13 failure_rate=1.00
      last error: sql probe: query exceeded timeout 300ms (took 10.722s)

Recovery:
  ✓ pool drains after flood stops (8.5s)

Schema

Top-level fields
apiVersion: rattle/v1    # optional
name: string
description: string

env: local               # default environment, string shorthand or inline object

on:
  failure: abort         # abort (default) | continue

schedule:
  mode: sequential | parallel | staggered
  delay: 5s              # staggered only

define:
  probes: { ... }        # reusable probe templates
  envs:   { ... }        # named environment configs

steps: [ ... ]
Step discriminators

Every step has exactly one of:

Key                   Phase                  Description
check:                hypothesis / recovery  Assert a condition passes. Add role: recovery to run after faults.
inject:               fault                  Inject a fault. Rolls back automatically when for: expires or the scenario ends.
watch:                monitor                Sample a probe throughout fault injection. Never aborts the scenario.
from: probes.<name>   any                    Instantiate a defined probe template. Override any param inline.
Common step fields
- name: label
  env: db           # override environment for this step
  retry:            # check: and from: only
    attempts: 5
    delay: 2s
  role: recovery    # check: and from: only

inject: additionally accepts for: 60s (how long to hold the fault) and env: for the target environment.

watch: additionally requires every: 5s.
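
The four step kinds can be sketched together in one fragment. This is illustrative only — the names, port, and the api-health template are placeholders, and every field used is documented elsewhere in this README:

# Sketch: one step of each kind (hypothetical names and port)
define:
  probes:
    api-health:
      check: http
      url: http://localhost:8080/health

steps:
  - name: service is up            # hypothesis: asserted before faults
    check: tcp
    addr: localhost:8080

  - name: add latency              # fault: rolled back when for: expires
    inject: network-latency
    delay: 200ms
    for: 30s

  - name: health stays responsive  # monitor: sampled during injection
    watch: http
    url: http://localhost:8080/health
    every: 5s

  - from: probes.api-health        # template instantiated as recovery
    role: recovery
    retry:
      attempts: 5
      delay: 2s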

Probes

http
check: http
url: http://localhost:8080/health   # required
method: GET
status: [200, 204]                  # default [200]
body: "ok"                          # assert body contains string
latency: 500ms                      # fail if response time exceeds this
timeout: 10s                        # request deadline, default 10s
headers:
  authorization: "Bearer token"
tcp
check: tcp
addr: localhost:5432   # required (host:port)
timeout: 5s
dns
check: dns
host: api.example.com
resolver: 8.8.8.8:53       # optional custom resolver
expect_ips: [1.2.3.4]      # optional: assert these IPs appear
exec
check: exec
cmd: curl
args: [-sf, http://localhost:8080/health]
expect_exit_code: 0
expect_stdout_contains: "healthy"

Runs via env.Exec when an environment is set; falls back to the local shell.
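
For instance, an exec check can target a named Docker environment so the command runs inside the container. A sketch — the container name and command are illustrative:

# Illustrative: pg_isready runs inside the db container via env.Exec
define:
  envs:
    db:
      type: docker
      container: postgres-local

steps:
  - name: postgres accepts connections
    check: exec
    env: db
    cmd: pg_isready
    args: [-U, postgres]
    expect_exit_code: 0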

sql: Postgres or MySQL via CLI
check: sql
driver: postgres           # postgres | mysql (inferred from dsn prefix if omitted)
dsn: postgres://user:pass@localhost:5432/mydb
query: "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"
expect: "> 0"              # numeric comparison or exact string match
timeout: 300ms             # fail if query exceeds this duration

Supported operators: >, <, >=, <=, =, !=. Runs psql/mysql via the step's environment; works inside containers, pods, and over SSH.
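
Because the driver is inferred from the DSN prefix, a MySQL probe needs only a mysql:// DSN. A sketch — credentials, schema, and the query are placeholders:

check: sql
dsn: mysql://user:pass@localhost:3306/mydb   # driver: mysql inferred from prefix
query: "SELECT count(*) FROM orders WHERE status = 'pending'"
expect: "<= 100"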

redis
check: redis
addr: localhost:6379
cmd: PING
expect: PONG

# numeric comparisons work too
cmd: SCARD myset
expect: "> 0"

Runs redis-cli via the step's environment.

prometheus-metric
check: prometheus-metric
url: http://localhost:9090                 # required: Prometheus base URL
query: 'rate(http_requests_total[1m])'     # required: PromQL instant query
operator: gt                               # required: gt | lt | eq | gte | lte
threshold: 0                               # value to compare against

Faults

exec-loop
inject: exec-loop
cmd: pkill
args: [-f, worker]
concurrency: 1     # parallel workers (default 1)
interval: 5s       # delay between iterations (omit for no pause)
on_error: ignore   # ignore | log | abort (default ignore)
for: 60s
process-kill
inject: process-kill
target: myapp        # process name or PID
signal: SIGTERM      # default SIGTERM
match_all: false     # kill all matching processes (default: first match)
probability: 1.0     # 0.0–1.0 injection probability
for: 30s
network-latency
inject: network-latency
delay: 300ms       # required
jitter: 30ms
interface: eth0    # auto-detected if omitted
for: 60s

Uses tc-netem on Linux; falls back to a TCP proxy on other platforms.

network-loss
inject: network-loss
loss_percent: 10
interface: eth0
for: 30s
network-corruption
inject: network-corruption
corrupt_percent: 5
interface: eth0
for: 30s
http-fault: proxy-based HTTP fault injection
inject: http-fault
listen_port: 9090
upstream_addr: http://upstream:8080
delay: 200ms
jitter: 50ms
status_code: 503
match_path: ^/api/
match_method: POST
inject_headers:
  x-chaos: "true"
response_headers:
  retry-after: "5"
for: 60s
cpu-stress
inject: cpu-stress
load: 80                     # target CPU load percent (0–100)
cores: 4                     # number of cores to stress (default: all)
stress_ng_method: matrixfp   # optional: stress-ng cpu method
affinity: [0, 1]             # optional: pin to these CPU cores
for: 30s
memory-stress
inject: memory-stress
bytes: 536870912          # bytes to allocate (536870912 = 512 MiB)
percentage: 0             # alternative: percent of available memory (overrides bytes if > 0)
leak: false               # keep allocated without freeing
allocation_size: 4194304  # chunk size per allocation in bytes (default 4 MiB)
rate: 0                   # bytes/s allocation rate (0 = as fast as possible)
for: 60s
disk-fill
inject: disk-fill
path: /tmp
fill_bytes: 10737418240   # bytes to write (10737418240 = 10 GiB)
block_size: 4194304       # write block size in bytes (default 4 MiB)
fsync: false
fill_rate: 0              # bytes/s write rate (0 = unlimited)
file_name: rattle-fill    # filename in path (default: random)
for: 60s
disk-io-stress
inject: disk-io-stress
path: /tmp
block_size: 4096
fsync: true
for: 30s
time-skew
inject: time-skew
process: myapp   # process name or PID (required)
skew: 3600s      # time offset to inject (requires libfaketime)
for: 30s
http-flood
inject: http-flood
url: http://localhost:8080/api
method: GET
concurrency: 50
for: 30s

Environments

Set a default at the top level; override per-step with env:.

Environment types

local runs commands on the host (default when env: is omitted)

env: local

docker

env:
  type: docker
  container: myapp    # name or ID

kubernetes

env:
  type: kubernetes
  namespace: default
  pod: myapp-abc123
  container: app      # optional

ssh

env:
  type: ssh
  host: 10.0.0.5
  user: ubuntu
  key: ~/.ssh/id_rsa

Named environments can be defined once under define.envs and referenced by name:

define:
  envs:
    db:
      type: docker
      container: postgres-1

steps:
  - inject: network-latency
    env: db
    delay: 300ms
    for: 30s

Reusable templates

Probe templates

Define probe templates once under define.probes and instantiate them with from:. Any param can be overridden inline.

define:
  probes:
    api-health:
      check: http
      url: http://localhost:8080/health
      status: [200]
      latency: 500ms

steps:
  - from: probes.api-health           # hypothesis

  - from: probes.api-health           # override a param inline
    url: http://localhost:8080/readyz

  - from: probes.api-health           # recovery with retries
    role: recovery
    retry:
      attempts: 10
      delay: 5s

Parallel and staggered faults

Schedule modes
schedule:
  mode: parallel     # all faults start simultaneously

schedule:
  mode: staggered
  delay: 10s         # gap between faults
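
As a sketch (the fault choices are illustrative), a staggered schedule starts each fault delay: after the previous one:

schedule:
  mode: staggered
  delay: 10s

steps:
  - inject: cpu-stress
    load: 70
    for: 60s

  - inject: network-latency    # starts 10s after cpu-stress begins
    delay: 200ms
    for: 60s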

Crash recovery

Manual rollback

rattle writes a state file before each fault is applied. If the process is interrupted before rollback completes, inspect pending rollbacks manually:

rattle cleanup ./.rattle-state.json

The output lists each pending rollback with the exact command needed to undo it.

Go API

Run a scenario
import rattle "github.com/burntcarrot/rattle"

// Load from YAML
sc, err := rattle.LoadFile("scenario.yaml")

// Build programmatically
sc := rattle.Build("my-scenario").
    Hypothesis(rattle.ProbeStep{ ... }).
    WithFault(myFault, 30*time.Second, rattle.Params{}).
    OnFailure(rattle.FailureContinue).
    Scenario()

report, err := rattle.Run(ctx, sc,
    rattle.WithLogger(myLogger),
    rattle.WithStateFile("./.rattle-state.json"),
)
Custom fault
type MyFault struct{}

func (f *MyFault) Name() string { return "my-fault" }
func (f *MyFault) Apply(ctx context.Context, env rattle.Environment, p rattle.Params) (rattle.Rollback, error) {
    // inject
    return func(ctx context.Context) error {
        // undo
        return nil
    }, nil
}

func init() {
    rattle.Faults.Register("my-fault", func() rattle.Fault { return &MyFault{} })
}
Custom probe
type MyProbe struct{}

func (p *MyProbe) Name() string { return "my-probe" }
func (p *MyProbe) Check(ctx context.Context, env rattle.Environment, params rattle.Params) error {
    return nil // nil = pass
}

func init() {
    rattle.Probes.Register("my-probe", func() rattle.Probe { return &MyProbe{} })
}
Custom environment
type MyEnv struct{}

func (e *MyEnv) Name() string                                                         { return "my-env" }
func (e *MyEnv) Exec(ctx context.Context, cmd string, args ...string) ([]byte, error) { ... }
func (e *MyEnv) Close() error                                                         { return nil }
func (e *MyEnv) Init(cfg map[string]any) error                                        { return nil } // optional

func init() {
    rattle.Envs.Register("my-env", func() rattle.Environment { return &MyEnv{} })
}

CLI reference

rattle run      <scenario.yaml> [flags]
rattle validate <scenario.yaml>
rattle list     <faults|probes|envs>
rattle cleanup  [state-file]
Flag reference

run

Flag          Description
--output      Output format: console (default) or json
--state-file  Path to state file for rollback tracking
--dry-run     Print scenario summary and exit without running

cleanup

Takes an optional positional argument: path to the state file (default: .rattle-state.json in the current directory). Prints pending rollback instructions; does not execute them automatically.

License

MIT

Disclaimer

rattle is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.

Chaos engineering injects real faults into real systems. Running rattle against a production environment without proper safeguards can cause data loss, outages, or irreversible system changes. You are solely responsible for ensuring that scenarios are safe to run in your environment.
