Chaos engineering toolkit for local processes, Docker, Kubernetes, and SSH. Inject faults, watch monitors, assert recovery, roll back automatically.
```sh
go install github.com/burntcarrot/rattle/cmd/rattle@latest
```

Or from source:

```sh
git clone https://github.com/burntcarrot/rattle
cd rattle
go build -o rattle ./cmd/rattle
```

### postgres-cpu-congestion.yaml
```yaml
apiVersion: rattle/v1
name: postgres-cpu-congestion
env: local

on:
  failure: abort

define:
  envs:
    db:
      type: docker
      container: postgres-local

steps:
  - name: postgres is reachable
    check: tcp
    addr: localhost:5432

  - name: pool is quiet before we start
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'"
    expect: "< 5"

  - name: stress table is seeded
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM cpu_stress_test"
    expect: "> 0"

  - name: simple query responds fast before flood
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT 1"
    timeout: 300ms

  - name: flood with expensive cross-join queries
    inject: exec-loop
    env: db
    cmd: psql
    args: [-U, postgres, -c, "SELECT COUNT(*) FROM cpu_stress_test a JOIN cpu_stress_test b ON a.id != b.id WHERE a.id < 1000 AND b.id < 1000"]
    concurrency: 30
    on_error: log
    for: 60s

  - name: watch simple queries for delays under saturation
    watch: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT 1"
    timeout: 300ms
    every: 4s

  - name: pool drains after flood stops
    check: sql
    env: db
    driver: postgres
    dsn: postgres://postgres:postgres@localhost/postgres
    query: "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'"
    expect: "< 5"
    role: recovery
    retry:
      attempts: 15
      delay: 3s
```

```sh
rattle run scenario.yaml
```

### Output
```
time=... level=INFO msg="hypothesis passed"
time=... level=INFO msg="applying fault" fault="flood with expensive cross-join queries"
time=... level=WARN msg="monitor check failed" monitor="watch simple queries for delays under saturation" err="sql probe: query exceeded timeout 300ms (took 2.107s)"
....
time=... level=WARN msg="monitor check failed" monitor="watch simple queries for delays under saturation" err="sql probe: query exceeded timeout 300ms (took 10.722s)"
....
time=... level=INFO msg="fault complete" fault="flood with expensive cross-join queries" duration=1m0s
time=... level=INFO msg="rolling back" fault="flood with expensive cross-join queries"

Scenario: postgres-cpu-congestion
Status: pass
Duration: 1m9.220312458s

Hypothesis:
  ✓ postgres is reachable (0.0s)
  ✓ pool is quiet before we start (0.3s)
  ✓ stress table is seeded (0.2s)
  ✓ simple query responds fast before flood (0.2s)

Faults:
  ✓ flood with expensive cross-join queries (60.0s)

Monitors:
  ✗ watch simple queries for delays under saturation (60.0s) checks=13 failures=13 failure_rate=1.00
    last error: sql probe: query exceeded timeout 300ms (took 10.722s)

Recovery:
  ✓ pool drains after flood stops (8.5s)
```
### Top-level fields
```yaml
apiVersion: rattle/v1   # optional
name: string
description: string
env: local              # default environment; string shorthand or inline object
on:
  failure: abort        # abort (default) | continue
schedule:
  mode: sequential | parallel | staggered
  delay: 5s             # staggered only
define:
  probes: { ... }       # reusable probe templates
  envs: { ... }         # named environment configs
steps: [ ... ]
```

### Step discriminators
Every step has exactly one of:

| Key | Phase | Description |
|---|---|---|
| `check:` | hypothesis / recovery | Assert a condition passes. Add `role: recovery` to run after faults. |
| `inject:` | fault | Inject a fault. Rolls back automatically when `for:` expires or the scenario ends. |
| `watch:` | monitor | Sample a probe throughout fault injection. Never aborts the scenario. |
| `from: probes.<name>` | any | Instantiate a defined probe template. Override any param inline. |
### Common step fields

```yaml
- name: label
  env: db          # override environment for this step
  retry:           # check: and from: only
    attempts: 5
    delay: 2s
  role: recovery   # check: and from: only
```

`inject:` additionally accepts `for: 60s` (how long to hold the fault) and `env:` for the target environment.
`watch:` additionally requires `every: 5s`.
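Putting the common fields together, a recovery check that retries until the system settles might look like this (the URL and values are illustrative, not from the example above):

```yaml
- name: service is healthy again
  check: http
  url: http://localhost:8080/health
  role: recovery
  retry:
    attempts: 5
    delay: 2s
```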
### http

```yaml
check: http
url: http://localhost:8080/health   # required
method: GET
status: [200, 204]                  # default [200]
body: "ok"                          # assert body contains string
latency: 500ms                      # fail if response time exceeds this
timeout: 10s                        # request deadline, default 10s
headers:
  authorization: "Bearer token"
```

### tcp
```yaml
check: tcp
addr: localhost:5432   # required (host:port)
timeout: 5s
```

### dns
```yaml
check: dns
host: api.example.com
resolver: 8.8.8.8:53    # optional custom resolver
expect_ips: [1.2.3.4]   # optional: assert these IPs appear
```

### exec
```yaml
check: exec
cmd: curl
args: [-sf, http://localhost:8080/health]
expect_exit_code: 0
expect_stdout_contains: "healthy"
```

Runs via `env.Exec` when an environment is set; falls back to the local shell.
### sql: Postgres or MySQL via CLI

```yaml
check: sql
driver: postgres   # postgres | mysql (inferred from dsn prefix if omitted)
dsn: postgres://user:pass@localhost:5432/mydb
query: "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"
expect: "> 0"      # numeric comparison or exact string match
timeout: 300ms     # fail if query exceeds this duration
```

Supported operators: `>`, `<`, `>=`, `<=`, `=`, `!=`. Runs `psql`/`mysql` via the step's environment; works inside containers, pods, and over SSH.
### redis

```yaml
check: redis
addr: localhost:6379
cmd: PING
expect: PONG

# numeric comparisons work too
cmd: SCARD myset
expect: "> 0"
```

Runs `redis-cli` via the step's environment.
### prometheus-metric

```yaml
check: prometheus-metric
url: http://localhost:9090               # required: Prometheus base URL
query: 'rate(http_requests_total[1m])'   # required: PromQL instant query
operator: gt                             # required: gt | lt | eq | gte | lte
threshold: 0                             # value to compare against
```

### exec-loop
```yaml
inject: exec-loop
cmd: pkill
args: [-f, worker]
concurrency: 1     # parallel workers (default 1)
interval: 5s       # delay between iterations (omit for no pause)
on_error: ignore   # ignore | log | abort (default ignore)
for: 60s
```

### process-kill
```yaml
inject: process-kill
target: myapp      # process name or PID
signal: SIGTERM    # default SIGTERM
match_all: false   # kill all matching processes (default: first match)
probability: 1.0   # 0.0–1.0 injection probability
for: 30s
```

### network-latency
```yaml
inject: network-latency
delay: 300ms      # required
jitter: 30ms
interface: eth0   # auto-detected if omitted
for: 60s
```

Uses tc-netem on Linux; falls back to a TCP proxy on other platforms.
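On Linux, the parameters above translate to a tc-netem qdisc roughly like the following (rattle's exact invocation may differ; the interface name is illustrative):

```sh
# approximate tc-netem equivalent of delay: 300ms, jitter: 30ms
tc qdisc add dev eth0 root netem delay 300ms 30ms

# rollback: remove the qdisc
tc qdisc del dev eth0 root
```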
### network-loss

```yaml
inject: network-loss
loss_percent: 10
interface: eth0
for: 30s
```

### network-corruption
```yaml
inject: network-corruption
corrupt_percent: 5
interface: eth0
for: 30s
```

### http-fault: proxy-based HTTP fault injection
```yaml
inject: http-fault
listen_port: 9090
upstream_addr: http://upstream:8080
delay: 200ms
jitter: 50ms
status_code: 503
match_path: ^/api/
match_method: POST
inject_headers:
  x-chaos: "true"
response_headers:
  retry-after: "5"
for: 60s
```

### cpu-stress
```yaml
inject: cpu-stress
load: 80                     # target CPU load percent (0–100)
cores: 4                     # number of cores to stress (default: all)
stress_ng_method: matrixfp   # optional: stress-ng cpu method
affinity: [0, 1]             # optional: pin to these CPU cores
for: 30s
```

### memory-stress
```yaml
inject: memory-stress
bytes: 536870912           # bytes to allocate (536870912 = 512 MiB)
percentage: 0              # alternative: percent of available memory (overrides bytes if > 0)
leak: false                # keep allocated memory without freeing it
allocation_size: 4194304   # chunk size per allocation in bytes (default 4 MiB)
rate: 0                    # bytes/s allocation rate (0 = as fast as possible)
for: 60s
```

### disk-fill
```yaml
inject: disk-fill
path: /tmp
fill_bytes: 10737418240   # bytes to write (10737418240 = 10 GiB)
block_size: 4194304       # write block size in bytes (default 4 MiB)
fsync: false
fill_rate: 0              # bytes/s write rate (0 = unlimited)
file_name: rattle-fill    # filename in path (default: random)
for: 60s
```

### disk-io-stress
```yaml
inject: disk-io-stress
path: /tmp
block_size: 4096
fsync: true
for: 30s
```

### time-skew
```yaml
inject: time-skew
process: myapp   # process name or PID (required)
skew: 3600s      # time offset to inject (requires libfaketime)
for: 30s
```

### http-flood
```yaml
inject: http-flood
url: http://localhost:8080/api
method: GET
concurrency: 50
for: 30s
```

Set a default environment at the top level; override per-step with `env:`.
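For example, a scenario might default to the host but point a single fault at a container (the container name here is illustrative):

```yaml
env: local              # default for all steps

steps:
  - name: slow the database container
    inject: network-latency
    env:
      type: docker
      container: mydb   # hypothetical container name
    delay: 200ms
    for: 30s
```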
### Environment types

`local` runs commands on the host (default when `env:` is omitted):

```yaml
env: local
```

#### docker

```yaml
env:
  type: docker
  container: myapp   # name or ID
```

#### kubernetes
```yaml
env:
  type: kubernetes
  namespace: default
  pod: myapp-abc123
  container: app   # optional
```

#### ssh
```yaml
env:
  type: ssh
  host: 10.0.0.5
  user: ubuntu
  key: ~/.ssh/id_rsa
```

Named environments can be defined once under `define.envs` and referenced by name:

```yaml
define:
  envs:
    db:
      type: docker
      container: postgres-1

steps:
  - inject: network-latency
    env: db
    delay: 300ms
    for: 30s
```

### Probe templates
Define probe templates once under `define.probes` and instantiate them with `from:`. Any param can be overridden inline.

```yaml
define:
  probes:
    api-health:
      check: http
      url: http://localhost:8080/health
      status: [200]
      latency: 500ms

steps:
  - from: probes.api-health   # hypothesis

  - from: probes.api-health   # override a param inline
    url: http://localhost:8080/readyz

  - from: probes.api-health   # recovery with retries
    role: recovery
    retry:
      attempts: 10
      delay: 5s
```

### Schedule modes
```yaml
schedule:
  mode: parallel   # all faults start simultaneously
```

```yaml
schedule:
  mode: staggered
  delay: 10s       # gap between faults
```

### Manual rollback
rattle writes a state file before each fault is applied. If the process is interrupted before rollback completes, inspect pending rollbacks manually:

```sh
rattle cleanup ./.rattle-state.json
```

The output lists each pending rollback with the exact command needed to undo it.
### Run a scenario

```go
import rattle "github.com/burntcarrot/rattle"

// Load from YAML
sc, err := rattle.LoadFile("scenario.yaml")

// Build programmatically
sc := rattle.Build("my-scenario").
	Hypothesis(rattle.ProbeStep{ ... }).
	WithFault(myFault, 30*time.Second, rattle.Params{}).
	OnFailure(rattle.FailureContinue).
	Scenario()

report, err := rattle.Run(ctx, sc,
	rattle.WithLogger(myLogger),
	rattle.WithStateFile("./.rattle-state.json"),
)
```

### Custom fault
```go
type MyFault struct{}

func (f *MyFault) Name() string { return "my-fault" }

func (f *MyFault) Apply(ctx context.Context, env rattle.Environment, p rattle.Params) (rattle.Rollback, error) {
	// inject
	return func(ctx context.Context) error {
		// undo
		return nil
	}, nil
}

func init() {
	rattle.Faults.Register("my-fault", func() rattle.Fault { return &MyFault{} })
}
```

### Custom probe
```go
type MyProbe struct{}

func (p *MyProbe) Name() string { return "my-probe" }

func (p *MyProbe) Check(ctx context.Context, env rattle.Environment, params rattle.Params) error {
	return nil // nil = pass
}

func init() {
	rattle.Probes.Register("my-probe", func() rattle.Probe { return &MyProbe{} })
}
```

### Custom environment
```go
type MyEnv struct{}

func (e *MyEnv) Name() string { return "my-env" }
func (e *MyEnv) Exec(ctx context.Context, cmd string, args ...string) ([]byte, error) { ... }
func (e *MyEnv) Close() error { return nil }
func (e *MyEnv) Init(cfg map[string]any) error { return nil } // optional

func init() {
	rattle.Envs.Register("my-env", func() rattle.Environment { return &MyEnv{} })
}
```

```sh
rattle run <scenario.yaml> [flags]
rattle validate <scenario.yaml>
rattle list <faults|probes|envs>
rattle cleanup [state-file]
```
### Flag reference

#### run

| Flag | Description |
|---|---|
| `--output` | Output format: `console` (default) or `json` |
| `--state-file` | Path to state file for rollback tracking |
| `--dry-run` | Print scenario summary and exit without running |
#### cleanup

Takes an optional positional argument: the path to the state file (default: `.rattle-state.json` in the current directory). Prints pending rollback instructions; it does not execute them automatically.
MIT
rattle is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.
Chaos engineering injects real faults into real systems. Running rattle against a production environment without proper safeguards can cause data loss, outages, or irreversible system changes. You are solely responsible for ensuring that scenarios are safe to run in your environment.
