## Introduction


Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. Unlike traditional testing, chaos experiments proactively inject failures to uncover weaknesses before they cause customer-impacting incidents. This article covers the principles and practical tools for implementing chaos engineering.


## Core Principles


The practice of chaos engineering rests on four principles defined in the Principles of Chaos Engineering:


1. **Build a hypothesis around steady-state behavior**: Define measurable indicators that your system is healthy.
2. **Vary real-world events**: Inject failures that mirror actual production incidents.
3. **Run experiments in production**: Use a small blast radius and automated rollback.
4. **Automate experiments to run continuously**: Chaos should be a regular part of operations.


## Steady-State Hypothesis


Define measurable metrics that represent healthy behavior before and after experiments:


    
```yaml
# steady-state.yml
steady_state_hypothesis:
  title: "Payment service remains available during node failure"
  probes:
    - name: payment-api-health
      type: http
      provider:
        url: "https://api.example.com/health"
        expected_status: 200
        timeout: 5
    - name: payment-latency-p99
      type: promql
      provider:
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{
              service="payment", status="200"
            }[5m])) by (le))
        expected_value:
          max: 0.5  # p99 under 500 ms (the query returns seconds)
    - name: error-rate
      type: promql
      provider:
        query: |
          sum(rate(http_requests_total{
            service="payment", status=~"5.."
          }[5m])) / sum(rate(http_requests_total{
            service="payment"
          }[5m]))
        expected_value:
          max: 0.01  # Error rate under 1%
```
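The hypothesis file is declarative; something still has to evaluate it before, during, and after the experiment. Below is a minimal evaluation sketch, assuming a Prometheus server at the hypothetical `prometheus.example.com` and using only the standard Prometheus HTTP query API:

```python
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical Prometheus endpoint

P99_QUERY = (
    'histogram_quantile(0.99, sum(rate('
    'http_request_duration_seconds_bucket{service="payment", status="200"}[5m]'
    ')) by (le))'
)

def prom_instant_value(query: str) -> float:
    """Run an instant query against the Prometheus HTTP API and return its value."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

def steady_state_holds() -> bool:
    # HTTP probe: health endpoint must answer 200 within 5 seconds
    health = requests.get("https://api.example.com/health", timeout=5)
    # PromQL probe: p99 latency must stay under 0.5 s (missing data counts as a violation)
    return health.status_code == 200 and prom_instant_value(P99_QUERY) < 0.5

if __name__ == "__main__":
    print("steady state holds:", steady_state_holds())
```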
    
    

## Chaos Monkey and Simian Army


Netflix's Chaos Monkey randomly terminates EC2 instances to ensure services survive instance failures. Chaos Monkey for Spring Boot brings the same assault philosophy inside the JVM:


    
```yaml
# Chaos Monkey for Spring Boot: assault and watcher settings
chaos.monkey:
  enabled: true
  assaults:
    level: 3                    # Attack roughly every 3rd request (range 1-10)
    latency-active: true
    latency-range-start: 3000   # Injected latency window, in ms
    latency-range-end: 10000
  watcher:
    controller: true
    restController: true
    service: true
    component: true
    repository: true
```
    
    

To integrate it into a Spring Boot application, wire it up in `application.yml`:


    
```yaml
# application.yml
spring:
  application:
    name: payment-service

chaos:
  monkey:
    enabled: true
    watcher:
      controller: true
    assaults:
      exceptions-active: true
      kill-application-active: false
      memory-active: false
```
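Chaos Monkey for Spring Boot also exposes actuator endpoints, so assaults can be switched on only for the duration of an experiment window. A minimal sketch, assuming the `chaosmonkey` endpoint has been exposed via `management.endpoints.web.exposure.include`:

```python
import requests

BASE = "http://localhost:8080/actuator/chaosmonkey"  # assumes the endpoint is exposed

# Switch assaults on at the start of the experiment window
requests.post(f"{BASE}/enable", timeout=5).raise_for_status()

# Inspect the currently active configuration
print(requests.get(BASE, timeout=5).json())

# ... run the experiment and watch the steady-state probes ...

# Switch assaults off once the window closes
requests.post(f"{BASE}/disable", timeout=5).raise_for_status()
```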
    
    

## LitmusChaos on Kubernetes


LitmusChaos provides declarative chaos experiments as Kubernetes CRDs:


    
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: "production"
    applabel: "app=payment"
    appkind: "deployment"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: payment-health-probe
            type: httpProbe
            httpProbe/inputs:
              url: "http://payment-svc.production:8080/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 1
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"    # Total seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"    # Seconds between successive deletions
            - name: FORCE
              value: "false" # Graceful (non-forced) pod deletion
            - name: RAMP_TIME
              value: "10"    # Warm-up seconds before injection starts
        rank: 1
    - name: pod-cpu-hog
      spec:
        rank: 2
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: CPU_CORES
              value: "1"
```
    
    

Observe experiment results programmatically:


    
```bash
# Get experiment status
kubectl get chaosresult payment-chaos-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'

# Expected output: "Pass" or "Fail"
```
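For CI gates or dashboards, the same verdict can be read with the official Kubernetes Python client (the `kubernetes` package); a minimal polling sketch:

```python
import time

from kubernetes import client, config

def wait_for_verdict(name: str, namespace: str = "production",
                     timeout_s: int = 600) -> str:
    """Poll a LitmusChaos ChaosResult until its verdict is no longer 'Awaited'."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = api.get_namespaced_custom_object(
            group="litmuschaos.io", version="v1alpha1",
            namespace=namespace, plural="chaosresults", name=name)
        verdict = result.get("status", {}).get("experimentStatus", {}).get(
            "verdict", "Awaited")
        if verdict != "Awaited":
            return verdict
        time.sleep(5)
    raise TimeoutError(f"{name}: no verdict within {timeout_s}s")

print(wait_for_verdict("payment-chaos-pod-delete"))
```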
    
    

## Gremlin


Gremlin offers a SaaS platform with a rich set of attack types:


    
```bash
# Install the Gremlin agent
curl -sSL https://get.gremlin.com | sudo bash
sudo gremlin config auth --client-id $CLIENT_ID --client-secret $CLIENT_SECRET

# Run a CPU attack on a specific container
gremlin attack container cpu \
  --container-name payment-api \
  --capacity 1 \
  --length 60 \
  --target $(hostname)

# Blackhole network traffic to a specific host
gremlin attack container blackhole \
  --container-name payment-api \
  --length 30 \
  --destination-ip 10.0.1.50

# Kill a process inside the container
gremlin attack container process \
  --container-name payment-api \
  --process "java" \
  --length 0  # Indefinite until manually halted
```
    
    

Gremlin also exposes a REST API for automated experiment orchestration. The sketch below illustrates the flow with a hypothetical Python client wrapper (check the official API reference for exact endpoints):


    
```python
# Hypothetical client wrapper around Gremlin's REST API
import gremlinapi

client = gremlinapi.Client(api_key="...")

experiment = client.create_experiment(
    name="Payment Service Node Failure",
    blast_radius={"targets": {"tags": {"service": "payment"}}},
    attacks=[{
        "type": "Shutdown",
        "target": {"type": "RandomPod", "count": 1},
        "length": 120,
    }],
    hypothesis={
        "metrics": [
            {"type": "latency", "query": "p99_latency{service='payment'}",
             "threshold": 1000, "comparison": "less_than"},
        ]
    },
)
experiment.run()
experiment.wait_for_completion()
```
    
    

## Blast Radius Control


Always limit the scope of chaos experiments:


    
```yaml
# LitmusChaos: blast-radius constraints on the ChaosEngine
spec:
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: PODS_AFFECTED_PERC
              value: "20"      # Affect at most 20% of matching pods
            # TARGET_PODS can instead pin specific pods by name (comma-separated)
            - name: SEQUENCE
              value: "serial"  # Delete pods one at a time, not in parallel
# Time-boxing (e.g. weekdays at 2 PM only) is handled by wrapping the engine
# in a separate ChaosSchedule resource rather than by the engine itself.
```
    
    

## Game Days


Game days are structured chaos exercises involving the whole team:


    
```markdown
# Game Day Plan: Payment Service Outage

## Scenario
Primary payment database experiences a regional failure.

## Timeline
1. T-15min: Brief team on scenario and objectives
2. T-0: Inject failure (block database traffic)
3. T+5min: Monitor alerts and team response
4. T+15min: Declare incident if threshold breached
5. T+30min: Evaluate failover mechanisms
6. T+60min: Restore and debrief

## Success Criteria
- [ ] Read traffic served from replica within 30s
- [ ] Failed payments queued for retry
- [ ] Alert triggers within 2 minutes
- [ ] No data loss

## Rollback Plan
- Abort the ChaosEngine by patching `spec.engineState: stop`
- Verify replica promotion succeeded
- Confirm application health endpoints
```
    
    

## Experiment Design Checklist


- **Start small**: begin in staging, target non-critical services
- **Automate rollback**: define explicit abort conditions (see the watchdog sketch after this list)
- **Monitor continuously**: observe dashboards during experiments
- **Document findings**: share results in blameless post-mortems
- **Incrementally increase scope**: expand blast radius and attack types gradually
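Abort conditions should be enforced by a watchdog, not by a human watching a dashboard. A minimal sketch for LitmusChaos, reusing the `steady_state_holds()` check from earlier: it halts a running ChaosEngine by patching `engineState: stop` (Litmus's abort mechanism) whenever the hypothesis is violated.

```python
import time

from kubernetes import client, config

def abort_engine(name: str, namespace: str = "production") -> None:
    """Abort a running LitmusChaos experiment by setting engineState to stop."""
    api = client.CustomObjectsApi()
    api.patch_namespaced_custom_object(
        group="litmuschaos.io", version="v1alpha1",
        namespace=namespace, plural="chaosengines", name=name,
        body={"spec": {"engineState": "stop"}})

def watchdog(engine: str, check_steady_state, duration_s: int = 300) -> None:
    """Poll the steady-state hypothesis and abort the experiment on violation."""
    config.load_kube_config()
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if not check_steady_state():  # e.g. steady_state_holds() from earlier
            abort_engine(engine)
            raise RuntimeError(f"Steady state violated; aborted {engine}")
        time.sleep(10)

# watchdog("payment-chaos", steady_state_holds)
```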

Chaos engineering transforms the way teams think about reliability. Instead of hoping failures do not happen, you proactively prove your system survives them. Start with weekly pod-delete experiments in staging, then graduate to more complex scenarios like network partitions and regional failures in production.