
Chaos Monkey Testing

The Fire Drill Analogy

Why do schools practice fire drills? Because when a real emergency happens, panic is the enemy. Fire drills teach muscle memory: evacuate calmly, check for stragglers, meet at the rally point.

Chaos Monkey does the same for your code. It simulates disasters in development so production failures don't surprise you.

Imagine launching your app only to discover:

  • ❌ Users with slow connections see eternal loading spinners
  • ❌ Dropped WebSocket messages break your chat feature
  • ❌ API timeouts cause white screens

With Chaos Monkey, you discover these issues BEFORE users do.

The Problem: Production Is a Hostile Environment

Your development environment is a lie:

  • Network: Localhost is instant. Real users often see 200ms of latency or more
  • Reliability: Local APIs always respond. Production returns 503 errors
  • Concurrency: One user testing locally. Production may have 10,000 simultaneous users

The "Works on My Machine" trap:

typescript
// This works in development
bus.emit('chat:message', message);
// Assumes message always arrives

But in production:

  • 10% of messages dropped (packet loss)
  • 500ms delay (network congestion)
  • Random errors (server restarts)
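
To put a 10% drop rate in perspective: losses compound when a flow needs several events in a row. The sketch below works through that arithmetic, assuming each event is dropped independently. The function name is illustrative and not part of Nexus.

```typescript
// Probability that every event in a flow of `steps` events arrives,
// assuming each one is dropped independently with probability `dropRate`.
function flowSuccessProbability(dropRate: number, steps: number): number {
  return Math.pow(1 - dropRate, steps);
}

// A flow that needs 5 events in a row at 10% loss:
// 0.9^5 = 0.59049, so roughly 4 in 10 attempts lose at least one event.
const fiveStepFlow = flowSuccessProbability(0.1, 5);
console.log(fiveStepFlow.toFixed(2)); // "0.59"
```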

The Solution: Controlled Chaos

Nexus Chaos Monkey injects failures into your event bus:

typescript
const bus = new Nexus({
  chaos: {
    dropRate: 0.1,       // Drop 10% of events
    maxDelay: 2000,      // Add up to 2s latency
    exclude: ['auth:*']  // Never break authentication
  }
});

Every event now passes through simulated real-world conditions. If your app still behaves correctly with chaos enabled, it is far more likely to survive real network failures.

How Chaos Monkey Works

Behind the scenes:

typescript
emit(event, payload, options = {}) {
  if (this.config.chaos && !options.fromRemote) {
    const { dropRate = 0.1, maxDelay = 1000, exclude = [] } = this.config.chaos;

    // Skip chaos for critical events.
    // A trailing '*' matches any suffix, so 'auth:*' covers 'auth:login'.
    const isExcluded = exclude.some((pattern) =>
      pattern.endsWith('*') ? event.startsWith(pattern.slice(0, -1)) : pattern === event
    );

    if (!isExcluded) {
      // 1. Packet Loss Simulation
      if (Math.random() < dropRate) {
        console.warn(`[Chaos] 🔥 Dropped event: ${event}`);
        return; // Event lost!
      }

      // 2. Network Latency Simulation (delays of 50ms or less are delivered immediately)
      const delay = Math.random() * maxDelay;
      if (delay > 50) {
        setTimeout(() => {
          // Re-emit without chaos to avoid infinite recursion
          this.emitWithoutChaos(event, payload);
        }, delay);
        return; // Delayed execution
      }
    }
  }

  // Normal execution
  this.executeListeners(event, payload);
}
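
The drop-then-delay decision above can also be written as a pure function, which makes it easy to test with a fixed sequence of "random" numbers. This is only a sketch of the same logic, not the library's code: `decideChaos`, `ChaosDecision`, and the `minDelay` option are names introduced here for illustration.

```typescript
type ChaosDecision =
  | { kind: 'drop' }
  | { kind: 'delay'; ms: number }
  | { kind: 'deliver' };

interface ChaosOptions {
  dropRate: number;  // 0.0 to 1.0
  maxDelay: number;  // milliseconds
  minDelay?: number; // delays at or below this are delivered immediately
}

// `random` is injectable so tests can replace Math.random with a fixed sequence.
function decideChaos(
  options: ChaosOptions,
  random: () => number = Math.random
): ChaosDecision {
  const { dropRate, maxDelay, minDelay = 50 } = options;

  // First roll: packet loss
  if (random() < dropRate) return { kind: 'drop' };

  // Second roll: latency
  const ms = random() * maxDelay;
  return ms > minDelay ? { kind: 'delay', ms } : { kind: 'deliver' };
}
```

Separating the decision from the side effects (logging, `setTimeout`, listener execution) keeps the random behaviour in one small place.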

Basic Example: Chat Application

typescript
import { Nexus } from '@caeligo/nexus-orchestrator';

// Enable Chaos Monkey in development
const bus = new Nexus({
  chaos: {
    dropRate: 0.15,      // 15% message loss (harsh!)
    maxDelay: 3000,      // Up to 3 seconds delay
    exclude: ['app:init', 'auth:*'] // Don't break critical flows
  },
  debug: true // See chaos logs
});

// Send message
function sendMessage(text: string) {
  const messageId = Math.random().toString(36).slice(2);

  // UI: Show "sending..." indicator before emitting
  addMessageToUI(messageId, text, { status: 'sending' });

  bus.emit('chat:send', {
    id: messageId,
    text,
    timestamp: Date.now()
  });

  // After 5 seconds without confirmation, assume failure.
  // The timer lives here, not in the handler, because a dropped
  // event never reaches the handler at all.
  setTimeout(() => {
    if (getMessageStatus(messageId) === 'sending') {
      updateMessageStatus(messageId, 'failed');
      showRetryButton(messageId);
    }
  }, 5000);
}

// Handle message confirmation
bus.on('chat:send', (message) => {
  // This might never run (chaos drops the event) or run up to 3s late
  updateMessageStatus(message.id, 'sent');
});

What you'll discover:

  • Without a timeout, messages stay stuck in the "sending" state forever
  • Users see "failed" messages they need to retry
  • Need to implement message queuing and retry logic
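
One way to handle the timeout side of this is a small tracker that records a deadline for each outgoing message. The sketch below is independent of Nexus: `PendingMessages` and its methods are hypothetical names, and the current time is passed in explicitly so the logic can be tested without real timers.

```typescript
type MessageStatus = 'sending' | 'sent' | 'failed';

class PendingMessages {
  private readonly timeoutMs: number;
  private deadlines = new Map<string, number>();
  private statuses = new Map<string, MessageStatus>();

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call when the message is emitted.
  track(id: string, now: number): void {
    this.statuses.set(id, 'sending');
    this.deadlines.set(id, now + this.timeoutMs);
  }

  // Call when the confirmation arrives. A late confirmation still wins over 'failed'.
  confirm(id: string): void {
    if (this.statuses.has(id)) {
      this.statuses.set(id, 'sent');
      this.deadlines.delete(id);
    }
  }

  // Call periodically. Marks overdue messages as failed and returns their ids.
  expire(now: number): string[] {
    const failed: string[] = [];
    this.deadlines.forEach((deadline, id) => {
      if (now >= deadline) {
        this.statuses.set(id, 'failed');
        this.deadlines.delete(id);
        failed.push(id);
      }
    });
    return failed;
  }

  status(id: string): MessageStatus | undefined {
    return this.statuses.get(id);
  }
}
```

In the chat example, `track` would run inside `sendMessage`, `confirm` inside the `chat:send` handler, and the ids returned by `expire` are the messages that need a retry button.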

Real-World Example: E-Commerce with Chaos

typescript
const bus = new Nexus({
  chaos: {
    dropRate: 0.1,
    maxDelay: 2000,
    exclude: [
      'payment:*',      // Never break payments
      'auth:*',         // Never break auth
      'security:*'      // Never break security
    ]
  }
});

// Add to cart (might fail or delay)
function addToCart(productId: number) {
  // Show optimistic UI
  showCartBadge('+1');
  
  bus.emit('cart:add', { productId });
  
  // Set timeout to detect chaos-dropped events
  const timeoutId = setTimeout(() => {
    // Event never arrived - rollback UI
    showCartBadge('-1');
    showNotification('Failed to add to cart. Please try again.');
  }, 5000);
  
  // Clear timeout on success
  bus.once('cart:add-success', () => {
    clearTimeout(timeoutId);
  });
}

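// Handler with chaos-aware logic
bus.on('cart:add', async (payload) => {
  const response = await fetch('/api/cart/add', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });

  // fetch() only rejects on network failure, so surface HTTP errors too
  if (!response.ok) {
    throw new Error(`Cart add failed: HTTP ${response.status}`);
  }

  // Confirm success
  bus.emit('cart:add-success', payload);
}, {
  // Errors are thrown rather than caught inside the handler, on the
  // assumption that a thrown error is what triggers the retries below
  attempts: 3,  // Combine with resilience!
  backoff: 1000,
  fallback: (err) => {
    console.error('Cart add failed after retries:', err);
    bus.emit('cart:add-failure', { error: 'Network error' });
  }
});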

Chaos reveals:

  • Need optimistic UI updates
  • Need timeout detection
  • Need retry/rollback logic
  • Need user feedback for failures

Edge Case: Chaos During Tests

Should you enable Chaos Monkey in automated tests?

For unit tests: ❌ No. Tests should be deterministic.

For integration tests: ✅ Yes. Create chaos-specific test suites:

typescript
import { describe, it, expect } from 'vitest';
import { Nexus } from '@caeligo/nexus-orchestrator';

describe('Chat with Chaos', () => {
  it('should handle dropped messages', async () => {
    const bus = new Nexus({
      chaos: { dropRate: 0.5, maxDelay: 100 }
    });
    
    let receivedCount = 0;
    bus.on('message', () => receivedCount++);
    
    // Send 100 messages
    for (let i = 0; i < 100; i++) {
      bus.emit('message', { id: i });
    }
    
    await new Promise(resolve => setTimeout(resolve, 500));
    
    // With 50% drop rate, expect ~50 messages
    expect(receivedCount).toBeGreaterThan(30);
    expect(receivedCount).toBeLessThan(70);
  });
});
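
The 30 to 70 window in the test above is about four standard deviations either side of the expected 50 (for 100 events at a 50% drop rate, the standard deviation is √(100 × 0.5 × 0.5) = 5), so it should fail only very rarely, but it is still not strictly deterministic. If a suite must be fully reproducible, one option is to drive the drop decision from a seeded generator. The sketch below assumes you control the random roll yourself, for example in a custom wrapper. Nothing in this guide shows Nexus accepting a custom random source, and `countDelivered` is a hypothetical helper.

```typescript
// mulberry32: a small seeded PRNG that returns values in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Counts how many of `total` events survive a drop roll at `dropRate`.
function countDelivered(seed: number, total: number, dropRate: number): number {
  const random = mulberry32(seed);
  let delivered = 0;
  for (let i = 0; i < total; i++) {
    if (random() >= dropRate) delivered++;
  }
  return delivered;
}

// Same seed, same outcome on every run.
const firstRun = countDelivered(42, 100, 0.5);
const secondRun = countDelivered(42, 100, 0.5);
console.log(firstRun === secondRun); // true
```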

Edge Case: Selective Chaos

Different chaos levels for different event types:

typescript
// Custom chaos wrapper
function emitWithChaos(event: string, payload: any, chaosLevel: number) {
  const shouldDrop = Math.random() < chaosLevel;
  
  if (shouldDrop) {
    console.warn(`[Custom Chaos] Dropped ${event}`);
    return;
  }
  
  bus.emit(event, payload);
}

// Usage
emitWithChaos('analytics:track', data, 0.3);  // 30% chaos
emitWithChaos('ui:update', data, 0.05);       // 5% chaos
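
Passing the chaos level at every call site gets repetitive. As a possible refinement, the levels can live in one table keyed by event pattern. This is a sketch: `chaosLevels`, `matchesPattern`, and `chaosLevelFor` are illustrative names, and the trailing-`*` matching is an assumption modelled on the `exclude` patterns used earlier.

```typescript
interface ChaosLevel {
  pattern: string;  // exact event name, or a prefix ending in '*'
  dropRate: number; // 0.0 to 1.0
}

// First matching entry wins, so list more specific patterns first.
const chaosLevels: ChaosLevel[] = [
  { pattern: 'auth:*', dropRate: 0 },
  { pattern: 'analytics:*', dropRate: 0.3 },
  { pattern: 'ui:*', dropRate: 0.05 },
];

function matchesPattern(pattern: string, event: string): boolean {
  return pattern.endsWith('*')
    ? event.startsWith(pattern.slice(0, -1))
    : pattern === event;
}

// Returns the drop rate for an event, or `fallback` when no pattern matches.
function chaosLevelFor(event: string, fallback: number = 0.1): number {
  const match = chaosLevels.find((level) => matchesPattern(level.pattern, event));
  return match ? match.dropRate : fallback;
}
```

With this in place, the wrapper call becomes `emitWithChaos(event, payload, chaosLevelFor(event))`.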

Pattern: Chaos Dashboard

Visualize chaos impact in development:

typescript
let chaosStats = {
  dropped: 0,
  delayed: 0,
  total: 0
};

// Override emit to track chaos
const originalEmit = bus.emit.bind(bus);
bus.emit = function(event, payload, options) {
  chaosStats.total++;
  
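  // Estimate only: this is a separate random roll from the one chaos makes
  // internally, so it tracks the expected drop rate, not the actual drops
  if (this.config.chaos && Math.random() < this.config.chaos.dropRate) {
    chaosStats.dropped++;
  }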
  
  return originalEmit(event, payload, options);
};

// Dashboard UI
function renderChaosDashboard() {
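  // Guard against division by zero before any event has been emitted
  const dropRate = chaosStats.total === 0
    ? '0.0'
    : (chaosStats.dropped / chaosStats.total * 100).toFixed(1);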
  
  return `
    <div class="chaos-dashboard">
      <h3>🐵 Chaos Monkey Stats</h3>
      <p>Total Events: ${chaosStats.total}</p>
      <p>Dropped: ${chaosStats.dropped} (${dropRate}%)</p>
      <p>Delayed: ${chaosStats.delayed}</p>
    </div>
  `;
}

// Update every second
setInterval(() => {
  const container = document.getElementById('chaos-stats');
  if (container) container.innerHTML = renderChaosDashboard();
}, 1000);
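
Because the emit override rolls its own random number, its "dropped" figure is an estimate. A more direct measurement is to count emits and actual deliveries, then take the difference. The class below is a minimal sketch with hypothetical names. It assumes you call `recordEmit` wherever you emit and `recordDelivery` from a listener on each event you track.

```typescript
class ChaosCounter {
  private emitted = 0;
  private delivered = 0;

  recordEmit(): void {
    this.emitted++;
  }

  recordDelivery(): void {
    this.delivered++;
  }

  // `missing` covers dropped events plus any that are still delayed in flight.
  snapshot(): { emitted: number; delivered: number; missing: number; missingPercent: number } {
    const missing = this.emitted - this.delivered;
    const missingPercent = this.emitted === 0 ? 0 : (missing / this.emitted) * 100;
    return { emitted: this.emitted, delivered: this.delivered, missing, missingPercent };
  }
}
```

Since delayed events count as missing until they arrive, the figure settles only once `maxDelay` has elapsed after the last emit.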

Pattern: Graduated Chaos

Start with mild chaos, increase over time:

typescript
let chaosIntensity = 0.05; // Start at 5%

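const bus = new Nexus({
  // The getter only has an effect if Nexus reads config.chaos on every emit
  // rather than copying it once at construction time
  get chaos() {
    return {
      dropRate: chaosIntensity,
      maxDelay: chaosIntensity * 10000 // 5% = 500ms max
    };
  }
});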

// Increase chaos every minute
setInterval(() => {
  chaosIntensity = Math.min(chaosIntensity + 0.05, 0.5); // Max 50%
  console.log(`Chaos intensity: ${(chaosIntensity * 100).toFixed(0)}%`);
}, 60000);
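
The same ramp can be expressed as a pure function of elapsed time, which avoids accumulating floating-point steps and is easy to unit test. This is a sketch under the same numbers as above (5% start, 5% per minute, 50% cap). `chaosIntensityAt` and `chaosConfigAt` are illustrative names.

```typescript
// Intensity after `elapsedMinutes`: starts at `start`, rises by `step`
// per whole minute, and never exceeds `max`.
function chaosIntensityAt(
  elapsedMinutes: number,
  start: number = 0.05,
  step: number = 0.05,
  max: number = 0.5
): number {
  const steps = Math.max(0, Math.floor(elapsedMinutes));
  return Math.min(start + steps * step, max);
}

// Derive the chaos settings from the intensity (5% -> 500ms max delay).
function chaosConfigAt(elapsedMinutes: number): { dropRate: number; maxDelay: number } {
  const intensity = chaosIntensityAt(elapsedMinutes);
  return { dropRate: intensity, maxDelay: Math.round(intensity * 10000) };
}
```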

Pattern: Chaos Scenarios

Test specific failure modes:

typescript
// Scenario 1: Total Network Outage
function simulateNetworkOutage(duration: number) {
  const originalChaos = bus.config.chaos;
  
  bus.config.chaos = {
    dropRate: 1.0,  // Drop everything
    maxDelay: 0
  };
  
  setTimeout(() => {
    bus.config.chaos = originalChaos;
    console.log('Network restored');
  }, duration);
}

// Scenario 2: High Latency
function simulateHighLatency() {
  bus.config.chaos = {
    dropRate: 0,
    maxDelay: 10000  // 10 second delays!
  };
}

// Scenario 3: Packet Loss Storm
function simulatePacketLoss() {
  bus.config.chaos = {
    dropRate: 0.8,  // 80% loss
    maxDelay: 500
  };
}

// Test button in UI
document.getElementById('test-outage')?.addEventListener('click', () => {
  simulateNetworkOutage(5000);
});
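
Scenarios 2 and 3 above overwrite the chaos config and never put it back. One possible tidy-up is to keep the presets in a table and have a single function return a restore callback. The sketch assumes, as the scenarios above do, that `bus.config.chaos` can be reassigned at runtime. `ChaosTarget`, `scenarios`, and `applyScenario` are hypothetical names.

```typescript
interface ChaosConfig {
  dropRate: number;
  maxDelay: number;
  exclude?: string[];
}

// Minimal shape this sketch needs from the bus (an assumption, not the real type).
interface ChaosTarget {
  config: { chaos?: ChaosConfig | undefined };
}

const scenarios = {
  outage: { dropRate: 1, maxDelay: 0 },          // drop everything
  highLatency: { dropRate: 0, maxDelay: 10000 }, // 10 second delays
  packetLoss: { dropRate: 0.8, maxDelay: 500 },  // 80% loss
};

type ScenarioName = keyof typeof scenarios;

// Applies a preset and returns a function that restores the previous config.
function applyScenario(target: ChaosTarget, name: ScenarioName): () => void {
  const previous = target.config.chaos;
  target.config.chaos = { ...scenarios[name] };
  return () => {
    target.config.chaos = previous;
  };
}
```

A timed outage then becomes `const restore = applyScenario(bus, 'outage'); setTimeout(restore, 5000);`.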

Comparison with Other Testing Approaches

Approach          | Realism | Effort | Coverage
Manual Testing    | Low     | High   | Spotty
Unit Tests        | Low     | Medium | Good
Integration Tests | Medium  | High   | Good
Chaos Monkey      | High    | Low    | Excellent

Safety Checklist

Before enabling Chaos Monkey:

  • [ ] Only enable in development and staging
  • [ ] Never enable in production (obvious, but worth stating)
  • [ ] Exclude critical events (auth, payments, security)
  • [ ] Warn team members (console message on init)
  • [ ] Document chaos behavior in README
  • [ ] Use environment variables to control chaos

typescript
const bus = new Nexus({
  chaos: process.env.NODE_ENV === 'development' ? {
    dropRate: 0.1,
    maxDelay: 2000,
    exclude: ['auth:*', 'payment:*']
  } : undefined  // Disabled in production
});
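
Several checklist items can be combined into one small function: allow chaos only in development or staging, require an explicit opt-in, and announce it on init. The sketch below takes the environment and a warning callback as parameters so it can be tested without touching `process.env`. `chaosForEnvironment` is a hypothetical name, and the default values simply reuse the numbers from this guide.

```typescript
interface ChaosConfig {
  dropRate: number;
  maxDelay: number;
  exclude: string[];
}

type Env = Record<string, string | undefined>;

// Returns a chaos config only for development/staging with ENABLE_CHAOS=true.
// Calls `warn` once whenever chaos is switched on.
function chaosForEnvironment(
  env: Env,
  warn: (message: string) => void
): ChaosConfig | undefined {
  const nodeEnv = env['NODE_ENV'];
  const allowed = nodeEnv === 'development' || nodeEnv === 'staging';
  if (!allowed || env['ENABLE_CHAOS'] !== 'true') return undefined;

  const config: ChaosConfig = {
    dropRate: 0.1,
    maxDelay: 2000,
    exclude: ['auth:*', 'payment:*', 'security:*'],
  };
  warn(`[Chaos] enabled: dropRate=${config.dropRate}, maxDelay=${config.maxDelay}ms`);
  return config;
}
```

Usage would look like `new Nexus({ chaos: chaosForEnvironment(process.env, console.warn) })`.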

Debugging Chaos Issues

When chaos reveals a bug:

  1. Check console: Chaos logs dropped events
  2. Disable chaos: Verify bug is chaos-related
  3. Add timeouts: Detect missing events
  4. Add retries: Use resilience features
  5. Add fallbacks: Graceful degradation

typescript
// Example: Debugging dropped event
bus.on('api:fetch', async (payload) => {
  console.log('[Handler] api:fetch called');
  
  const result = await fetch('/api/data');
  
  console.log('[Handler] api:fetch completed');
  bus.emit('api:success', result);
});

// If you see "called" but not "completed" → handler crashed
// If you see neither → event was dropped by chaos

Next Steps

Now that your app is chaos-tested, explore the intelligent features.

Quick Reference

Enable Chaos:

typescript
const bus = new Nexus({
  chaos: {
    dropRate: 0.1,        // 0.0 to 1.0 (10% = drop 1 in 10 events)
    maxDelay: 2000,       // Milliseconds
    exclude: ['critical:*'] // Event patterns to never chaos-test
  }
});

Disable Chaos:

typescript
const bus = new Nexus(); // No chaos config = disabled

Conditional Chaos:

typescript
const bus = new Nexus({
  // Compare against 'true': any non-empty string, including 'false', is truthy
  chaos: process.env.ENABLE_CHAOS === 'true' ? { dropRate: 0.1, maxDelay: 1000 } : undefined
});

Environment-based:

typescript
const bus = new Nexus({
  chaos: {
    dropRate: parseFloat(process.env.CHAOS_DROP_RATE || '0'),
    maxDelay: parseInt(process.env.CHAOS_MAX_DELAY || '0', 10)
  }
});
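
Environment variables are free-form strings, so a typo such as `CHAOS_DROP_RATE=abc` yields `NaN`, and `CHAOS_DROP_RATE=10` (meant as 10%) would drop every event. A defensive sketch that validates and clamps the values before they reach the config. `parseRate` and `parseDelay` are hypothetical helpers.

```typescript
// Parses a drop rate and clamps it to the valid 0.0 to 1.0 range.
// Unparseable input falls back to 0 (no drops).
function parseRate(raw: string | undefined): number {
  const value = Number.parseFloat(raw ?? '');
  if (Number.isNaN(value)) return 0;
  return Math.min(Math.max(value, 0), 1);
}

// Parses a delay in whole milliseconds.
// Unparseable or negative input falls back to 0 (no delay).
function parseDelay(raw: string | undefined): number {
  const value = Number.parseInt(raw ?? '', 10);
  return Number.isNaN(value) || value < 0 ? 0 : value;
}
```

These would slot into the config as `dropRate: parseRate(process.env.CHAOS_DROP_RATE)` and `maxDelay: parseDelay(process.env.CHAOS_MAX_DELAY)`.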

Released under the MIT License.