Chaos Monkey Testing
The Fire Drill Analogy
Why do schools practice fire drills? Because when a real emergency happens, panic is the enemy. Fire drills teach muscle memory: evacuate calmly, check for stragglers, meet at the rally point.
Chaos Monkey does the same for your code. It simulates disasters in development so production failures don't surprise you.
Imagine launching your app only to discover:
- ❌ Users with slow connections see eternal loading spinners
- ❌ Dropped WebSocket messages break your chat feature
- ❌ API timeouts cause white screens
With Chaos Monkey, you discover these issues BEFORE users do.
The Problem: Production Is a Hostile Environment
Your development environment is a lie:
- Network: Localhost is near-instant. Real users often see 200 ms of latency or more
- Reliability: Local APIs almost always respond. Production returns 503 errors
- Concurrency: One user testing locally. Production may have 10,000 simultaneous users
The "Works on My Machine" trap:
// This works in development
bus.emit('chat:message', message);
// Assumes message always arrives
But in production:
- 10% of messages dropped (packet loss)
- 500ms delay (network congestion)
- Random errors (server restarts)
The Solution: Controlled Chaos
Nexus Chaos Monkey injects failures into your event bus:
const bus = new Nexus({
chaos: {
dropRate: 0.1, // Drop 10% of events
maxDelay: 2000, // Add up to 2s latency
exclude: ['auth:*'] // Never break authentication
}
});
Every event now passes through simulated real-world conditions. If your app behaves correctly with chaos enabled, it is far better prepared for production.
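The `exclude` entries above use patterns such as `auth:*`. This page does not show how Nexus matches them, so the following is only a sketch of one plausible rule: a trailing `*` matches any suffix, and anything else must match exactly. The helper names `matchesPattern` and `isExcluded` are hypothetical and not part of the library.

```typescript
// Hypothetical helpers, not part of Nexus: one plausible way to match exclude patterns.

/** Returns true when `event` matches `pattern`; a trailing '*' matches any suffix. */
function matchesPattern(event: string, pattern: string): boolean {
  if (pattern.endsWith('*')) {
    return event.startsWith(pattern.slice(0, -1));
  }
  return event === pattern;
}

/** Returns true when any pattern in `exclude` matches `event`. */
function isExcluded(event: string, exclude: string[]): boolean {
  return exclude.some((pattern) => matchesPattern(event, pattern));
}

const exclude = ['app:init', 'auth:*'];
console.log(isExcluded('auth:login', exclude)); // true
console.log(isExcluded('app:init', exclude)); // true
console.log(isExcluded('chat:send', exclude)); // false
```

A rule like this keeps whole event families (everything under `auth:`) out of chaos without listing each event name.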
How Chaos Monkey Works
Behind the scenes:
emit(event, payload, options = {}) {
if (this.config.chaos && !options.fromRemote) {
const { dropRate = 0.1, maxDelay = 1000, exclude = [] } = this.config.chaos;
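// Skip chaos for critical events
// (simplified: includes() is an exact-name check, so patterns like 'auth:*' need real wildcard matching)
if (!exclude.includes(event)) {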
// 1. Packet Loss Simulation
if (Math.random() < dropRate) {
console.warn(`[Chaos] 🔥 Dropped event: ${event}`);
return; // Event lost!
}
// 2. Network Latency Simulation
const delay = Math.random() * maxDelay;
if (delay > 50) {
setTimeout(() => {
// Re-emit without chaos to avoid infinite recursion
this.emitWithoutChaos(event, payload);
}, delay);
return; // Delayed execution
}
}
}
// Normal execution
this.executeListeners(event, payload);
}
Basic Example: Chat Application
import { Nexus } from '@caeligo/nexus-orchestrator';
// Enable Chaos Monkey in development
const bus = new Nexus({
chaos: {
dropRate: 0.15, // 15% message loss (harsh!)
maxDelay: 3000, // Up to 3 seconds delay
exclude: ['app:init', 'auth:*'] // Don't break critical flows
},
debug: true // See chaos logs
});
// Send message
function sendMessage(text: string) {
const messageId = Math.random().toString(36);
// UI: Show "sending..." indicator before emitting, so a synchronous confirmation cannot arrive first
addMessageToUI(messageId, text, { status: 'sending' });
bus.emit('chat:send', {
id: messageId,
text,
timestamp: Date.now()
});
// Start the failure timer at send time, so it also fires when chaos drops the event
setTimeout(() => {
if (getMessageStatus(messageId) === 'sending') {
updateMessageStatus(messageId, 'failed');
showRetryButton(messageId);
}
}, 5000);
}
// Handle message confirmation
bus.on('chat:send', (message) => {
// This might never run (chaos drops the event) or run late (chaos delays it)
updateMessageStatus(message.id, 'sent');
});
What you'll discover:
- Messages stuck in "sending" state forever
- Users see "failed" messages they need to retry
- Need to implement message queuing and retry logic
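The queuing and retry logic from the last point could start from something like the sketch below. It is independent of Nexus: `RetryQueue`, `maxAttempts`, and the `send` callback are hypothetical names, and a real version would probably add backoff between attempts.

```typescript
// Hypothetical sketch, not part of Nexus: track unconfirmed messages and resend them.
interface PendingMessage {
  id: string;
  text: string;
  attempts: number;
}

class RetryQueue {
  private pending = new Map<string, PendingMessage>();
  private send: (message: PendingMessage) => void;
  private maxAttempts: number;

  constructor(send: (message: PendingMessage) => void, maxAttempts = 3) {
    this.send = send;
    this.maxAttempts = maxAttempts;
  }

  /** Sends a message for the first time and remembers it until confirmed. */
  enqueue(id: string, text: string): void {
    const message: PendingMessage = { id, text, attempts: 1 };
    this.pending.set(id, message);
    this.send(message);
  }

  /** Call when the confirmation event arrives. */
  confirm(id: string): void {
    this.pending.delete(id);
  }

  /** Call from a timer: resends unconfirmed messages and returns ids that ran out of attempts. */
  retryPending(): string[] {
    const failed: string[] = [];
    for (const message of Array.from(this.pending.values())) {
      if (message.attempts >= this.maxAttempts) {
        this.pending.delete(message.id);
        failed.push(message.id);
      } else {
        message.attempts += 1;
        this.send(message);
      }
    }
    return failed;
  }
}

const sent: string[] = [];
const queue = new RetryQueue((m) => { sent.push(`${m.id}#${m.attempts}`); }, 2);
queue.enqueue('a', 'hello');
queue.enqueue('b', 'world');
queue.confirm('a');                  // 'a' was confirmed
queue.retryPending();                // resends 'b' as attempt 2
const failed = queue.retryPending(); // 'b' is out of attempts
console.log(sent);   // [ 'a#1', 'b#1', 'b#2' ]
console.log(failed); // [ 'b' ]
```

In the chat example, `send` would wrap `bus.emit('chat:send', ...)`, and the ids returned by `retryPending()` are the ones to mark as failed in the UI.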
Real-World Example: E-Commerce with Chaos
const bus = new Nexus({
chaos: {
dropRate: 0.1,
maxDelay: 2000,
exclude: [
'payment:*', // Never break payments
'auth:*', // Never break auth
'security:*' // Never break security
]
}
});
// Add to cart (might fail or delay)
function addToCart(productId: number) {
// Show optimistic UI
showCartBadge('+1');
bus.emit('cart:add', { productId });
// Set timeout to detect chaos-dropped events
const timeoutId = setTimeout(() => {
// Event never arrived - rollback UI
showCartBadge('-1');
showNotification('Failed to add to cart. Please try again.');
}, 5000);
// Clear timeout on success
bus.once('cart:add-success', () => {
clearTimeout(timeoutId);
});
}
// Handler with chaos-aware logic
bus.on('cart:add', async (payload) => {
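try {
const response = await fetch('/api/cart/add', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload)
});
// fetch() only rejects on network errors, so treat HTTP error statuses as failures too
if (!response.ok) throw new Error(`Cart API responded with ${response.status}`);
// Confirm success
bus.emit('cart:add-success', payload);
} catch (err) {
// Errors caught here do not propagate, so the retry options below never see them
const message = err instanceof Error ? err.message : String(err);
bus.emit('cart:add-failure', { ...payload, error: message });
}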
}, {
attempts: 3, // Combine with resilience!
backoff: 1000,
fallback: (err) => {
console.error('Cart add failed after retries:', err);
bus.emit('cart:add-failure', { error: 'Network error' });
}
});
Chaos reveals:
- Need optimistic UI updates
- Need timeout detection
- Need retry/rollback logic
- Need user feedback for failures
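The first three points (optimistic updates, timeout detection, rollback) can be modelled without any UI or bus. The sketch below is not Nexus API; `OptimisticCart` and its method names are hypothetical. It keeps each unconfirmed change under an operation id so that a timeout or failure undoes exactly that change.

```typescript
// Hypothetical sketch, not part of Nexus: optimistic cart count with per-operation rollback.
class OptimisticCart {
  private confirmed = 0;
  private pending = new Map<string, number>(); // operation id -> quantity delta

  /** Count shown in the UI: confirmed items plus every unconfirmed change. */
  get displayCount(): number {
    let total = this.confirmed;
    this.pending.forEach((delta) => { total += delta; });
    return total;
  }

  /** Apply a change immediately, before the server has answered. */
  addOptimistic(operationId: string, quantity = 1): void {
    this.pending.set(operationId, quantity);
  }

  /** Success event arrived: make the change permanent. */
  confirm(operationId: string): void {
    const delta = this.pending.get(operationId);
    if (delta === undefined) return; // unknown or already settled
    this.pending.delete(operationId);
    this.confirmed += delta;
  }

  /** Failure or timeout: undo only this operation. */
  rollback(operationId: string): void {
    this.pending.delete(operationId);
  }
}

const cart = new OptimisticCart();
cart.addOptimistic('op-1');
cart.addOptimistic('op-2');
cart.confirm('op-1');  // server confirmed the first add
cart.rollback('op-2'); // second add timed out
console.log(cart.displayCount); // 1
```

Carrying the operation id in the event payload would let a success handler confirm the matching operation, instead of clearing whichever timeout happens to be active.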
Edge Case: Chaos During Tests
Should you enable Chaos Monkey in automated tests?
For unit tests: ❌ No. Tests should be deterministic.
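Where a test does need chaos-like randomness, one way to keep it deterministic is to swap `Math.random` for a seeded generator while the test runs. This is a sketch under two assumptions: the chaos layer draws its randomness from `Math.random` (as the `emit` listing earlier suggests), and the helper names `seededRandom` and `withSeededRandom` are made up for illustration.

```typescript
// Hypothetical test helpers, not part of Nexus or vitest.

/** Small seeded generator (mulberry32-style); returns numbers in [0, 1). */
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Runs `fn` with Math.random replaced by a seeded generator, then restores it. */
function withSeededRandom<T>(seed: number, fn: () => T): T {
  const original = Math.random;
  Math.random = seededRandom(seed);
  try {
    return fn();
  } finally {
    Math.random = original;
  }
}

const first = withSeededRandom(42, () => [Math.random(), Math.random()]);
const second = withSeededRandom(42, () => [Math.random(), Math.random()]);
console.log(first[0] === second[0] && first[1] === second[1]); // true
```

The swap only lasts for the synchronous part of `fn`, so random draws made later (for example inside timers) are not seeded.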
For integration tests: ✅ Yes. Create chaos-specific test suites:
import { describe, it, expect } from 'vitest';
import { Nexus } from '@caeligo/nexus-orchestrator';
describe('Chat with Chaos', () => {
it('should handle dropped messages', async () => {
const bus = new Nexus({
chaos: { dropRate: 0.5, maxDelay: 100 }
});
let receivedCount = 0;
bus.on('message', () => receivedCount++);
// Send 100 messages
for (let i = 0; i < 100; i++) {
bus.emit('message', { id: i });
}
await new Promise(resolve => setTimeout(resolve, 500));
// With 50% drop rate, expect ~50 messages
expect(receivedCount).toBeGreaterThan(30);
expect(receivedCount).toBeLessThan(70);
});
});
Edge Case: Selective Chaos
Different chaos levels for different event types:
// Custom chaos wrapper
function emitWithChaos(event: string, payload: any, chaosLevel: number) {
const shouldDrop = Math.random() < chaosLevel;
if (shouldDrop) {
console.warn(`[Custom Chaos] Dropped ${event}`);
return;
}
bus.emit(event, payload);
}
// Usage
emitWithChaos('analytics:track', data, 0.3); // 30% chaos
emitWithChaos('ui:update', data, 0.05); // 5% chaos
Pattern: Chaos Dashboard
Visualize chaos impact in development:
let chaosStats = {
dropped: 0,
delayed: 0,
total: 0
};
// Override emit to track chaos
const originalEmit = bus.emit.bind(bus);
bus.emit = function(event, payload, options) {
chaosStats.total++;
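// Estimate drops (simplified): this is a separate random roll from the one inside emit,
// so the count approximates the real drops and ignores the exclude list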
if (this.config.chaos && Math.random() < this.config.chaos.dropRate) {
chaosStats.dropped++;
}
return originalEmit(event, payload, options);
};
// Dashboard UI
function renderChaosDashboard() {
// Avoid dividing by zero before the first event
const dropRate = chaosStats.total > 0 ? (chaosStats.dropped / chaosStats.total * 100).toFixed(1) : '0.0';
return `
<div class="chaos-dashboard">
<h3>🐵 Chaos Monkey Stats</h3>
<p>Total Events: ${chaosStats.total}</p>
<p>Dropped: ${chaosStats.dropped} (${dropRate}%)</p>
<p>Delayed: ${chaosStats.delayed}</p>
</div>
`;
}
// Update every second
setInterval(() => {
const container = document.getElementById('chaos-stats');
if (container) container.innerHTML = renderChaosDashboard();
}, 1000);
Pattern: Graduated Chaos
Start with mild chaos, increase over time:
let chaosIntensity = 0.05; // Start at 5%
const bus = new Nexus({
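// Assumption: the bus reads config.chaos on every emit. If it copies the config once at construction, this getter has no effect
get chaos() {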
return {
dropRate: chaosIntensity,
maxDelay: chaosIntensity * 10000 // 5% = 500ms max
};
}
});
// Increase chaos every minute
setInterval(() => {
chaosIntensity = Math.min(chaosIntensity + 0.05, 0.5); // Max 50%
console.log(`Chaos intensity: ${(chaosIntensity * 100).toFixed(0)}%`);
}, 60000);
Pattern: Chaos Scenarios
Test specific failure modes:
// Scenario 1: Total Network Outage
function simulateNetworkOutage(duration: number) {
const originalChaos = bus.config.chaos;
bus.config.chaos = {
dropRate: 1.0, // Drop everything
maxDelay: 0
};
setTimeout(() => {
bus.config.chaos = originalChaos;
console.log('Network restored');
}, duration);
}
// Scenario 2: High Latency
function simulateHighLatency() {
bus.config.chaos = {
dropRate: 0,
maxDelay: 10000 // 10 second delays!
};
}
// Scenario 3: Packet Loss Storm
function simulatePacketLoss() {
bus.config.chaos = {
dropRate: 0.8, // 80% loss
maxDelay: 500
};
}
// Test button in UI
document.getElementById('test-outage')?.addEventListener('click', () => {
simulateNetworkOutage(5000);
});
Comparison with Other Testing Approaches
| Approach | Realism | Effort | Coverage |
|---|---|---|---|
| Manual Testing | Low | High | Spotty |
| Unit Tests | Low | Medium | Good |
| Integration Tests | Medium | High | Good |
| Chaos Monkey | High | Low | Excellent |
Safety Checklist
Before enabling Chaos Monkey:
- [ ] Only enable in development and staging
- [ ] Never enable in production (obvious, but worth stating)
- [ ] Exclude critical events (auth, payments, security)
- [ ] Warn team members (console message on init)
- [ ] Document chaos behavior in README
- [ ] Use environment variables to control chaos
const bus = new Nexus({
chaos: process.env.NODE_ENV === 'development' ? {
dropRate: 0.1,
maxDelay: 2000,
exclude: ['auth:*', 'payment:*']
} : undefined // Disabled everywhere except development, including production
});
Debugging Chaos Issues
When chaos reveals a bug:
- Check console: Chaos logs dropped events
- Disable chaos: Verify bug is chaos-related
- Add timeouts: Detect missing events
- Add retries: Use resilience features
- Add fallbacks: Graceful degradation
// Example: Debugging dropped event
bus.on('api:fetch', async (payload) => {
console.log('[Handler] api:fetch called');
const result = await fetch('/api/data');
console.log('[Handler] api:fetch completed');
bus.emit('api:success', result);
});
// If you see "called" but not "completed" → the fetch rejected (handler crashed) or is still pending
// If you see neither → event was dropped by chaos
Next Steps
Now that your app is chaos-tested, explore intelligent features:
- AI Prediction - Learn patterns and prevent failures
- Teleportation - Chaos works across network boundaries
- Pipes - Add chaos-resistant data transformation
Quick Reference
Enable Chaos:
const bus = new Nexus({
chaos: {
dropRate: 0.1, // 0.0 to 1.0 (0.1 = drop about 1 in 10 events)
maxDelay: 2000, // Milliseconds
exclude: ['critical:*'] // Event patterns to never chaos-test
}
});
Disable Chaos:
const bus = new Nexus(); // No chaos config = disabled
Conditional Chaos:
const bus = new Nexus({
// Compare explicitly: any non-empty string, including 'false', is truthy
chaos: process.env.ENABLE_CHAOS === 'true' ? { dropRate: 0.1, maxDelay: 1000 } : undefined
});
Environment-based:
const bus = new Nexus({
chaos: {
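dropRate: parseFloat(process.env.CHAOS_DROP_RATE || '0') || 0, // Falls back to 0 if unset or not a number
maxDelay: parseInt(process.env.CHAOS_MAX_DELAY || '0', 10) || 0 // Radix 10; falls back to 0 on invalid input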
}
});
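Environment values are strings and may be missing or malformed, so it can help to validate them before building the `chaos` object. A minimal sketch, assuming the same `CHAOS_DROP_RATE` and `CHAOS_MAX_DELAY` variables as above; `chaosFromEnv` and the `ChaosConfig` interface are hypothetical, not Nexus exports. It clamps values into range and returns `undefined` (chaos disabled) when there is nothing to inject.

```typescript
// Hypothetical helper, not part of Nexus: build a chaos config from environment strings.
interface ChaosConfig {
  dropRate: number; // 0.0 to 1.0
  maxDelay: number; // milliseconds
}

function chaosFromEnv(env: Record<string, string | undefined>): ChaosConfig | undefined {
  const dropRate = Number.parseFloat(env.CHAOS_DROP_RATE ?? '');
  const maxDelay = Number.parseInt(env.CHAOS_MAX_DELAY ?? '', 10);

  // Invalid or missing values count as 0; valid ones are clamped into range
  const validRate = Number.isFinite(dropRate) ? Math.min(Math.max(dropRate, 0), 1) : 0;
  const validDelay = Number.isFinite(maxDelay) ? Math.max(maxDelay, 0) : 0;

  // Nothing to inject: treat as "chaos disabled"
  if (validRate === 0 && validDelay === 0) return undefined;
  return { dropRate: validRate, maxDelay: validDelay };
}

console.log(chaosFromEnv({ CHAOS_DROP_RATE: '0.2', CHAOS_MAX_DELAY: '500' }));
// { dropRate: 0.2, maxDelay: 500 }
console.log(chaosFromEnv({ CHAOS_DROP_RATE: 'abc' })); // undefined
console.log(chaosFromEnv({ CHAOS_DROP_RATE: '7' }));   // { dropRate: 1, maxDelay: 0 }
```

With a helper like this, the constructor call could become `new Nexus({ chaos: chaosFromEnv(process.env) })`, so a typo in an environment variable disables chaos instead of passing `NaN` into the config.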