Building an Incident Response Playbook
When a security incident happens, every second counts. Having a well-documented incident response playbook is the difference between contained incidents and catastrophic breaches. I’ve developed and executed incident response playbooks for organizations handling everything from ransomware attacks to data breaches, and today I’m sharing how to build yours.
Why You Need an Incident Response Playbook
Without a playbook:
- Panic and confusion during incidents
- Inconsistent response procedures
- Critical steps forgotten
- Delayed response times
- Higher breach costs
With a playbook:
- Calm, coordinated response
- Standardized procedures
- Nothing overlooked
- Faster containment
- Lower overall costs
Statistics (widely cited industry figures; exact values vary by report and year):
- Organizations with IR plans save $2.66M on average per breach
- Mean time to identify a breach: 207 days
- Mean time to contain a breach: 73 days
- Average cost of a data breach: $4.45M
Let’s build a playbook that reduces these numbers.
The Incident Response Lifecycle
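NIST SP 800-61 Rev. 2 defines a four-phase lifecycle (Preparation; Detection & Analysis; Containment, Eradication & Recovery; Post-Incident Activity). This playbook splits the third phase into its three parts, which gives six phases: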
1. Preparation
2. Detection & Analysis
3. Containment
4. Eradication
5. Recovery
6. Post-Incident Activity
We’ll create runbooks for each phase.
Phase 1: Preparation
Incident Response Team
```yaml
# incident-response-team.yml
roles:
  incident_commander:
    name: "Sarah Johnson"
    title: "CISO"
    phone: "+1-555-0101"
    email: "sarah.johnson@company.com"
    backup: "Mike Chen"
    responsibilities:
      - Lead incident response
      - Coordinate team activities
      - Communicate with executives
      - Make critical decisions
  technical_lead:
    name: "Mike Chen"
    title: "Security Engineer"
    phone: "+1-555-0102"
    email: "mike.chen@company.com"
    backup: "Alex Rivera"
    responsibilities:
      - Technical investigation
      - Containment actions
      - System recovery
      - Root cause analysis
  communications_lead:
    name: "Emma Davis"
    title: "Head of Communications"
    phone: "+1-555-0103"
    email: "emma.davis@company.com"
    responsibilities:
      - Internal communications
      - External communications
      - Customer notifications
      - Media relations
  legal_counsel:
    name: "Robert Martinez"
    title: "General Counsel"
    phone: "+1-555-0104"
    email: "robert.martinez@company.com"
    responsibilities:
      - Legal compliance
      - Regulatory notifications
      - Law enforcement coordination
      - Liability assessment
escalation:
  level_1: # Minor incident
    - technical_lead
    - incident_commander
  level_2: # Major incident
    - technical_lead
    - incident_commander
    - communications_lead
  level_3: # Critical incident (data breach)
    - All hands on deck
    - Executive team notification
    - Legal counsel
    - External consultants
contact_list:
  emergency_hotline: "+1-555-INCIDENT"
  security_email: "security@company.com"
  war_room_zoom: "https://zoom.us/j/incident-response"
  slack_channel: "#incident-response"
```
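To show how the roster and escalation levels could drive paging, here is a minimal sketch in Python. It assumes the roster has been loaded into plain dictionaries (hard-coded here for illustration), and it reduces level 3 to the four named roles. The function name `people_to_page` and the `unavailable` parameter are hypothetical, not part of any paging tool.

```python
# Hypothetical in-memory copy of the roster above. In practice this would be
# loaded from incident-response-team.yml with a YAML parser.
ROLES = {
    "incident_commander": {"name": "Sarah Johnson", "backup": "Mike Chen"},
    "technical_lead": {"name": "Mike Chen", "backup": "Alex Rivera"},
    "communications_lead": {"name": "Emma Davis", "backup": None},
    "legal_counsel": {"name": "Robert Martinez", "backup": None},
}

# Assumption: level 3 is simplified to the roles defined in the roster.
ESCALATION = {
    1: ["technical_lead", "incident_commander"],
    2: ["technical_lead", "incident_commander", "communications_lead"],
    3: ["technical_lead", "incident_commander", "communications_lead", "legal_counsel"],
}


def people_to_page(level, unavailable=frozenset()):
    """Return (names to page, roles nobody can cover) for an escalation level."""
    names, unfilled = [], []
    for role in ESCALATION[level]:
        entry = ROLES[role]
        person = entry["name"]
        if person in unavailable:
            person = entry["backup"]  # fall back to the listed backup, if any
        if person is None or person in unavailable:
            unfilled.append(role)
        elif person not in names:
            names.append(person)
    return names, unfilled


if __name__ == "__main__":
    print(people_to_page(3, unavailable={"Emma Davis"}))
```

Running a check like this against the roster surfaces gaps before an incident does: two roles have no backup, and the incident commander's backup is also the technical lead, so one absence leaves a single person covering both jobs.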
Communication Templates
# Initial Incident Alert Template
**SUBJECT**: [SEVERITY] Security Incident Detected - [BRIEF DESCRIPTION]
**Incident ID**: INC-2025-1001
**Detected**: 2025-10-08 14:23:00 UTC
**Severity**: HIGH
**Status**: INVESTIGATING
## Summary
Brief description of the incident (2-3 sentences).
## Impact
- Systems affected: [List]
- Services impacted: [List]
- Customer impact: [Yes/No/Unknown]
## Actions Taken
1. Incident response team assembled
2. Preliminary investigation started
3. Affected systems isolated (if applicable)
## Next Steps
1. Complete initial assessment by [TIME]
2. Status update at [TIME]
3. [Other actions]
## Response Team
- Incident Commander: [Name]
- Technical Lead: [Name]
- On-call: [Names]
## Communication
- War room: [Link]
- Status updates: Every [X] hours
- Next update: [TIME]
---
This is an internal notification. Do NOT share externally.
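Filling the template by hand under pressure invites mistakes, so the subject line and header can be generated. The sketch below uses Python's standard `string.Template`; the field names (`severity`, `description`, and so on) and the `render_alert` helper are assumptions made for illustration.

```python
from string import Template

# Subject line and header block from the alert template, with $placeholders.
ALERT_SUBJECT = Template("[$severity] Security Incident Detected - $description")
ALERT_HEADER = Template(
    "**Incident ID**: $incident_id\n"
    "**Detected**: $detected\n"
    "**Severity**: $severity\n"
    "**Status**: $status\n"
)


def render_alert(fields):
    """Fill the subject and header. Raises KeyError if a field is missing."""
    return ALERT_SUBJECT.substitute(fields), ALERT_HEADER.substitute(fields)


if __name__ == "__main__":
    subject, header = render_alert({
        "incident_id": "INC-2025-1001",
        "detected": "2025-10-08 14:23:00 UTC",
        "severity": "HIGH",
        "status": "INVESTIGATING",
        "description": "Admin credentials compromised",
    })
    print(subject)
    print(header)
```

`substitute` is used instead of `safe_substitute` on purpose: a missing field fails loudly rather than sending an alert with a raw placeholder in it.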
Incident Severity Classification
```javascript
// incident-severity.js
const SEVERITY_LEVELS = {
  CRITICAL: {
    level: 1,
    sla_response: '15 minutes',
    sla_update: '1 hour',
    criteria: [
      'Active data breach',
      'Ransomware encryption in progress',
      'Complete service outage',
      'Customer data exposed publicly',
      'Ongoing financial fraud',
    ],
    notification: ['All executives', 'Board of directors', 'All IR team'],
    example: 'Production database containing customer PII publicly accessible',
  },
  HIGH: {
    level: 2,
    sla_response: '30 minutes',
    sla_update: '2 hours',
    criteria: [
      'Suspected data breach',
      'Malware detected on multiple systems',
      'Successful phishing attack',
      'Major service degradation',
      'Unauthorized access detected',
    ],
    notification: ['IR team', 'Department heads', 'CISO'],
    example: 'Admin credentials compromised, unauthorized access to internal systems',
  },
  MEDIUM: {
    level: 3,
    sla_response: '2 hours',
    sla_update: '8 hours',
    criteria: [
      'Malware detected on single system',
      'Suspicious network activity',
      'Failed attack attempts',
      'Minor service disruption',
      'Policy violations',
    ],
    notification: ['IR team', 'Security team'],
    example: 'Malware detected on employee workstation, isolated successfully',
  },
  LOW: {
    level: 4,
    sla_response: '8 hours',
    sla_update: '24 hours',
    criteria: [
      'Security alerts requiring investigation',
      'Potential vulnerabilities',
      'Minor anomalies',
    ],
    notification: ['Security team'],
    example: 'Unusual login pattern detected, investigating',
  },
};

function classifyIncident(indicators) {
  // Automatic severity classification: the first matching rule wins
  if (indicators.data_breach || indicators.ransomware) {
    return SEVERITY_LEVELS.CRITICAL;
  }
  if (indicators.unauthorized_access || indicators.malware_spread) {
    return SEVERITY_LEVELS.HIGH;
  }
  if (indicators.malware_isolated) {
    return SEVERITY_LEVELS.MEDIUM;
  }
  return SEVERITY_LEVELS.LOW;
}
```
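The SLA strings in the severity table are easier to enforce once they become deadlines. As a rough sketch, the Python below turns the response and update SLAs into timestamps; the `SLA` mapping copies the values above, and `sla_deadlines` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# (time to first response, time to first status update) per severity,
# copied from the severity table above.
SLA = {
    "CRITICAL": (timedelta(minutes=15), timedelta(hours=1)),
    "HIGH": (timedelta(minutes=30), timedelta(hours=2)),
    "MEDIUM": (timedelta(hours=2), timedelta(hours=8)),
    "LOW": (timedelta(hours=8), timedelta(hours=24)),
}


def sla_deadlines(severity, detected_at):
    """Return the respond-by and first-update-by times for an incident."""
    response, update = SLA[severity]
    return {
        "respond_by": detected_at + response,
        "first_update_by": detected_at + update,
    }


if __name__ == "__main__":
    detected = datetime(2025, 10, 8, 14, 23, tzinfo=timezone.utc)
    for name, when in sla_deadlines("HIGH", detected).items():
        print(name, when.isoformat())
```

Using timezone-aware datetimes keeps the deadlines unambiguous when responders sit in different regions.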
Phase 2: Detection & Analysis
Detection Runbook
# Incident Detection Runbook
## Automated Detection Sources
### 1. SIEM Alerts
**System**: Splunk
**Alert Types**:
- Multiple failed login attempts
- Privilege escalation
- Unusual data transfers
- Malware signatures
- Network anomalies
**Action**:
1. Review alert in Splunk
2. Verify not a false positive
3. Create incident ticket if confirmed
4. Escalate based on severity
### 2. EDR (Endpoint Detection & Response)
**System**: CrowdStrike Falcon
**Alert Types**:
- Malware execution
- Ransomware behavior
- Suspicious process activity
- Unauthorized software
**Action**:
1. Review detection in Falcon console
2. Check process tree and IOCs
3. Isolate endpoint if active threat
4. Collect forensic data
5. Create incident ticket
### 3. Network IDS/IPS
**System**: Suricata
**Alert Types**:
- C2 communication
- Port scanning
- Data exfiltration
- Known attack patterns
**Action**:
1. Review PCAP in Wireshark
2. Identify source/destination
3. Block malicious IPs at firewall
4. Isolate affected systems
5. Create incident ticket
### 4. Cloud Security
**System**: AWS GuardDuty
**Alert Types**:
- Unusual API calls
- Instance compromise
- IAM anomalies
- Cryptocurrency mining
**Action**:
1. Review finding details
2. Check CloudTrail logs
3. Snapshot affected instances
4. Revoke compromised credentials
5. Create incident ticket
## Manual Detection Sources
### User Reports
- Phishing emails
- Suspicious activity
- System anomalies
- Data concerns
### Threat Intelligence
- CVE announcements
- Threat actor TTPs
- IOC feeds
- Security advisories
## Initial Analysis Checklist
- [ ] Collect all available data
- [ ] Identify affected systems
- [ ] Determine attack vector
- [ ] Assess scope and impact
- [ ] Classify severity
- [ ] Preserve evidence
- [ ] Document timeline
- [ ] Notify IR team
Analysis Automation
```python
# incident-analysis.py
from datetime import timedelta


class IncidentAnalyzer:
    """Enrich a SIEM alert with context and open an incident ticket.

    The siem, edr, and threat_intel arguments are placeholders for your own
    integration clients. The methods called on them below are illustrative,
    not the API of any specific product.
    """

    def __init__(self, siem, edr, threat_intel):
        self.siem = siem
        self.edr = edr
        self.threat_intel = threat_intel

    def analyze_alert(self, alert_id):
        """Automated incident analysis"""
        # 1. Gather context (alert['timestamp'] is assumed to be a datetime)
        alert = self.siem.get_alert(alert_id)
        context = {
            'alert_id': alert_id,
            'timestamp': alert['timestamp'],
            'source_ip': alert['source_ip'],
            'destination_ip': alert['dest_ip'],
            'user': alert['user'],
            'hostname': alert['hostname'],
        }

        # 2. Check threat intelligence
        threat_data = self.threat_intel.lookup_ip(context['source_ip'])
        if threat_data['malicious']:
            context['threat_intel'] = {
                'reputation': threat_data['reputation'],
                'categories': threat_data['categories'],
                'last_seen': threat_data['last_seen'],
            }

        # 3. Query related events (24 hours before to 1 hour after the alert)
        timeframe_start = alert['timestamp'] - timedelta(hours=24)
        timeframe_end = alert['timestamp'] + timedelta(hours=1)
        related_events = self.siem.query({
            'query': f'source_ip:{context["source_ip"]} OR dest_ip:{context["source_ip"]}',
            'start_time': timeframe_start,
            'end_time': timeframe_end,
        })
        context['related_events'] = len(related_events)
        context['event_timeline'] = self._build_timeline(related_events)

        # 4. Check for lateral movement
        lateral_movement = self._detect_lateral_movement(
            context['hostname'],
            timeframe_start,
            timeframe_end
        )
        context['lateral_movement'] = lateral_movement

        # 5. Check endpoint status
        endpoint_status = self.edr.get_device_status(context['hostname'])
        context['endpoint'] = {
            'online': endpoint_status['online'],
            'last_seen': endpoint_status['last_seen'],
            'malware_detected': endpoint_status['malware_count'] > 0,
            'isolation_status': endpoint_status['network_isolated'],
        }

        # 6. Assess severity
        severity = self._calculate_severity(context)
        context['severity'] = severity

        # 7. Generate recommendations
        context['recommendations'] = self._generate_recommendations(context)

        # 8. Create incident ticket
        incident_id = self._create_incident(context)
        context['incident_id'] = incident_id
        return context

    def _calculate_severity(self, context):
        """Calculate incident severity based on indicators (maximum score: 120)"""
        score = 0
        # Threat intelligence match
        if context.get('threat_intel'):
            score += 30
        # Lateral movement detected
        if context.get('lateral_movement'):
            score += 40
        # Malware detected
        if context['endpoint']['malware_detected']:
            score += 30
        # Multiple related events
        if context['related_events'] > 10:
            score += 20

        if score >= 70:
            return 'CRITICAL'
        elif score >= 40:
            return 'HIGH'
        elif score >= 20:
            return 'MEDIUM'
        else:
            return 'LOW'

    def _generate_recommendations(self, context):
        """Generate automated response recommendations"""
        recommendations = []
        if context['endpoint']['malware_detected']:
            recommendations.append({
                'action': 'isolate_endpoint',
                'priority': 'IMMEDIATE',
                # Placeholder command; substitute your EDR's containment call
                'command': f'falcon isolate {context["hostname"]}',
            })
        if context.get('threat_intel'):
            recommendations.append({
                'action': 'block_ip',
                'priority': 'IMMEDIATE',
                # Placeholder command; substitute your firewall's CLI or API
                'command': f'firewall block {context["source_ip"]}',
            })
        if context.get('lateral_movement'):
            recommendations.append({
                'action': 'reset_credentials',
                'priority': 'HIGH',
                'details': f'Reset credentials for user {context["user"]}',
            })
        return recommendations

    def _build_timeline(self, events):
        """Order related events chronologically (assumes each has a 'timestamp')"""
        return sorted(events, key=lambda event: event['timestamp'])

    def _detect_lateral_movement(self, hostname, start_time, end_time):
        """Environment-specific: query your SIEM for host-to-host logins"""
        raise NotImplementedError('Implement for your SIEM')

    def _create_incident(self, context):
        """Environment-specific: open a ticket and return its ID"""
        raise NotImplementedError('Implement for your ticketing system')
```
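The analyzer relies on a timeline builder and a lateral-movement check that depend on your event schema. The sketch below is one possible shape for both, written as standalone functions over already-fetched events. The event fields (`source_host`, `dest_host`, `action`) and the `min_targets` threshold are assumptions for illustration, not a standard schema.

```python
from datetime import datetime


def build_timeline(events):
    """Return (timestamp, description) tuples in chronological order."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    return [
        (e["timestamp"], f'{e["source_host"]} -> {e["dest_host"]}: {e["action"]}')
        for e in ordered
    ]


def detect_lateral_movement(events, hostname, min_targets=3):
    """Heuristic: the host logged in successfully to several other hosts."""
    targets = {
        e["dest_host"]
        for e in events
        if e["source_host"] == hostname
        and e["action"] == "login_success"
        and e["dest_host"] != hostname
    }
    return len(targets) >= min_targets


if __name__ == "__main__":
    sample = [
        {"timestamp": datetime(2025, 10, 8, 15, 10), "source_host": "app-prod-01",
         "dest_host": "db-prod-01", "action": "login_success"},
        {"timestamp": datetime(2025, 10, 8, 14, 23), "source_host": "laptop-7",
         "dest_host": "app-prod-01", "action": "login_failure"},
    ]
    for when, what in build_timeline(sample):
        print(when.isoformat(), what)
    print(detect_lateral_movement(sample, "app-prod-01", min_targets=1))
```

A fan-out count is a crude signal. Jump hosts and monitoring servers will trip it, so an allow-list of expected sources is usually needed before it feeds a severity score.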
Phase 3: Containment
Containment Runbook
# Containment Runbook
## Objective
Stop the incident from spreading while preserving evidence.
## Short-Term Containment
### Network Isolation
**Isolate Compromised System**:
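```bash
# Using CrowdStrike Falcon (placeholder command; use your EDR console or API)
falcon contain <hostname>
# OR using a host firewall (run on each host that must not talk to the
# compromised system; on a routing firewall, use the FORWARD chain instead)
iptables -A INPUT -s <compromised_ip> -j DROP
iptables -A OUTPUT -d <compromised_ip> -j DROP
# Verify isolation
ping -c 3 <compromised_ip> # Should fail
```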
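**Block Malicious IPs**:
```bash
# Using pfSense firewall (adds the address to a pf table referenced by a block rule)
pfctl -t malicious_ips -T add <ip_address>
# Using AWS: security groups are allow-only and cannot deny an address,
# so add a deny entry to the subnet's network ACL. The rule number must be
# lower than the allow rules it should override.
aws ec2 create-network-acl-entry \
  --network-acl-id acl-xxxxxxxx \
  --ingress \
  --rule-number 90 \
  --protocol -1 \
  --cidr-block <ip_address>/32 \
  --rule-action deny
```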
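### Account Lockout
**Disable Compromised Account**:
```bash
# Active Directory (PowerShell)
Disable-ADAccount -Identity <username>
# AWS IAM has no single "disable user" switch: remove console access...
aws iam delete-login-profile --user-name <username>
# ...and deactivate each access key (inactive keys remain available for forensics)
aws iam update-access-key \
  --user-name <username> \
  --access-key-id <key> \
  --status Inactive
# Alternatively, keep console access but force a password reset
aws iam update-login-profile \
  --user-name <username> \
  --password-reset-required
```
**Revoke Active Sessions**:
```bash
# AWS
aws iam delete-access-key --user-name <username> --access-key-id <key>
# Kubernetes
kubectl delete secret <service-account-token>
# Database (MySQL syntax; run in a SQL client)
REVOKE ALL PRIVILEGES ON *.* FROM 'username'@'host';
```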
### Snapshot Evidence
**Preserve System State**:
```bash
# Create disk snapshot (AWS)
aws ec2 create-snapshot \
  --volume-id vol-xxxxxxxx \
  --description "Forensic snapshot - INC-2025-1001"
# Create memory dump. Volatility analyzes memory images but does not capture
# them, so acquire the image with a dedicated tool such as AVML or LiME.
sudo ./avml memory-dump-$(date +%Y%m%d-%H%M%S).lime
# Collect logs
tar -czf logs-$(hostname)-$(date +%Y%m%d).tar.gz /var/log/
```
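Evidence is only useful later if you can show it has not changed since collection. One way to do that is to record a SHA-256 hash for every artifact at collection time. The Python sketch below builds such a manifest and re-checks it; the manifest layout and the function names are illustrative assumptions, not a forensic standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk or memory images fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(incident_id, paths, collected_by):
    """Record who collected which artifacts, when, and their hashes."""
    return {
        "incident_id": incident_id,
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "artifacts": [
            {
                "file": str(path),
                "size_bytes": Path(path).stat().st_size,
                "sha256": sha256_file(path),
            }
            for path in paths
        ],
    }


def verify_manifest(manifest):
    """Return the files whose current hash differs from the recorded one."""
    return [
        item["file"]
        for item in manifest["artifacts"]
        if sha256_file(item["file"]) != item["sha256"]
    ]
```

Store the manifest separately from the evidence itself (ideally on write-once storage), otherwise anyone who can alter an artifact can alter its recorded hash too.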
## Long-Term Containment
### Implement Workarounds
**Example: Patch not available**:
- Deploy WAF rules to block exploit attempts
- Implement network segmentation
- Add enhanced monitoring
- Disable vulnerable features
- Document all changes
### Strengthen Defenses
**Post-Containment Hardening**:
```yaml
hardening_checklist:
  network:
    - [ ] Review firewall rules
    - [ ] Update IDS/IPS signatures
    - [ ] Enable additional logging
    - [ ] Implement network segmentation
  authentication:
    - [ ] Force password resets
    - [ ] Enable MFA for all admin accounts
    - [ ] Review access privileges
    - [ ] Audit service accounts
  systems:
    - [ ] Apply security patches
    - [ ] Update antivirus signatures
    - [ ] Harden system configurations
    - [ ] Remove unnecessary services
  monitoring:
    - [ ] Add detection rules
    - [ ] Increase log retention
    - [ ] Set up anomaly detection
    - [ ] Create dashboards
```
Containment Decision Matrix
| Scenario | Immediate Action | Secondary Action |
|---|---|---|
| Ransomware detected | Isolate all affected systems | Disable admin accounts |
| Data exfiltration | Block external IPs | Snapshot systems |
| Malware outbreak | Isolate patient zero | Update AV signatures |
| Compromised admin | Disable account | Rotate all credentials |
| DDoS attack | Enable DDoS protection | Contact ISP |
| SQL injection | WAF block pattern | Patch vulnerability |
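The matrix above can also live in code so that tooling and responders read from the same source. A minimal sketch, assuming short scenario keys of my own choosing (`ransomware`, `ddos`, and so on) and a hypothetical `containment_actions` helper:

```python
# Scenario key -> (immediate action, secondary action), copied from the matrix.
CONTAINMENT_MATRIX = {
    "ransomware": ("Isolate all affected systems", "Disable admin accounts"),
    "data_exfiltration": ("Block external IPs", "Snapshot systems"),
    "malware_outbreak": ("Isolate patient zero", "Update AV signatures"),
    "compromised_admin": ("Disable account", "Rotate all credentials"),
    "ddos": ("Enable DDoS protection", "Contact ISP"),
    "sql_injection": ("WAF block pattern", "Patch vulnerability"),
}


def containment_actions(scenario):
    """Return [immediate, secondary] actions, or raise for unknown scenarios."""
    try:
        immediate, secondary = CONTAINMENT_MATRIX[scenario]
    except KeyError:
        raise ValueError(
            f"No containment entry for {scenario!r}; escalate to the incident commander"
        ) from None
    return [immediate, secondary]


if __name__ == "__main__":
    for step, action in enumerate(containment_actions("ransomware"), start=1):
        print(step, action)
```

Raising on an unknown scenario is deliberate: a scenario the matrix does not cover should go to a person, not fall through to a default action.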
## Phase 4: Eradication
### Eradication Runbook
```bash
#!/bin/bash
# eradicate-malware.sh
# Usage: ./eradicate-malware.sh <hostname> <incident-id>
# Process, file, and service names below are placeholders; replace them
# with the indicators found during analysis.
set -euo pipefail

if [[ $# -ne 2 ]]; then
  echo "Usage: $0 <hostname> <incident-id>" >&2
  exit 1
fi
TARGET_HOST=$1 # not HOSTNAME, which bash already sets to the local host name
INCIDENT_ID=$2

echo "[$(date)] Starting eradication for $TARGET_HOST (Incident: $INCIDENT_ID)"

# 1. Verify system is isolated
# Assumes containment blocks ICMP but still allows SSH from this management host.
if ping -c 1 "$TARGET_HOST" > /dev/null 2>&1; then
  echo "ERROR: System is not isolated. Aborting." >&2
  exit 1
fi

# 2. Kill malicious processes (pkill exits non-zero when nothing matches)
echo "Terminating malicious processes..."
ssh "$TARGET_HOST" "pkill -9 -f 'malware_process_name' || true"

# 3. Remove malware files
echo "Removing malware files..."
ssh "$TARGET_HOST" "
  rm -f /tmp/.hidden_malware
  rm -f /var/tmp/malicious_script.sh
  find /home -name '.crypted' -delete
"

# 4. Remove persistence mechanisms
echo "Removing persistence..."
ssh "$TARGET_HOST" "
  # Back up, then remove cron jobs (crontab -r deletes ALL of the user's entries)
  crontab -l > /tmp/crontab-$INCIDENT_ID.bak 2>/dev/null || true
  crontab -r || true
  # Remove systemd services
  systemctl disable malicious.service || true
  rm -f /etc/systemd/system/malicious.service
  # Remove startup scripts (only if the attacker created or modified this file)
  rm -f /etc/rc.local
"

# 5. Clean registry (if Windows; OS must be exported by the caller)
if [[ "${OS:-}" == "Windows" ]]; then
  echo "Cleaning Windows registry..."
  # PowerShell commands to remove registry keys
fi

# 6. Remove backdoors (this step only lists candidates for manual review)
echo "Checking for backdoors..."
ssh "$TARGET_HOST" "
  # Check for unauthorized SSH keys
  find /home -name 'authorized_keys' -exec cat {} \;
  # Check for setuid binaries
  find / -perm -4000 -type f 2>/dev/null
  # Check for listening ports
  ss -tlnp
"

# 7. Rebuild if necessary
read -p "Rebuild system from clean image? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  echo "Rebuilding from clean image..."
  # Trigger rebuild automation
fi

# 8. Verify eradication
echo "Running verification scans..."
ssh "$TARGET_HOST" "
  # Full antivirus scan
  clamscan -r /
  # Rootkit scan
  rkhunter --check
  # Malware scan
  maldet -a /
"

echo "[$(date)] Eradication complete for $TARGET_HOST"
```
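The backdoor check in the script prints every `authorized_keys` file and leaves the comparison to a human. A small Python sketch can flag entries that are not on an approved list. It assumes you maintain a set of approved base64 key blobs (the `allowed_key_blobs` parameter is hypothetical), and it uses a simplified parse of the `authorized_keys` format.

```python
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def unauthorized_keys(authorized_keys_text, allowed_key_blobs):
    """Return authorized_keys lines whose key is not on the approved list.

    Simplified parsing: the key blob is taken to be the field right after the
    key-type field. Lines with no recognizable key type are flagged as well.
    """
    findings = []
    for raw_line in authorized_keys_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        for index, field in enumerate(fields[:-1]):
            if field.startswith(KEY_TYPE_PREFIXES):
                if fields[index + 1] not in allowed_key_blobs:
                    findings.append(line)
                break
        else:
            findings.append(line)  # unparseable entry: review it manually
    return findings


if __name__ == "__main__":
    sample = (
        "# deployed by config management\n"
        "ssh-ed25519 AAAAC3good admin@corp\n"
        'from="10.0.0.5" ssh-rsa AAAAB3evil attacker\n'
    )
    for entry in unauthorized_keys(sample, {"AAAAC3good"}):
        print("REVIEW:", entry)
```

Matching on the key blob rather than the trailing comment matters, because an attacker can set the comment to any value, including a legitimate user's name.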
Phase 5: Recovery
Recovery Runbook
# Recovery Runbook
## Pre-Recovery Checklist
- [ ] Incident fully eradicated (verified)
- [ ] Systems hardened and patched
- [ ] Credentials rotated
- [ ] Monitoring enhanced
- [ ] Backups verified clean
- [ ] Recovery plan approved
## Recovery Steps
### 1. Restore from Backup
**Verify Backup Integrity**:
```bash
# Check backup date (before incident)
aws s3 ls s3://backups/ | grep "$(date -d "7 days ago" +%Y-%m-%d)"
# Verify backup hash
sha256sum backup.tar.gz
# Compare with the hash recorded when the backup was taken
# Test restore in isolated environment (restore_backup.sh is your own restore script)
restore_backup.sh --verify --isolated
```
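**Restore Production Data**:
```bash
# Database restore (pg_restore needs a custom, tar, or directory-format dump;
# plain .sql dumps are restored with psql -f instead)
pg_restore -h localhost -U postgres -d production backup.dump
# File restore
aws s3 sync s3://backups/2025-10-01/ /data/restore/
# Verify data integrity
./verify_data_integrity.sh
```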
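Picking "a backup from before the incident" is easy to get wrong, because the relevant date is the earliest evidence of compromise, not the day the alert fired. The sketch below selects the most recent backup that predates that point. The function name and the list of backup dates are assumptions for illustration.

```python
from datetime import date


def latest_clean_backup(backup_dates, first_compromise):
    """Return the newest backup taken strictly before the first compromise.

    Returns None when every available backup may already contain attacker
    changes, in which case a rebuild from known-good sources is needed.
    """
    candidates = [taken for taken in backup_dates if taken < first_compromise]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    backups = [date(2025, 9, 24), date(2025, 10, 1), date(2025, 10, 8), date(2025, 10, 9)]
    print(latest_clean_backup(backups, first_compromise=date(2025, 10, 8)))
```

The comparison is strict on purpose: a backup taken on the day of compromise cannot be assumed clean without knowing the exact times of both events.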
### 2. Rebuild Compromised Systems
**Infrastructure as Code**:
```hcl
# terraform/production.tf
# Rebuild from clean state
resource "aws_instance" "web_server" {
  ami                    = data.aws_ami.hardened_ami.id # Pre-incident AMI
  instance_type          = "t3.medium"
  vpc_security_group_ids = [aws_security_group.web.id]
  subnet_id              = aws_subnet.public.id
  user_data              = file("user_data.sh")

  tags = {
    Name             = "web-server-rebuilt"
    IncidentRecovery = "INC-2025-1001"
  }
}
```
### 3. Gradual Service Restoration
**Phased Approach**:
```yaml
recovery_phases:
  phase_1:
    duration: "2 hours"
    services:
      - Internal testing environment
    validation:
      - Functionality tests
      - Security scans
      - Performance tests
  phase_2:
    duration: "4 hours"
    services:
      - Limited user access (10% traffic)
    validation:
      - User acceptance testing
      - Monitoring alerts
      - No anomalies detected
  phase_3:
    duration: "8 hours"
    services:
      - Increased user access (50% traffic)
    validation:
      - Continued monitoring
      - Performance metrics normal
      - Security baseline maintained
  phase_4:
    services:
      - Full production restoration
    validation:
      - All systems operational
      - 24hr monitoring period
      - Incident closed
```
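The phased approach amounts to a gate: traffic widens only when every validation for the current phase has passed. A minimal Python sketch of that gate follows. The traffic percentages come from the phases above (with internal testing treated as 0% user traffic), while the function name and the shape of the `validations` dictionary are assumptions.

```python
# (phase name, share of user traffic in percent), from the recovery phases.
RECOVERY_PHASES = [
    ("phase_1", 0),    # internal testing environment only
    ("phase_2", 10),
    ("phase_3", 50),
    ("phase_4", 100),
]


def next_phase(current_index, validations):
    """Return the index of the phase to run next.

    Advance one phase only if every validation passed. Otherwise hold at the
    current phase so exposure does not widen on a failed or missing check.
    """
    if not validations or not all(validations.values()):
        return current_index
    return min(current_index + 1, len(RECOVERY_PHASES) - 1)


if __name__ == "__main__":
    index = next_phase(1, {"user_acceptance": True, "no_anomalies": True})
    name, traffic = RECOVERY_PHASES[index]
    print(f"Proceed to {name} at {traffic}% traffic")
```

This sketch holds on failure. Whether a failed security check should instead roll traffic back to zero is a policy decision worth writing into the runbook before it is needed.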
### 4. Enhanced Monitoring
**Post-Recovery Monitoring**:
```yaml
# alerting-rules.yml
groups:
  - name: post_incident_monitoring
    interval: 1m
    rules:
      - alert: SuspiciousActivityDetected
        expr: rate(failed_login_attempts[5m]) > 5
        for: 5m
        labels:
          severity: high
          incident: INC-2025-1001
        annotations:
          summary: "Suspicious activity detected post-recovery"
      - alert: UnusualNetworkTraffic
        expr: rate(network_bytes_out[5m]) > 1000000000
        for: 5m
        labels:
          severity: high
          incident: INC-2025-1001
        annotations:
          summary: "Unusual network traffic detected"
      - alert: UnauthorizedProcessExecution
        expr: suspicious_process_count > 0
        for: 1m
        labels:
          severity: critical
          incident: INC-2025-1001
        annotations:
          summary: "Unauthorized process detected"
```
## Recovery Validation
- [ ] All services restored and functional
- [ ] No signs of attacker presence
- [ ] Monitoring shows normal patterns
- [ ] User access restored
- [ ] Performance metrics normal
- [ ] Security scans clean
- [ ] 24-hour monitoring period completed
## Phase 6: Post-Incident Activity
### Post-Incident Review Template
# Post-Incident Review - INC-2025-1001
## Incident Summary
**Incident ID**: INC-2025-1001
**Date Detected**: 2025-10-08 14:23 UTC
**Date Resolved**: 2025-10-10 09:15 UTC
**Duration**: 42 hours, 52 minutes
**Severity**: HIGH
## Executive Summary
Brief narrative of what happened, impact, and resolution.
## Timeline
| Time (UTC) | Event | Action Taken |
|------------|-------|--------------|
| 2025-10-08 14:23 | SIEM alert: Multiple failed login attempts | Security team notified |
| 2025-10-08 14:35 | Confirmed: Brute force attack in progress | Blocked source IPs, enabled rate limiting |
| 2025-10-08 15:10 | Detected: Successful admin login from suspicious IP | Disabled compromised account, initiated incident response |
| 2025-10-08 15:30 | Analysis: Attacker accessed customer database | Isolated database server, created forensic snapshot |
| ... | ... | ... |
## Root Cause Analysis
**What happened**: [Detailed description]
**Why it happened**:
- Weak password on admin account (no MFA enforced)
- Missing rate limiting on login endpoint
- Delayed alert notification (20 min)
**How it was detected**: SIEM alert for multiple failed logins
## Impact Assessment
**Systems Affected**:
- Web application server (app-prod-01)
- PostgreSQL database (db-prod-01)
- Admin portal
**Data Exposure**:
- Customer PII: ~10,000 records accessed
- Financial data: None
- Health data: None
**Business Impact**:
- Service downtime: 6 hours
- Customer notification required: Yes
- Regulatory reporting required: Yes (GDPR, within 72 hours)
- Estimated cost: $150,000
## What Went Well
1. Incident detected within 12 minutes
2. Response team assembled quickly
3. Clear communication throughout
4. Evidence preserved properly
5. No data exfiltration confirmed
## What Went Wrong
1. MFA not enforced on admin accounts
2. Weak password policy allowed compromise
3. Missing rate limiting enabled brute force
4. Alert notification delay (20 minutes)
5. Recovery took longer than expected (backup issues)
## Action Items
| # | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| 1 | Enforce MFA on all admin accounts | Security Team | 2025-10-15 | ✅ Complete |
| 2 | Implement rate limiting on login endpoint | Dev Team | 2025-10-20 | In Progress |
| 3 | Update password policy (15 char minimum) | IT Team | 2025-10-15 | ✅ Complete |
| 4 | Reduce alert notification time to <5 min | Security Team | 2025-10-18 | Planned |
| 5 | Test backup restoration procedures | Ops Team | 2025-10-22 | Planned |
| 6 | Add additional monitoring for admin logins | Security Team | 2025-10-17 | In Progress |
| 7 | Security awareness training (phishing) | HR Team | 2025-11-01 | Planned |
## Lessons Learned
1. **Prevention**: MFA would have prevented this incident entirely
2. **Detection**: Faster alert delivery would have closed the 20-minute notification gap, and rate limiting would have slowed or stopped the attack
3. **Response**: Having a playbook helped coordinate response
4. **Recovery**: Need to test backup procedures more frequently
## Recommendations
### Immediate (0-30 days)
- Complete all action items above
- Conduct security awareness training
- Perform penetration test
### Short-term (30-90 days)
- Implement security automation (SOAR)
- Enhance logging and monitoring
- Update incident response playbook
### Long-term (90+ days)
- Implement zero-trust architecture
- Deploy deception technology
- Conduct red team exercise
## Sign-off
**Prepared by**: Mike Chen, Technical Lead
**Reviewed by**: Sarah Johnson, CISO
**Approved by**: John Smith, CTO
**Date**: 2025-10-12
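Durations in a review are easy to miscalculate by hand, especially across midnight. The short Python sketch below derives them from the timestamps in the template. The `elapsed` helper and its output format are illustrative choices.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # matches the "Time (UTC)" column


def elapsed(start, end):
    """Return the time between two timeline entries as 'H hours, M minutes'."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(start, TIMESTAMP_FORMAT)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours, {minutes} minutes"


if __name__ == "__main__":
    print("Duration:", elapsed("2025-10-08 14:23", "2025-10-10 09:15"))
    print("Alert to confirmation:", elapsed("2025-10-08 14:23", "2025-10-08 14:35"))
```

For the sample incident this reproduces the 42 hours, 52 minutes in the summary and the 12 minutes from first alert to confirmation listed under "What Went Well".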
Key Takeaways
- Preparation is everything: build your playbook before you need it
- Practice makes perfect: run tabletop exercises quarterly
- Clear communication: use templates and defined roles
- Document everything: during and after incidents
- Learn and improve: every incident makes you stronger
- Automate when possible: speed matters in incidents
- Test your backups: recovery is only possible with good backups
Resources
- NIST SP 800-61 Rev. 2
- SANS Incident Response Process
- AWS Incident Response Guide
- CIS Incident Response Guide
Conclusion
A well-crafted incident response playbook is your best defense when incidents occur. It provides structure during chaos, ensures consistent responses, and significantly reduces the time to contain and recover from security incidents.
The playbook I’ve shared here is a starting point. Customize it for your environment, practice it regularly, and update it after every incident. Your future self (during an incident) will thank you.
Remember: It’s not a matter of IF you’ll have an incident, but WHEN. Be prepared.
Published: October 8, 2025