Building an Incident Response Playbook

When a security incident happens, every second counts. Having a well-documented incident response playbook is the difference between contained incidents and catastrophic breaches. I’ve developed and executed incident response playbooks for organizations handling everything from ransomware attacks to data breaches, and today I’m sharing how to build yours.

Why You Need an Incident Response Playbook

Without a playbook:

Panic and confusion during incidents
Inconsistent response procedures
Critical steps forgotten
Delayed response times
Higher breach costs

With a playbook:

Calm, coordinated response
Standardized procedures
Nothing overlooked
Faster containment
Lower overall costs

Statistics:

Organizations with IR plans save $2.66M on average per breach
Mean time to identify a breach: 207 days
Mean time to contain a breach: 73 days
Average cost of a data breach: $4.45M

Let’s build a playbook that reduces these numbers.

The Incident Response Lifecycle

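NIST SP 800-61 Rev. 2 defines four phases: Preparation; Detection & Analysis; Containment, Eradication & Recovery; and Post-Incident Activity. This playbook splits the third phase into its three parts, which gives six working phases: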

1. Preparation
2. Detection & Analysis
3. Containment
4. Eradication
5. Recovery
6. Post-Incident Activity

We’ll create runbooks for each phase.

Phase 1: Preparation

Incident Response Team

# incident-response-team.yml
roles:
  incident_commander:
    name: "Sarah Johnson"
    title: "CISO"
    phone: "+1-555-0101"
    email: "sarah.johnson@company.com"
    backup: "Mike Chen"
    responsibilities:
      - Lead incident response
      - Coordinate team activities
      - Communicate with executives
      - Make critical decisions

  technical_lead:
    name: "Mike Chen"
    title: "Security Engineer"
    phone: "+1-555-0102"
    email: "mike.chen@company.com"
    backup: "Alex Rivera"
    responsibilities:
      - Technical investigation
      - Containment actions
      - System recovery
      - Root cause analysis

  communications_lead:
    name: "Emma Davis"
    title: "Head of Communications"
    phone: "+1-555-0103"
    email: "emma.davis@company.com"
    responsibilities:
      - Internal communications
      - External communications
      - Customer notifications
      - Media relations

  legal_counsel:
    name: "Robert Martinez"
    title: "General Counsel"
    phone: "+1-555-0104"
    email: "robert.martinez@company.com"
    responsibilities:
      - Legal compliance
      - Regulatory notifications
      - Law enforcement coordination
      - Liability assessment

escalation:
  level_1: # Minor incident
    - technical_lead
    - incident_commander
  level_2: # Major incident
    - technical_lead
    - incident_commander
    - communications_lead
  level_3: # Critical incident (data breach)
    - technical_lead
    - incident_commander
    - communications_lead
    - legal_counsel
    - executive_team        # notify the full executive team
    - external_consultants

contact_list:
  emergency_hotline: "+1-555-INCIDENT"
  security_email: "security@company.com"
  war_room_zoom: "https://zoom.us/j/incident-response"
  slack_channel: "#incident-response"
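
To make the escalation matrix actionable, a small helper can turn a level into the people to page. Here's a minimal Python sketch, assuming the YAML above has already been loaded into a dictionary. `TEAM` mirrors only part of the file, and `contacts_for_level` is a hypothetical helper name, not part of any tool.

```python
# Minimal sketch: resolve an escalation level to concrete contacts.
# TEAM copies a subset of incident-response-team.yml; the helper is hypothetical.
TEAM = {
    "roles": {
        "incident_commander": {"name": "Sarah Johnson", "phone": "+1-555-0101", "backup": "Mike Chen"},
        "technical_lead": {"name": "Mike Chen", "phone": "+1-555-0102", "backup": "Alex Rivera"},
        "communications_lead": {"name": "Emma Davis", "phone": "+1-555-0103"},
    },
    "escalation": {
        "level_1": ["technical_lead", "incident_commander"],
        "level_2": ["technical_lead", "incident_commander", "communications_lead"],
    },
}


def contacts_for_level(team, level):
    """Return (role, name, phone, backup) tuples for an escalation level."""
    contacts = []
    for role in team["escalation"][level]:
        person = team["roles"][role]
        contacts.append((role, person["name"], person["phone"], person.get("backup")))
    return contacts


for role, name, phone, backup in contacts_for_level(TEAM, "level_2"):
    print(f"{role}: {name} ({phone}), backup: {backup or 'none listed'}")
```

Running it against your real file quickly shows which roles have no backup listed.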

Communication Templates

# Initial Incident Alert Template

**SUBJECT**: [SEVERITY] Security Incident Detected - [BRIEF DESCRIPTION]

**Incident ID**: INC-2025-1001
**Detected**: 2025-10-08 14:23:00 UTC
**Severity**: HIGH
**Status**: INVESTIGATING

## Summary
Brief description of the incident (2-3 sentences).

## Impact
- Systems affected: [List]
- Services impacted: [List]
- Customer impact: [Yes/No/Unknown]

## Actions Taken
1. Incident response team assembled
2. Preliminary investigation started
3. Affected systems isolated (if applicable)

## Next Steps
1. Complete initial assessment by [TIME]
2. Status update at [TIME]
3. [Other actions]

## Response Team
- Incident Commander: [Name]
- Technical Lead: [Name]
- On-call: [Names]

## Communication
- War room: [Link]
- Status updates: Every [X] hours
- Next update: [TIME]

---
This is an internal notification. Do NOT share externally.
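
Filling the template by hand under pressure invites mistakes, so it can help to render it from a few fields. The sketch below uses Python's standard `string.Template` on a shortened version of the alert above. The field names and `render_alert` are my own assumptions for illustration.

```python
from string import Template

# Hypothetical, shortened version of the alert template above.
ALERT_TEMPLATE = Template(
    "SUBJECT: [$severity] Security Incident Detected - $description\n"
    "Incident ID: $incident_id\n"
    "Detected: $detected\n"
    "Status: $status\n"
)


def render_alert(fields):
    """Fill the template; raises KeyError if a required field is missing."""
    return ALERT_TEMPLATE.substitute(fields)


message = render_alert({
    "severity": "HIGH",
    "description": "Unauthorized admin login",
    "incident_id": "INC-2025-1001",
    "detected": "2025-10-08 14:23:00 UTC",
    "status": "INVESTIGATING",
})
print(message)
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing field fails loudly instead of sending an alert with a blank in it.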

Incident Severity Classification

// incident-severity.js
const SEVERITY_LEVELS = {
  CRITICAL: {
    level: 1,
    sla_response: '15 minutes',
    sla_update: '1 hour',
    criteria: [
      'Active data breach',
      'Ransomware encryption in progress',
      'Complete service outage',
      'Customer data exposed publicly',
      'Ongoing financial fraud',
    ],
    notification: ['All executives', 'Board of directors', 'All IR team'],
    example: 'Production database containing customer PII publicly accessible',
  },

  HIGH: {
    level: 2,
    sla_response: '30 minutes',
    sla_update: '2 hours',
    criteria: [
      'Suspected data breach',
      'Malware detected on multiple systems',
      'Successful phishing attack',
      'Major service degradation',
      'Unauthorized access detected',
    ],
    notification: ['IR team', 'Department heads', 'CISO'],
    example: 'Admin credentials compromised, unauthorized access to internal systems',
  },

  MEDIUM: {
    level: 3,
    sla_response: '2 hours',
    sla_update: '8 hours',
    criteria: [
      'Malware detected on single system',
      'Suspicious network activity',
      'Failed attack attempts',
      'Minor service disruption',
      'Policy violations',
    ],
    notification: ['IR team', 'Security team'],
    example: 'Malware detected on employee workstation, isolated successfully',
  },

  LOW: {
    level: 4,
    sla_response: '8 hours',
    sla_update: '24 hours',
    criteria: [
      'Security alerts requiring investigation',
      'Potential vulnerabilities',
      'Minor anomalies',
    ],
    notification: ['Security team'],
    example: 'Unusual login pattern detected, investigating',
  },
};

function classifyIncident(indicators) {
  // Automatic severity classification
  if (indicators.data_breach || indicators.ransomware) {
    return SEVERITY_LEVELS.CRITICAL;
  }
  if (indicators.unauthorized_access || indicators.malware_spread) {
    return SEVERITY_LEVELS.HIGH;
  }
  if (indicators.malware_isolated) {
    return SEVERITY_LEVELS.MEDIUM;
  }
  return SEVERITY_LEVELS.LOW;
}
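
The SLA strings above are easier to enforce once they become actual deadlines. Here's a rough Python sketch that copies the SLA values from `SEVERITY_LEVELS` and computes when the first response and first status update are due. It assumes `sla_update` means "first update due this long after detection"; `sla_deadlines` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# SLA values copied from SEVERITY_LEVELS above, expressed as timedeltas.
SLA = {
    "CRITICAL": {"response": timedelta(minutes=15), "update": timedelta(hours=1)},
    "HIGH": {"response": timedelta(minutes=30), "update": timedelta(hours=2)},
    "MEDIUM": {"response": timedelta(hours=2), "update": timedelta(hours=8)},
    "LOW": {"response": timedelta(hours=8), "update": timedelta(hours=24)},
}


def sla_deadlines(severity, detected_at):
    """Return when the first response and the first status update are due."""
    sla = SLA[severity]
    return {
        "respond_by": detected_at + sla["response"],
        "first_update_by": detected_at + sla["update"],
    }


detected = datetime(2025, 10, 8, 14, 23, tzinfo=timezone.utc)
deadlines = sla_deadlines("HIGH", detected)
print(deadlines["respond_by"].isoformat())       # 2025-10-08T14:53:00+00:00
print(deadlines["first_update_by"].isoformat())  # 2025-10-08T16:23:00+00:00
```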

Phase 2: Detection & Analysis

Detection Runbook

# Incident Detection Runbook

## Automated Detection Sources

### 1. SIEM Alerts
**System**: Splunk
**Alert Types**:
- Multiple failed login attempts
- Privilege escalation
- Unusual data transfers
- Malware signatures
- Network anomalies

**Action**:
1. Review alert in Splunk
2. Verify not a false positive
3. Create incident ticket if confirmed
4. Escalate based on severity

### 2. EDR (Endpoint Detection & Response)
**System**: CrowdStrike Falcon
**Alert Types**:
- Malware execution
- Ransomware behavior
- Suspicious process activity
- Unauthorized software

**Action**:
1. Review detection in Falcon console
2. Check process tree and IOCs
3. Isolate endpoint if active threat
4. Collect forensic data
5. Create incident ticket

### 3. Network IDS/IPS
**System**: Suricata
**Alert Types**:
- C2 communication
- Port scanning
- Data exfiltration
- Known attack patterns

**Action**:
1. Review PCAP in Wireshark
2. Identify source/destination
3. Block malicious IPs at firewall
4. Isolate affected systems
5. Create incident ticket

### 4. Cloud Security
**System**: AWS GuardDuty
**Alert Types**:
- Unusual API calls
- Instance compromise
- IAM anomalies
- Cryptocurrency mining

**Action**:
1. Review finding details
2. Check CloudTrail logs
3. Snapshot affected instances
4. Revoke compromised credentials
5. Create incident ticket

## Manual Detection Sources

### User Reports
- Phishing emails
- Suspicious activity
- System anomalies
- Data concerns

### Threat Intelligence
- CVE announcements
- Threat actor TTPs
- IOC feeds
- Security advisories

## Initial Analysis Checklist

- [ ] Collect all available data
- [ ] Identify affected systems
- [ ] Determine attack vector
- [ ] Assess scope and impact
- [ ] Classify severity
- [ ] Preserve evidence
- [ ] Document timeline
- [ ] Notify IR team

Analysis Automation

# incident-analysis.py
from datetime import timedelta


class IncidentAnalyzer:
    def __init__(self, siem, edr, threat_intel):
        # The connectors are injected: each one is a thin wrapper around
        # your own SIEM, EDR, and threat-intelligence APIs (not shown here).
        self.siem = siem
        self.edr = edr
        self.threat_intel = threat_intel

    def analyze_alert(self, alert_id):
        """Automated incident analysis"""

        # 1. Gather context
        alert = self.siem.get_alert(alert_id)

        context = {
            'alert_id': alert_id,
            'timestamp': alert['timestamp'],
            'source_ip': alert['source_ip'],
            'destination_ip': alert['dest_ip'],
            'user': alert['user'],
            'hostname': alert['hostname'],
        }

        # 2. Check threat intelligence
        threat_data = self.threat_intel.lookup_ip(context['source_ip'])
        if threat_data['malicious']:
            context['threat_intel'] = {
                'reputation': threat_data['reputation'],
                'categories': threat_data['categories'],
                'last_seen': threat_data['last_seen'],
            }

        # 3. Query related events (assumes alert['timestamp'] is a datetime object)
        timeframe_start = alert['timestamp'] - timedelta(hours=24)
        timeframe_end = alert['timestamp'] + timedelta(hours=1)

        related_events = self.siem.query({
            'query': f'source_ip:{context["source_ip"]} OR dest_ip:{context["source_ip"]}',
            'start_time': timeframe_start,
            'end_time': timeframe_end,
        })

        context['related_events'] = len(related_events)
        context['event_timeline'] = self._build_timeline(related_events)

        # 4. Check for lateral movement
        lateral_movement = self._detect_lateral_movement(
            context['hostname'],
            timeframe_start,
            timeframe_end
        )
        context['lateral_movement'] = lateral_movement

        # 5. Check endpoint status
        endpoint_status = self.edr.get_device_status(context['hostname'])
        context['endpoint'] = {
            'online': endpoint_status['online'],
            'last_seen': endpoint_status['last_seen'],
            'malware_detected': endpoint_status['malware_count'] > 0,
            'isolation_status': endpoint_status['network_isolated'],
        }

        # 6. Assess severity
        severity = self._calculate_severity(context)
        context['severity'] = severity

        # 7. Generate recommendations
        context['recommendations'] = self._generate_recommendations(context)

        # 8. Create incident ticket
        incident_id = self._create_incident(context)
        context['incident_id'] = incident_id

        return context

    def _calculate_severity(self, context):
        """Calculate incident severity based on indicators"""
        score = 0

        # Threat intelligence match
        if context.get('threat_intel'):
            score += 30

        # Lateral movement detected
        if context.get('lateral_movement'):
            score += 40

        # Malware detected
        if context['endpoint']['malware_detected']:
            score += 30

        # Multiple related events
        if context['related_events'] > 10:
            score += 20

        if score >= 70:
            return 'CRITICAL'
        elif score >= 40:
            return 'HIGH'
        elif score >= 20:
            return 'MEDIUM'
        else:
            return 'LOW'

    def _generate_recommendations(self, context):
        """Generate automated response recommendations"""
        recommendations = []

        if context['endpoint']['malware_detected']:
            recommendations.append({
                'action': 'isolate_endpoint',
                'priority': 'IMMEDIATE',
                'command': f'falcon contain {context["hostname"]}',
            })

        if context.get('threat_intel'):
            recommendations.append({
                'action': 'block_ip',
                'priority': 'IMMEDIATE',
                'command': f'firewall block {context["source_ip"]}',
            })

        if context.get('lateral_movement'):
            recommendations.append({
                'action': 'reset_credentials',
                'priority': 'HIGH',
                'details': f'Reset credentials for user {context["user"]}',
            })

        return recommendations
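
The class above calls `_build_timeline` and `_detect_lateral_movement` without showing them. Here is one way they might look, written as standalone functions over a list of event dictionaries. The event fields (`timestamp`, `hostname`, `user`, `event_type`) and the "more than two hosts" heuristic are my assumptions for illustration, not the behavior of any particular SIEM.

```python
from collections import defaultdict
from datetime import datetime


def build_timeline(events):
    """Sort events chronologically and format one line per event."""
    ordered = sorted(events, key=lambda event: event["timestamp"])
    return [
        f'{event["timestamp"].isoformat()} {event["hostname"]} {event["event_type"]}'
        for event in ordered
    ]


def detect_lateral_movement(events, max_hosts=2):
    """Return users with successful logins on more than max_hosts distinct hosts."""
    hosts_by_user = defaultdict(set)
    for event in events:
        if event["event_type"] == "login_success":
            hosts_by_user[event["user"]].add(event["hostname"])
    return {
        user: sorted(hosts)
        for user, hosts in hosts_by_user.items()
        if len(hosts) > max_hosts
    }


# Hypothetical sample events for a quick check.
events = [
    {"timestamp": datetime(2025, 10, 8, 15, 20), "hostname": "db-prod-01", "user": "admin", "event_type": "login_success"},
    {"timestamp": datetime(2025, 10, 8, 15, 10), "hostname": "app-prod-01", "user": "admin", "event_type": "login_success"},
    {"timestamp": datetime(2025, 10, 8, 15, 25), "hostname": "admin-portal", "user": "admin", "event_type": "login_success"},
    {"timestamp": datetime(2025, 10, 8, 15, 12), "hostname": "app-prod-01", "user": "jdoe", "event_type": "login_failure"},
]
print(build_timeline(events)[0])
print(detect_lateral_movement(events))
```

A host-count threshold is a crude signal. Treat a hit as a prompt for a human to look closer, not as proof of lateral movement.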

Phase 3: Containment

Containment Runbook

# Containment Runbook

## Objective
Stop the incident from spreading while preserving evidence.

## Short-Term Containment

### Network Isolation

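**Isolate Compromised System**:
```bash
# Using CrowdStrike Falcon (placeholder command: network containment is
# normally triggered from the Falcon console or API)
falcon contain <hostname>

# OR on the gateway firewall (-I inserts ahead of existing ACCEPT rules;
# FORWARD covers traffic passing through the firewall)
iptables -I FORWARD -s <compromised_ip> -j DROP
iptables -I FORWARD -d <compromised_ip> -j DROP

# Verify isolation
ping -c 3 <compromised_ip>  # Should fail
```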

Block Malicious IPs:

# Using pfSense firewall
pfctl -t malicious_ips -T add <ip_address>

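# Using an AWS network ACL (security groups are allow-only, so revoking a
# rule cannot block an IP; a network ACL deny entry can)
# Pick a rule number lower than your existing allow rules
aws ec2 create-network-acl-entry \
  --network-acl-id acl-xxxxxxxx \
  --ingress \
  --rule-number 90 \
  --protocol tcp \
  --port-range From=0,To=65535 \
  --cidr-block <ip_address>/32 \
  --rule-action deny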
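
Before any IP goes into a block list, it's worth a sanity check so a typo or a spoofed indicator doesn't cut off your own network. The sketch below uses Python's standard `ipaddress` module. The protected ranges and the `safe_to_block` name are assumptions; replace them with your own address space.

```python
import ipaddress

# Hypothetical ranges that must never be blocked automatically.
PROTECTED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")
]


def safe_to_block(candidate):
    """Return True only for a syntactically valid IP outside the protected ranges."""
    try:
        ip = ipaddress.ip_address(candidate.strip())
    except ValueError:
        return False  # not an IP address at all
    return not any(ip in network for network in PROTECTED_NETWORKS)


for value in ("203.0.113.7", "10.1.2.3", "not-an-ip"):
    print(value, "->", "block" if safe_to_block(value) else "skip")
```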

Account Lockout

Disable Compromised Account:

# Active Directory
Disable-ADAccount -Identity <username>

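# AWS IAM: deactivate the user's access keys (IAM users have no "active" flag)
aws iam update-access-key --user-name <username> --access-key-id <key> --status Inactive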

# Force password reset
aws iam update-login-profile \
  --user-name <username> \
  --password-reset-required

Revoke Active Sessions:

# AWS
aws iam delete-access-key --user-name <username> --access-key-id <key>

# Kubernetes
kubectl delete secret <service-account-token>

# Database
REVOKE ALL PRIVILEGES ON *.* FROM 'username'@'host';

Snapshot Evidence

Preserve System State:

# Create disk snapshot (AWS)
aws ec2 create-snapshot \
  --volume-id vol-xxxxxxxx \
  --description "Forensic snapshot - INC-2025-1001"

# Create memory dump with an acquisition tool such as AVML
# (Volatility analyzes memory images; it does not capture them)
sudo ./avml memory-dump-$(date +%Y%m%d-%H%M%S).lime

# Collect logs
tar -czf logs-$(hostname)-$(date +%Y%m%d).tar.gz /var/log/
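
Evidence is only useful later if you can show it hasn't changed since collection. One lightweight approach is to hash every collected file and record who collected it and when. The sketch below does that with the standard library only. The manifest layout and function names are my own assumptions, not a forensic standard.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(incident_id, collected_by, paths):
    """Record size and SHA-256 for each evidence file, plus who collected it and when."""
    return {
        "incident_id": incident_id,
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "files": [
            {"path": str(p), "size_bytes": Path(p).stat().st_size, "sha256": sha256_of(p)}
            for p in paths
        ],
    }


# Demo with a throwaway file standing in for a real log archive.
with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "logs-example.tar.gz"
    sample.write_bytes(b"example evidence")
    manifest = build_manifest("INC-2025-1001", "analyst-on-call", [sample])
    print(json.dumps(manifest, indent=2))
```

Store the manifest somewhere the compromised systems cannot write to, so the hashes themselves stay trustworthy.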

Long-Term Containment

Implement Workarounds

Example: no patch is available yet:

Deploy WAF rules to block exploit attempts
Implement network segmentation
Add enhanced monitoring
Disable vulnerable features
Document all changes

Strengthen Defenses

Post-Containment Hardening:

hardening_checklist:
  network:
    - [ ] Review firewall rules
    - [ ] Update IDS/IPS signatures
    - [ ] Enable additional logging
    - [ ] Implement network segmentation

  authentication:
    - [ ] Force password resets
    - [ ] Enable MFA for all admin accounts
    - [ ] Review access privileges
    - [ ] Audit service accounts

  systems:
    - [ ] Apply security patches
    - [ ] Update antivirus signatures
    - [ ] Harden system configurations
    - [ ] Remove unnecessary services

  monitoring:
    - [ ] Add detection rules
    - [ ] Increase log retention
    - [ ] Set up anomaly detection
    - [ ] Create dashboards

Containment Decision Matrix

| Scenario | Immediate Action | Secondary Action |
|----------|------------------|------------------|
| Ransomware detected | Isolate all affected systems | Disable admin accounts |
| Data exfiltration | Block external IPs | Snapshot systems |
| Malware outbreak | Isolate patient zero | Update AV signatures |
| Compromised admin | Disable account | Rotate all credentials |
| DDoS attack | Enable DDoS protection | Contact ISP |
| SQL injection | WAF block pattern | Patch vulnerability |
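
The matrix above is simple enough to encode as data, which lets alerting or chat tooling answer "what do we do first?" without anyone hunting for the document. A minimal sketch, where the scenario keys and `containment_actions` are hypothetical names I chose:

```python
# The containment decision matrix above, encoded as data.
CONTAINMENT_MATRIX = {
    "ransomware": ("Isolate all affected systems", "Disable admin accounts"),
    "data_exfiltration": ("Block external IPs", "Snapshot systems"),
    "malware_outbreak": ("Isolate patient zero", "Update AV signatures"),
    "compromised_admin": ("Disable account", "Rotate all credentials"),
    "ddos": ("Enable DDoS protection", "Contact ISP"),
    "sql_injection": ("WAF block pattern", "Patch vulnerability"),
}


def containment_actions(scenario):
    """Return the ordered actions for a scenario, or an escalation hint if unknown."""
    try:
        immediate, secondary = CONTAINMENT_MATRIX[scenario]
    except KeyError:
        return ["Escalate to the incident commander: no predefined action"]
    return [f"1. {immediate}", f"2. {secondary}"]


print("\n".join(containment_actions("compromised_admin")))
```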


Phase 4: Eradication

Eradication Runbook

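```bash
#!/bin/bash
# eradicate-malware.sh
# Usage: ./eradicate-malware.sh <hostname> <incident-id>
# Assumes SSH still works over a management path while containment
# blocks other traffic (including ICMP) to the host.

set -euo pipefail

TARGET_HOST="${1:?hostname required}"   # avoids clobbering bash's built-in $HOSTNAME
INCIDENT_ID="${2:?incident ID required}"

echo "[$(date)] Starting eradication for $TARGET_HOST (Incident: $INCIDENT_ID)"

# 1. Verify system is isolated
if ping -c 1 "$TARGET_HOST" > /dev/null 2>&1; then
    echo "ERROR: System is not isolated. Aborting."
    exit 1
fi

# 2. Kill malicious processes
# The [m] bracket stops pkill -f from matching the remote shell's own
# command line; "|| true" covers pkill's exit code 1 when nothing matches.
echo "Terminating malicious processes..."
ssh "$TARGET_HOST" "pkill -9 -f '[m]alware_process_name' || true"

# 3. Remove malware files
echo "Removing malware files..."
ssh "$TARGET_HOST" "
    rm -f /tmp/.hidden_malware
    rm -f /var/tmp/malicious_script.sh
    find /home -name '.crypted' -delete
"

# 4. Remove persistence mechanisms
echo "Removing persistence..."
ssh "$TARGET_HOST" "
    # Back up the crontab, then drop only the malicious entry
    # (crontab -r would delete every job for the user)
    crontab -l > ~/crontab.$INCIDENT_ID.bak 2>/dev/null || true
    crontab -l 2>/dev/null | grep -v 'malicious_script.sh' | crontab - || true

    # Stop, disable, and remove the malicious systemd service
    systemctl disable --now malicious.service || true
    rm -f /etc/systemd/system/malicious.service
    systemctl daemon-reload

    # Print startup scripts for review rather than deleting /etc/rc.local outright
    cat /etc/rc.local 2>/dev/null || true
"

# 5. Clean registry (if Windows; export OS=Windows before running)
if [[ "${OS:-}" == "Windows" ]]; then
    echo "Cleaning Windows registry..."
    # PowerShell commands to remove registry keys
fi

# 6. Check for backdoors (output is for manual review)
echo "Checking for backdoors..."
ssh "$TARGET_HOST" "
    # List SSH keys so unauthorized ones can be spotted
    find /home -name 'authorized_keys' -exec cat {} \;

    # Check for setuid binaries
    find / -perm -4000 -type f 2>/dev/null

    # Check for listening ports
    ss -tlnp
"

# 7. Rebuild if necessary
read -p "Rebuild system from clean image? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    echo "Rebuilding from clean image..."
    # Trigger rebuild automation
fi

# 8. Verify eradication (a scanner that reports findings fails the run)
echo "Running verification scans..."
ssh "$TARGET_HOST" "
    set -e
    # Full antivirus scan
    clamscan -r /

    # Rootkit scan (--sk skips the interactive keypress prompts)
    rkhunter --check --sk

    # Malware scan
    maldet -a /
"

echo "[$(date)] Eradication complete for $TARGET_HOST"
```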
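
Step 6 of the script only prints `authorized_keys` files, which leaves someone to eyeball them mid-incident. If you keep a list of approved public keys, the comparison can be automated. Here's a rough Python sketch. The approved-key set is an assumption, and the parser is deliberately simple: it treats the token after the first key-type field as the key material and would need hardening for option strings that contain spaces.

```python
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def unauthorized_keys(authorized_keys_text, approved_keys):
    """Return authorized_keys lines whose key material is not in approved_keys."""
    suspicious = []
    for raw_line in authorized_keys_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split()
        # The key material is the token right after the key type (e.g. ssh-ed25519).
        key_material = next(
            (parts[i + 1] for i, token in enumerate(parts[:-1])
             if token.startswith(KEY_TYPE_PREFIXES)),
            None,
        )
        # Malformed lines have no key material and are flagged as well.
        if key_material not in approved_keys:
            suspicious.append(line)
    return suspicious


# Hypothetical file contents and approved set, with shortened key strings.
sample = (
    "# admin keys\n"
    "ssh-ed25519 AAAAC3approved alice@laptop\n"
    "ssh-rsa AAAAB3unknown attacker@host\n"
)
print(unauthorized_keys(sample, {"AAAAC3approved"}))
```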

Phase 5: Recovery

Recovery Runbook

# Recovery Runbook

## Pre-Recovery Checklist

- [ ] Incident fully eradicated (verified)
- [ ] Systems hardened and patched
- [ ] Credentials rotated
- [ ] Monitoring enhanced
- [ ] Backups verified clean
- [ ] Recovery plan approved

## Recovery Steps

### 1. Restore from Backup

**Verify Backup Integrity**:
```bash
# Check that a backup exists from before the incident
# (adjust the date so it precedes the earliest known compromise)
aws s3 ls s3://backups/ | grep "$(date -d "7 days ago" +%Y-%m-%d)"

# Verify backup hash
sha256sum backup.tar.gz
# Compare with stored hash

# Test restore in isolated environment
restore_backup.sh --verify --isolated
```

Restore Production Data:

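# Database restore (pg_restore needs a custom-format dump from pg_dump -Fc;
# a plain .sql dump is restored with psql instead)
pg_restore -h localhost -U postgres -d production backup.dump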

# File restore
aws s3 sync s3://backups/2025-10-01/ /data/restore/

# Verify data integrity
./verify_data_integrity.sh
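
Picking the restore point deserves more care than "7 days ago": you want the newest backup taken before the earliest known compromise. A minimal sketch of that selection, assuming backups are identified by date as in the `s3://backups/2025-10-01/` path above. The function name and the sample dates are hypothetical.

```python
from datetime import date


def latest_clean_backup(backup_dates, earliest_compromise):
    """Return the newest backup taken strictly before the earliest known compromise."""
    candidates = [d for d in backup_dates if d < earliest_compromise]
    return max(candidates) if candidates else None


# Hypothetical backup dates; the compromise date matches the example incident.
backups = [date(2025, 9, 24), date(2025, 10, 1), date(2025, 10, 8), date(2025, 10, 9)]
print(latest_clean_backup(backups, date(2025, 10, 8)))  # 2025-10-01
```

If the investigation later pushes the compromise date earlier, rerun the selection before restoring anything.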

2. Rebuild Compromised Systems

Infrastructure as Code:

# terraform/production.tf
# Rebuild from clean state

resource "aws_instance" "web_server" {
  ami           = data.aws_ami.hardened_ami.id  # Pre-incident AMI
  instance_type = "t3.medium"

  vpc_security_group_ids = [aws_security_group.web.id]
  subnet_id              = aws_subnet.public.id

  user_data = file("user_data.sh")

  tags = {
    Name = "web-server-rebuilt"
    IncidentRecovery = "INC-2025-1001"
  }
}

3. Gradual Service Restoration

Phased Approach:

recovery_phases:
  phase_1:
    duration: "2 hours"
    services:
      - Internal testing environment
    validation:
      - Functionality tests
      - Security scans
      - Performance tests

  phase_2:
    duration: "4 hours"
    services:
      - Limited user access (10% traffic)
    validation:
      - User acceptance testing
      - Monitoring alerts
      - No anomalies detected

  phase_3:
    duration: "8 hours"
    services:
      - Increased user access (50% traffic)
    validation:
      - Continued monitoring
      - Performance metrics normal
      - Security baseline maintained

  phase_4:
    services:
      - Full production restoration
    validation:
      - All systems operational
      - 24hr monitoring period
      - Incident closed
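
To turn those phase durations into a schedule people can plan around, you can compute start and end times from the moment recovery begins. The sketch below assumes each phase passes validation on time and the next one starts immediately. The start time and the `recovery_schedule` name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Durations copied from recovery_phases above; phase_4 has no fixed duration.
PHASES = [
    ("phase_1", "Internal testing environment", timedelta(hours=2)),
    ("phase_2", "Limited user access (10% traffic)", timedelta(hours=4)),
    ("phase_3", "Increased user access (50% traffic)", timedelta(hours=8)),
    ("phase_4", "Full production restoration", None),
]


def recovery_schedule(start):
    """Return each phase with its start time and, where known, its end time."""
    schedule, cursor = [], start
    for name, services, duration in PHASES:
        end = cursor + duration if duration else None
        schedule.append({"phase": name, "services": services, "starts": cursor, "ends": end})
        if end:
            cursor = end
    return schedule


start = datetime(2025, 10, 9, 18, 0, tzinfo=timezone.utc)  # hypothetical start
for entry in recovery_schedule(start):
    print(entry["phase"], entry["starts"].isoformat(), entry["services"])
```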

4. Enhanced Monitoring

Post-Recovery Monitoring:

# alerting-rules.yml
groups:
  - name: post_incident_monitoring
    interval: 1m
    rules:
      - alert: SuspiciousActivityDetected
        expr: rate(failed_login_attempts[5m]) > 5
        for: 5m
        labels:
          severity: high
          incident: INC-2025-1001
        annotations:
          summary: "Suspicious activity detected post-recovery"

      - alert: UnusualNetworkTraffic
        expr: rate(network_bytes_out[5m]) > 1000000000
        for: 5m
        labels:
          severity: high
          incident: INC-2025-1001
        annotations:
          summary: "Unusual network traffic detected"

      - alert: UnauthorizedProcessExecution
        expr: suspicious_process_count > 0
        for: 1m
        labels:
          severity: critical
          incident: INC-2025-1001
        annotations:
          summary: "Unauthorized process detected"

Recovery Validation

All services restored and functional
No signs of attacker presence
Monitoring shows normal patterns
User access restored
Performance metrics normal
Security scans clean
24-hour monitoring period completed


Phase 6: Post-Incident Activity

Post-Incident Review Template

```markdown
# Post-Incident Review - INC-2025-1001

## Incident Summary

**Incident ID**: INC-2025-1001
**Date Detected**: 2025-10-08 14:23 UTC
**Date Resolved**: 2025-10-10 09:15 UTC
**Duration**: 42 hours, 52 minutes
**Severity**: HIGH

## Executive Summary

Brief narrative of what happened, impact, and resolution.

## Timeline

| Time (UTC) | Event | Action Taken |
|------------|-------|--------------|
| 2025-10-08 14:23 | SIEM alert: Multiple failed login attempts | Security team notified |
| 2025-10-08 14:35 | Confirmed: Brute force attack in progress | Blocked source IPs, enabled rate limiting |
| 2025-10-08 15:10 | Detected: Successful admin login from suspicious IP | Disabled compromised account, initiated incident response |
| 2025-10-08 15:30 | Analysis: Attacker accessed customer database | Isolated database server, created forensic snapshot |
| ... | ... | ... |

## Root Cause Analysis

**What happened**: [Detailed description]

**Why it happened**:
- Weak password on admin account (no MFA enforced)
- Missing rate limiting on login endpoint
- Delayed alert notification (20 min)

**How it was detected**: SIEM alert for multiple failed logins

## Impact Assessment

**Systems Affected**:
- Web application server (app-prod-01)
- PostgreSQL database (db-prod-01)
- Admin portal

**Data Exposure**:
- Customer PII: ~10,000 records accessed
- Financial data: None
- Health data: None

**Business Impact**:
- Service downtime: 6 hours
- Customer notification required: Yes
- Regulatory reporting required: Yes (GDPR, within 72 hours)
- Estimated cost: $150,000

## What Went Well

1. Incident detected within 12 minutes
2. Response team assembled quickly
3. Clear communication throughout
4. Evidence preserved properly
5. No evidence that the accessed records were exfiltrated

## What Went Wrong

1. MFA not enforced on admin accounts
2. Weak password policy allowed compromise
3. Missing rate limiting enabled brute force
4. Alert notification delay (20 minutes)
5. Recovery took longer than expected (backup issues)

## Action Items

| #  | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| 1  | Enforce MFA on all admin accounts | Security Team | 2025-10-15 | ✅ Complete |
| 2  | Implement rate limiting on login endpoint | Dev Team | 2025-10-20 | In Progress |
| 3  | Update password policy (15 char minimum) | IT Team | 2025-10-15 | ✅ Complete |
| 4  | Reduce alert notification time to <5 min | Security Team | 2025-10-18 | Planned |
| 5  | Test backup restoration procedures | Ops Team | 2025-10-22 | Planned |
| 6  | Add additional monitoring for admin logins | Security Team | 2025-10-17 | In Progress |
| 7  | Security awareness training (phishing) | HR Team | 2025-11-01 | Planned |

## Lessons Learned

1. **Prevention**: Enforced MFA would very likely have blocked the admin account compromise
2. **Detection**: Rate limiting would have slowed or stopped the brute-force attack
3. **Response**: Having a playbook helped coordinate response
4. **Recovery**: Need to test backup procedures more frequently

## Recommendations

### Immediate (0-30 days)
- Complete all action items above
- Conduct security awareness training
- Perform penetration test

### Short-term (30-90 days)
- Implement security automation (SOAR)
- Enhance logging and monitoring
- Update incident response playbook

### Long-term (90+ days)
- Implement zero-trust architecture
- Deploy deception technology
- Conduct red team exercise

## Sign-off

**Prepared by**: Mike Chen, Technical Lead
**Reviewed by**: Sarah Johnson, CISO
**Approved by**: John Smith, CTO

**Date**: 2025-10-12
```
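
Durations in a review are easy to get wrong by hand, so it's worth computing them from the timeline. This short sketch derives the incident duration and the time from first alert to confirmation, using the timestamps in the template above. `hours_minutes` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"
detected = datetime.strptime("2025-10-08 14:23", FMT)   # first SIEM alert
confirmed = datetime.strptime("2025-10-08 14:35", FMT)  # attack confirmed
resolved = datetime.strptime("2025-10-10 09:15", FMT)   # incident resolved


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


print(hours_minutes(resolved - detected))   # (42, 52)
print(hours_minutes(confirmed - detected))  # (0, 12)
```

The results match the template: 42 hours 52 minutes overall, and 12 minutes from alert to confirmation.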

Key Takeaways

Preparation is everything - Build your playbook before you need it
Practice makes perfect - Run tabletop exercises quarterly
Clear communication - Use templates and defined roles
Document everything - During and after incidents
Learn and improve - Every incident makes you stronger
Automate when possible - Speed matters in incidents
Test your backups - Recovery is only possible with good backups

Conclusion

A well-crafted incident response playbook is your best defense when incidents occur. It provides structure during chaos, ensures consistent responses, and significantly reduces the time to contain and recover from security incidents.

The playbook I’ve shared here is a starting point. Customize it for your environment, practice it regularly, and update it after every incident. Your future self (during an incident) will thank you.

Remember: It’s not a matter of IF you’ll have an incident, but WHEN. Be prepared.

Published: October 8, 2025
