Operational incident response for triage, containment, communications, recovery, and postmortems. Use when coordinating outages or service degradation. NOT for code review or proactive security scanning.
- release-pipeline-architect: Release workflow architecture for versioning, artifact promotion, rollout safety, and rollback design. Use for release pipelines.
- security-scanner: Proactive security assessment with SAST, secrets detection, dependency scanning, and compliance checks. Use for pre-deployment audit.
- honest-review: Review code with confidence-scored evidence. Session, scoped, PR, or full audit; optional approved fix pass. Use when reviewing changes or quality.
---
name: incident-response-engineer
description: >-
  Operational incident response for triage, containment, communications,
  recovery, and postmortems. Use when coordinating outages or service
  degradation. NOT for code review or proactive security scanning.
argument-hint: "<mode> [incident]"
license: MIT
metadata:
  author: wyattowalsh
  version: "1.0.0"
---
# Incident Response Engineer
Coordinate production incident response from first signal through recovery and
postmortem.
**Scope:** Live operational incidents and service degradation. NOT for code review or proactive security scanning.

Use this skill when:
- A service is down, degraded, or violating its SLO
- Multiple responders need a common incident structure
- Stakeholder or customer updates must be issued on a cadence
- A fix is known but risk must be managed during containment and recovery
- The team needs a postmortem or tabletop exercise
## Classification Gate
- If the task is routine debugging or a one-off bug with no operational impact,
use investigate.
- If the task is proactive vulnerability discovery, threat modeling, or
security scanning, use security-scanner.
- If the task is code review, fix quality assessment, or pre-merge risk review,
use honest-review.
- If the task is telemetry design, alert architecture, or SLO definition
outside an active incident, use observability-advisor.
- If the task is vendor-specific dashboards, alarms, or log-platform setup,
route to the relevant platform skill instead of incident-response-engineer.
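
The gate above can be read as a small routing table. The sketch below is one hypothetical way to encode it in Python: the skill names come from the list above, while the category keys, the `route_task` function, and the `<relevant platform skill>` placeholder are assumptions made for illustration, not part of this skill.

```python
# Hypothetical routing table for the Classification Gate.
# Skill names are taken from the gate; the category keys are invented here.
THIS_SKILL = "incident-response-engineer"

ROUTES = {
    "routine_debugging": "investigate",
    "proactive_security": "security-scanner",
    "code_review": "honest-review",
    "telemetry_design": "observability-advisor",
    "vendor_platform_setup": "<relevant platform skill>",
}


def route_task(category: str, active_incident: bool = False) -> str:
    """Return the skill that should handle a task of the given category."""
    # Telemetry and SLO work is only routed away when no incident is active.
    if category == "telemetry_design" and active_incident:
        return THIS_SKILL
    # Assumption: anything the gate does not name stays with this skill.
    return ROUTES.get(category, THIS_SKILL)
```

For example, `route_task("code_review")` returns `"honest-review"`, while an unlisted category such as `"service_outage"` stays with `incident-response-engineer`.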
## Instructions
### Mode: Triage
1. Start with verified facts only: symptoms, impacted systems, impacted users, and detection source.
2. Estimate severity from impact and blast radius, not gut feel. Use `references/severity-matrix.md`.
3. Name an incident commander and define the next decision checkpoint.
4. Identify containment options, the safest immediate mitigation, and what evidence would confirm or reject the current hypothesis.
5. Produce the first 15-minute response plan.
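
As a rough illustration of the triage steps above, the following Python sketch captures the verified facts, the commander, and the next decision checkpoint in one record. The `TriageRecord` class and its field names are hypothetical, and the severity value is passed in as an opaque label because the contents of `references/severity-matrix.md` are not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TriageRecord:
    """Hypothetical container for the first 15-minute response plan."""
    symptoms: List[str]
    impacted_systems: List[str]
    impacted_users: str
    detection_source: str
    severity: str          # label chosen from the severity matrix (not modeled here)
    commander: str
    opened_at: datetime
    checkpoint_minutes: int = 15

    @property
    def next_checkpoint(self) -> datetime:
        # The next decision checkpoint is a fixed offset from the open time.
        return self.opened_at + timedelta(minutes=self.checkpoint_minutes)

    def missing_facts(self) -> List[str]:
        """List the verified-fact fields that are still empty."""
        facts = {
            "symptoms": self.symptoms,
            "impacted_systems": self.impacted_systems,
            "impacted_users": self.impacted_users,
            "detection_source": self.detection_source,
        }
        return [name for name, value in facts.items() if not value]
```

A record with an empty `detection_source` would report `["detection_source"]` from `missing_facts()`, which is a cue to verify how the incident was detected before moving on.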
### Mode: Stabilize
1. Separate containment from root cause work.
2. Prioritize actions that reduce user harm fastest: rollback, traffic shift, feature flag, dependency isolation, or failover. Use `references/containment-recovery-aids.md`.
3. Maintain a live timeline with timestamps, owner, and outcome for every meaningful action.
### Mode: Postmortem
3. Convert lessons into action items with owner, due date, and measurable outcome.
4. Focus on system fixes, not blame.
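
The action-item rule above (owner, due date, measurable outcome) lends itself to a simple completeness check. This is a minimal Python sketch; the `ActionItem` class and the `incomplete_fields` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical postmortem action item."""
    summary: str
    owner: Optional[str] = None
    due: Optional[date] = None
    measurable_outcome: Optional[str] = None


def incomplete_fields(item: ActionItem) -> List[str]:
    """Return the names of required fields that are still unset."""
    missing = []
    if not item.owner:
        missing.append("owner")
    if item.due is None:
        missing.append("due")
    if not item.measurable_outcome:
        missing.append("measurable_outcome")
    return missing
```

An item that only has a summary reports all three fields as missing, which flags it as a vague lesson instead of an owned corrective action.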
### Mode: Review
1. Read the timeline, updates, runbook, and follow-up actions. Compare against `references/severity-matrix.md`, `references/comms-templates.md`, and `references/timeline-postmortem-examples.md`.
### Mode: Drill
1. Define scenario, objective, and stop condition.
2. Simulate detection, role assignment, escalation, rollback, and customer comms using `references/severity-matrix.md`, `references/comms-templates.md`, and `references/containment-recovery-aids.md`.
3. Capture decision bottlenecks and missing runbook steps.
## Output Requirements
- Triage and stabilize outputs must include severity, blast radius, commander, next checkpoint, and immediate actions.
- Comms outputs must include audience and next update time.
- Postmortems must include action items with owners.
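
The requirements above can be checked mechanically before an output is sent. The Python sketch below assumes outputs are plain dictionaries; the key names (`blast_radius`, `next_update_time`, and so on) and the `missing_fields` function are assumptions made for this example, not a format the skill defines.

```python
from typing import Dict, List

# Hypothetical field names derived from the Output Requirements list.
_RESPONSE_FIELDS = [
    "severity", "blast_radius", "commander", "next_checkpoint", "immediate_actions",
]
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "triage": _RESPONSE_FIELDS,
    "stabilize": _RESPONSE_FIELDS,
    "comms": ["audience", "next_update_time"],
    "postmortem": ["action_items"],
}


def missing_fields(mode: str, output: dict) -> List[str]:
    """Return required fields that are absent or empty for the given mode."""
    missing = [name for name in REQUIRED_FIELDS[mode] if not output.get(name)]
    if mode == "postmortem":
        # Every postmortem action item must name an owner.
        for index, item in enumerate(output.get("action_items") or []):
            if not item.get("owner"):
                missing.append(f"action_items[{index}].owner")
    return missing
```

An empty result means the output satisfies the checklist; any returned name points at the field to fill in first.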
## Critical Rules
1. Always distinguish fact, inference, and hypothesis.
2. Customer impact takes priority over elegant diagnosis.
3. Never claim a root cause publicly before the evidence supports it.
4. Every live incident needs a single incident commander.
5. Every meaningful action during response must land in the timeline.
6. Postmortems must produce owned corrective actions, not vague lessons.
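
Rules 1, 4, and 5 above suggest a timeline structure with exactly one commander, where every entry carries a timestamp, an owner, an outcome, and a label saying whether it is fact, inference, or hypothesis. The following Python sketch shows one possible shape; the `Timeline`, `TimelineEntry`, and `Kind` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Kind(Enum):
    FACT = "fact"
    INFERENCE = "inference"
    HYPOTHESIS = "hypothesis"


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    owner: str
    action: str
    outcome: str
    kind: Kind


class Timeline:
    """Hypothetical append-only incident timeline with one commander."""

    def __init__(self, commander: str) -> None:
        if not commander:
            raise ValueError("every live incident needs a single incident commander")
        self.commander = commander
        self._entries: List[TimelineEntry] = []

    def record(self, owner: str, action: str, outcome: str, kind: Kind,
               at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time when no timestamp is supplied.
        entry = TimelineEntry(at or datetime.now(timezone.utc), owner, action, outcome, kind)
        self._entries.append(entry)
        return entry

    def entries(self) -> List[TimelineEntry]:
        """Return entries in chronological order, regardless of insertion order."""
        return sorted(self._entries, key=lambda entry: entry.at)
```

Making entries immutable and sorting on read keeps late-recorded actions in the right place without allowing earlier entries to be rewritten.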
## Scaling Strategy
- For a single-service incident, keep one commander, one timeline, one
response channel, and one short update cadence.
- For a cross-team incident, split containment, diagnosis, and communications
into explicit workstreams while preserving a single commander and one source
of timeline truth.
- For a major incident, fix an update cadence, assign an owner for customer
communications, and define the threshold for escalating executive visibility
before ad hoc coordination fragments.
## Reference Files
Use these only when the current mode needs deeper structure or reusable
templates.
| File | Purpose | Use when |
|------|---------|----------|
| `references/severity-matrix.md` | Severity calibration by impact, blast radius, and response posture | Triage, stabilize, drill, review |
| `references/comms-templates.md` | Internal, executive, support, and customer update templates with cadence guidance | Comms, stabilize, drill, review |