This after-action report (AAR) template helps teams document cloud outages clearly and consistently. It is designed to support root cause analysis, highlight recovery efforts, and ensure leadership and engineering teams learn from every incident.
## Why Use an AAR?
- Prevent repeat outages by addressing systemic gaps
- Establish operational transparency across teams
- Create documentation that supports audits and continuous improvement
## 📄 AAR Template
### 1. Summary
One paragraph describing what happened, how long it lasted, and who was affected.
### 2. Timeline
List events in order (with timestamps): detection, escalation, mitigation steps, and resolution.
- 00:04 - Alert fired in PagerDuty (503 errors from service A)
- 00:06 - On-call engineer acknowledged the page
- 00:10 - Correlated CloudWatch logs showed an increased error rate on the new deploy
- 00:12 - Rolled back the release to the previous image
- 00:16 - Service health restored
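The correlation step at 00:10 is usually the hardest part of the timeline to reconstruct afterward, so it helps to capture the actual query that was run. Below is a minimal sketch, assuming the service writes structured logs to CloudWatch Logs; the log group name, the `status` field, and the time window are placeholders, not values from this incident.

```python
import time
import boto3

# Hypothetical log group and time window; adjust to your environment.
LOG_GROUP = "/ecs/service-a"
END = int(time.time())
START = END - 3600  # last hour

logs = boto3.client("logs")

# Count 503 responses per minute around the deploy window.
query = """
fields @timestamp, status
| filter status = 503
| stats count() as errors by bin(1m)
| sort bin(1m) asc
"""

resp = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=START,
    endTime=END,
    queryString=query,
)

# Poll until the query finishes, then print per-minute error counts.
while True:
    result = logs.get_query_results(queryId=resp["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({f["field"]: f["value"] for f in row})
```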
### 3. Impact
- Which services were affected?
- How many users were impacted?
- What was the duration?
- What was the estimated cost or operational disruption? (a rough quantification sketch follows this list)
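If exact user counts are unavailable, a back-of-the-envelope estimate is still more useful than an empty field. The sketch below is only illustrative; the traffic rate, error ratio, and revenue-per-request figures are invented placeholders to show the arithmetic, not measurements from this incident.

```python
from datetime import datetime

# Timeline boundaries from the example above (placeholder date).
detected = datetime.fromisoformat("2024-01-01T00:04:00")
restored = datetime.fromisoformat("2024-01-01T00:16:00")
duration_min = (restored - detected).total_seconds() / 60

# Hypothetical traffic and business figures; replace with real metrics.
requests_per_min = 12_000      # average traffic during the window
error_ratio = 0.35             # share of requests that returned 503
revenue_per_request = 0.002    # rough dollars per successful request

failed_requests = duration_min * requests_per_min * error_ratio
estimated_cost = failed_requests * revenue_per_request

print(f"Duration: {duration_min:.0f} min")
print(f"Failed requests: {failed_requests:,.0f}")
print(f"Estimated revenue impact: ${estimated_cost:,.2f}")
```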
### 4. Root Cause
What triggered the outage? Note the environment, any misconfigurations, and any human error involved. Keep the analysis blame-free, technical, and specific.
### 5. Resolution
What resolved the incident? Note whether mitigation was manual or automated, and include rollback procedures or emergency patches used.
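Recording the exact rollback steps turns the report into a reusable runbook for the next incident. As one illustration, here is a minimal sketch assuming the service runs on Amazon ECS; the cluster and service names are placeholders, and a Kubernetes or other platform would use its own rollback mechanism instead.

```python
import boto3

# Hypothetical cluster/service names; substitute your own.
CLUSTER = "prod"
SERVICE = "service-a"

ecs = boto3.client("ecs")

# Task definition the broken deploy currently points at.
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
current_td = svc["taskDefinition"]
family = current_td.split("/")[-1].split(":")[0]

# Most recent revisions in the family, newest first; index 1 is the previous one.
revisions = ecs.list_task_definitions(
    familyPrefix=family, sort="DESC", maxResults=2
)["taskDefinitionArns"]
previous_td = revisions[1]

# Point the service back at the previous task definition (i.e., the previous image).
ecs.update_service(cluster=CLUSTER, service=SERVICE, taskDefinition=previous_td)
print(f"Rolled back {SERVICE} from {current_td} to {previous_td}")
```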
### 6. Lessons Learned
- What went well?
- What didn’t go well?
- What surprised the team?
### 7. Action Items
Include an owner, a priority, and a due date for each item:
- Add a pre-deploy alert for spikes in 503s (Owner: Ops, Due: 7d) - see the alarm sketch after this list
- Move release script to gated rollout (Owner: Eng, Due: 14d)
- Conduct IAM permissions audit (Owner: SecOps, Due: 30d)
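The first action item can start life as a plain CloudWatch metric alarm that the deploy pipeline checks before and after a rollout. A minimal sketch follows, assuming the service sits behind an Application Load Balancer and that PagerDuty is subscribed to an SNS topic; the load balancer dimension, threshold, and topic ARN are placeholder assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a spike in 5xx responses from the ALB's targets.
cloudwatch.put_metric_alarm(
    AlarmName="service-a-503-spike",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[
        # Placeholder dimension value; copy it from the ALB ARN.
        {"Name": "LoadBalancer", "Value": "app/service-a/0123456789abcdef"},
    ],
    Statistic="Sum",
    Period=60,                # evaluate per minute
    EvaluationPeriods=3,      # three consecutive bad minutes
    Threshold=50,             # more than 50 5xx responses per minute
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # SNS topic subscribed by PagerDuty (placeholder ARN).
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pagerduty-alerts"],
)
```

Treating missing data as `notBreaching` keeps the alarm quiet during low-traffic windows; tune the threshold once real traffic baselines are known.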
## Best Practices
- Hold the AAR within 48 hours of the incident
- Include SRE, Dev, Security, and PM stakeholders
- Keep the report visible to everyone internally, and anonymize it before sharing externally
## 📥 Full Markdown Template
# 📄 Cloud Outage After-Action Report (AAR)
## 1. Summary
_What happened, how long it lasted, who was affected._
---
## 2. Timeline
| Time | Event |
|--------|----------------------------------------|
| 00:04 | Alert fired in PagerDuty (503 errors) |
| 00:06 | Engineer on-call acknowledged |
| 00:10 | Error correlation in logs |
| 00:12 | Rolled back to previous deploy |
| 00:16 | Service restored |
---
## 3. Impact
- Affected services:
- User impact:
- Outage duration:
- Operational cost (if any):
---
## 4. Root Cause
_What caused the incident (technical focus, no blame)._
---
## 5. Resolution
_What resolved the issue? Rollbacks, patches, scaling, etc._
---
## 6. Lessons Learned
- ✅ What went well:
- ❌ What didn’t:
- 🔍 What surprised us:
---
## 7. Action Items
| Item | Owner | Priority | Due |
|------------------------------------------|-----------|----------|----------|
| Add 503 alert pre-deploy check | Ops | High | 7d |
| Gated rollout for deploys | Engineering | Medium | 14d |
| IAM audit across services | Security | Medium | 30d |
---
_Use this template for internal postmortems and share anonymized versions externally when needed._