This after-action report (AAR) template helps teams document cloud outages clearly and consistently. It is designed to support root cause analysis, highlight recovery efforts, and ensure leadership and engineering teams learn from every incident.
## Why Use an AAR?
- Prevent repeat outages by addressing systemic gaps
- Establish operational transparency across teams
- Create documentation that supports audits and continuous improvement
## 📄 AAR Template
### 1. Summary
One paragraph describing what happened, how long it lasted, and who was affected.
### 2. Timeline
List events in order (with timestamps): detection, escalation, mitigation steps, and resolution.
- 00:04 - Alert fired in PagerDuty (503 errors from service A)
- 00:06 - On-call engineer acknowledged the page
- 00:10 - Correlated CloudWatch logs showed an increased error rate on the new deploy
- 00:12 - Rolled back the release to the previous image
- 00:16 - Service health restored
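The correlation step at 00:10 is usually the hardest part of the timeline to reconstruct afterward, so it helps to capture the actual query that was run. Below is a minimal sketch, assuming the service writes structured logs to CloudWatch Logs; the log group name, the `status` field, and the time window are placeholders, not values from this incident.

```python
import time
import boto3

# Hypothetical log group and time window; adjust to your environment.
LOG_GROUP = "/ecs/service-a"
END = int(time.time())
START = END - 3600  # last hour

logs = boto3.client("logs")

# Count 503 responses per minute around the deploy window.
query = """
fields @timestamp, status
| filter status = 503
| stats count() as errors by bin(1m)
| sort bin(1m) asc
"""

resp = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=START,
    endTime=END,
    queryString=query,
)

# Poll until the query finishes, then print per-minute error counts.
while True:
    result = logs.get_query_results(queryId=resp["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({f["field"]: f["value"] for f in row})
```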
### 3. Impact
- Which services were affected?
- How many users were impacted?
- What was the duration?
- What was the estimated cost or operational disruption? (a rough quantification sketch follows this list)
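If exact user counts are unavailable, a back-of-the-envelope estimate is still more useful than an empty field. The sketch below is only illustrative; the traffic rate, error ratio, and revenue-per-request figures are invented placeholders to show the arithmetic, not measurements from this incident.

```python
from datetime import datetime

# Timeline boundaries from the example above (placeholder date).
detected = datetime.fromisoformat("2024-01-01T00:04:00")
restored = datetime.fromisoformat("2024-01-01T00:16:00")
duration_min = (restored - detected).total_seconds() / 60

# Hypothetical traffic and business figures; replace with real metrics.
requests_per_min = 12_000      # average traffic during the window
error_ratio = 0.35             # share of requests that returned 503
revenue_per_request = 0.002    # rough dollars per successful request

failed_requests = duration_min * requests_per_min * error_ratio
estimated_cost = failed_requests * revenue_per_request

print(f"Duration: {duration_min:.0f} min")
print(f"Failed requests: {failed_requests:,.0f}")
print(f"Estimated revenue impact: ${estimated_cost:,.2f}")
```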
### 4. Root Cause
What triggered the outage? Note the environment, any misconfigurations, and any human error involved. Keep the analysis blame-free, technical, and specific.
### 5. Resolution
What resolved the incident? Note whether mitigation was manual or automated, and include rollback procedures or emergency patches used.
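Recording the exact rollback steps turns the report into a reusable runbook for the next incident. As one illustration, here is a minimal sketch assuming the service runs on Amazon ECS; the cluster and service names are placeholders, and a Kubernetes or other platform would use its own rollback mechanism instead.

```python
import boto3

# Hypothetical cluster/service names; substitute your own.
CLUSTER = "prod"
SERVICE = "service-a"

ecs = boto3.client("ecs")

# Task definition the broken deploy currently points at.
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
current_td = svc["taskDefinition"]
family = current_td.split("/")[-1].split(":")[0]

# Most recent revisions in the family, newest first; index 1 is the previous one.
revisions = ecs.list_task_definitions(
    familyPrefix=family, sort="DESC", maxResults=2
)["taskDefinitionArns"]
previous_td = revisions[1]

# Point the service back at the previous task definition (i.e., the previous image).
ecs.update_service(cluster=CLUSTER, service=SERVICE, taskDefinition=previous_td)
print(f"Rolled back {SERVICE} from {current_td} to {previous_td}")
```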
### 6. Lessons Learned
- What went well?
- What didn’t go well?
- What surprised the team?
### 7. Action Items
Include an owner, a priority, and a due date for each item:
- Add a pre-deploy alert for spikes in 503s (Owner: Ops, Due: 7d) - see the alarm sketch after this list
- Move release script to gated rollout (Owner: Eng, Due: 14d)
- Conduct IAM permissions audit (Owner: SecOps, Due: 30d)
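The first action item can start life as a plain CloudWatch metric alarm that the deploy pipeline checks before and after a rollout. A minimal sketch follows, assuming the service sits behind an Application Load Balancer and that PagerDuty is subscribed to an SNS topic; the load balancer dimension, threshold, and topic ARN are placeholder assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a spike in 5xx responses from the ALB's targets.
cloudwatch.put_metric_alarm(
    AlarmName="service-a-503-spike",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[
        # Placeholder dimension value; copy it from the ALB ARN.
        {"Name": "LoadBalancer", "Value": "app/service-a/0123456789abcdef"},
    ],
    Statistic="Sum",
    Period=60,                # evaluate per minute
    EvaluationPeriods=3,      # three consecutive bad minutes
    Threshold=50,             # more than 50 5xx responses per minute
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # SNS topic subscribed by PagerDuty (placeholder ARN).
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pagerduty-alerts"],
)
```

Treating missing data as `notBreaching` keeps the alarm quiet during low-traffic windows; tune the threshold once real traffic baselines are known.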
## Best Practices
- Hold the AAR within 48 hours of the incident
- Include SRE, Dev, Security, and PM stakeholders
- Keep the report visible to everyone internally, and anonymize it before sharing externally
## 📥 Full Markdown Template
# 📄 Cloud Outage After-Action Report (AAR)
## 1. Summary
_What happened, how long it lasted, who was affected._
---
## 2. Timeline
| Time | Event |
|--------|----------------------------------------|
| 00:04 | Alert fired in PagerDuty (503 errors) |
| 00:06 | Engineer on-call acknowledged |
| 00:10 | Error correlation in logs |
| 00:12 | Rolled back to previous deploy |
| 00:16 | Service restored |
---
## 3. Impact
- Affected services:
- User impact:
- Outage duration:
- Operational cost (if any):
---
## 4. Root Cause
_What caused the incident (technical focus, no blame)._
---
## 5. Resolution
_What resolved the issue? Rollbacks, patches, scaling, etc._
---
## 6. Lessons Learned
- ✅ What went well:
- ❌ What didn’t:
- 🔍 What surprised us:
---
## 7. Action Items
| Item | Owner | Priority | Due |
|------------------------------------------|-----------|----------|----------|
| Add 503 alert pre-deploy check | Ops | High | 7d |
| Gated rollout for deploys | Engineering | Medium | 14d |
| IAM audit across services | Security | Medium | 30d |
---
_Use this template for internal postmortems and share anonymized versions externally when needed._