Summary: Multi-cloud response playbook for diagnosing and mitigating unexpected CPU spikes in AWS EC2, Azure VMs, and GCP Compute Engine. Covers metrics, process inspection, scaling strategies, rollback techniques, and prevention. If not handled quickly, high CPU can stall services, degrade API latency, and drive up cost through reactive overprovisioning. This guide equips engineers with proven methods to triage, resolve, and harden infrastructure against repeat incidents.
1. Introduction
High CPU usage is one of the most common infrastructure bottlenecks and is often a signal of cascading problems, from code regressions to scaling misconfigurations. This playbook standardizes how teams should respond across AWS EC2, Azure Virtual Machines, and Google Compute Engine (GCE) environments. By implementing this guidance, organizations can reduce downtime, lower mean time to resolution (MTTR), and give SRE and DevOps teams confidence in their response flow.
2. Purpose
This playbook defines a repeatable approach for diagnosing and resolving high CPU utilization across AWS, Azure, and GCP virtual machines. It aims to improve incident response speed, reduce user impact, and document effective remediation strategies for teams operating in multi-cloud environments.
3. Scope
This guidance applies to workloads deployed on EC2, Azure Virtual Machines, and Google Compute Engine, where elevated CPU load may affect application availability, latency, or cost. It is designed for production and performance environments and assumes infrastructure observability is in place.
4. Definitions
- High CPU: Sustained CPU usage above a defined threshold (typically >80%) that impacts application performance.
- Blast Radius: The impact zone of a resource failure or degradation (users, services, APIs).
- Hot Process: The process on a VM consuming the majority of CPU cycles, potentially causing throttling or stalls.
5. Prerequisites
- Cloud monitoring agents enabled (CloudWatch Agent, Azure Monitor Agent, Ops Agent)
- IAM/Role-based access to view logs, metrics, and restart services
- Terminal or console access to VMs (SSM for AWS, Bastion for Azure, IAP SSH for GCP); example connection commands follow this list
- Knowledge of normal CPU patterns for key services and applications
- Access to a shared team runbook or dashboard for escalation procedures
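The commands below are a minimal connection sketch for the three access paths named above, assuming Session Manager, Bastion, and IAP are already provisioned; every instance ID, resource name, subscription ID, and zone is a placeholder.

```bash
# Connection sketch only; substitute your own instance IDs, resource names, and zones.

# AWS: open a shell via Systems Manager Session Manager (no inbound SSH required)
aws ssm start-session --target i-0123456789abcdef0

# Azure: SSH through Azure Bastion (requires the bastion extension and native client support)
az network bastion ssh \
  --name myBastion \
  --resource-group myResourceGroup \
  --target-resource-id /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --auth-type ssh-key --username azureuser --ssh-key ~/.ssh/id_rsa

# GCP: SSH over an IAP tunnel (works for instances without external IPs)
gcloud compute ssh my-instance --zone us-central1-a --tunnel-through-iap
```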
6. Triage Workflow
- Detect the anomaly (example alert definitions follow this workflow)
- AWS: CloudWatch alarms based on CPUUtilization + SNS notification
- Azure: Azure Monitor alerts on Percentage CPU + email/webhook
- GCP: Cloud Monitoring alerts on CPU usage + Pub/Sub or email notification
- Access the affected host
- AWS: Connect via Session Manager or EC2 Instance Connect
- Azure: Use Bastion or temporary inbound access
- GCP: Use IAP tunnel or gcloud compute ssh
- Profile the system
  top -o %CPU
  htop
  ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -10
Use these tools to detect looping apps, memory starvation, or runaway processes.
- Correlate with telemetry
- Review APM tools (e.g., New Relic, Dynatrace, Datadog) for service-level insights
- Inspect log ingestion tools (e.g., CloudWatch Logs, Azure Log Analytics, GCP Logs Explorer)
- Review any deployments or config changes in the last 15 minutes
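As a concrete starting point for the detection step, the following is a hedged sketch of the CPU alerts described above; the alarm names, thresholds, scopes, SNS topic ARN, action group, and GCP policy file are all placeholders to adapt to your environment.

```bash
# Alert sketch only; names, thresholds, ARNs, and the policy file are placeholders.

# AWS: CloudWatch alarm on CPUUtilization, notifying an SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-i-0123456789abcdef0 \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average --period 300 --evaluation-periods 2 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:cpu-alerts

# Azure: metric alert on Percentage CPU, firing an action group
az monitor metrics alert create \
  --name high-cpu-myVM \
  --resource-group myResourceGroup \
  --scopes /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m --evaluation-frequency 1m \
  --action cpu-alerts-action-group

# GCP: alerting policy from a JSON definition containing the CPU condition and notification channel
gcloud alpha monitoring policies create --policy-from-file=high-cpu-policy.json
```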
7. Mitigation Options
- Kill or restart the offending process (only if stateless or safe to retry)
- Perform vertical scaling (see the resize sketch after this list):
- AWS: Switch to a larger EC2 instance type
- Azure: Resize the VM to a larger size (e.g., Standard_D2s_v3 to Standard_D4s_v3)
- GCP: Use instance recommendations or resize to a machine type with more vCPUs
- Horizontal scaling: Add nodes or increase autoscaling min/max thresholds (see the scale-out sketch after this list)
- Enable load shedding: Implement throttling, connection limits, or failover modes
- Drain or rotate instance: For Auto Scaling groups (ASGs), VM Scale Sets (VMSS), or managed instance groups (MIGs), terminate and replace the host (see the rotation sketch after this list)
- Notify stakeholders: Communicate mitigation plan in incident Slack or ticket system
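The vertical-scaling option can be executed roughly as follows. This is a sketch that assumes a brief maintenance window, since resizing generally requires stopping the instance; all instance names, sizes, and zones are placeholders.

```bash
# Resize sketch only; expect downtime while the instance is stopped.

# AWS: stop, change instance type, start
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
  --instance-type "{\"Value\": \"m5.xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# Azure: resize the VM (Azure restarts or deallocates it if the new size needs different hardware)
az vm resize --resource-group myResourceGroup --name myVM --size Standard_D4s_v3

# GCP: stop, set a larger machine type, start
gcloud compute instances stop my-instance --zone us-central1-a
gcloud compute instances set-machine-type my-instance \
  --zone us-central1-a --machine-type n2-standard-8
gcloud compute instances start my-instance --zone us-central1-a
```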
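For horizontal scaling, a hedged sketch of raising capacity on the respective scaling groups; group names, zones, and counts are placeholders.

```bash
# Scale-out sketch only; verify current capacity and quotas before raising limits.

# AWS: raise min/desired/max on the Auto Scaling group
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --min-size 3 --desired-capacity 5 --max-size 10

# Azure: scale the VM Scale Set to a new instance count
az vmss scale --resource-group myResourceGroup --name myScaleSet --new-capacity 5

# GCP: resize the managed instance group (or raise its autoscaler ceiling)
gcloud compute instance-groups managed resize my-mig --size 5 --zone us-central1-a
```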
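For draining or rotating a single bad host, a sketch of replacing it within its group; instance and group names are placeholders.

```bash
# Rotation sketch only; confirm the group has healthy spare capacity first.

# AWS: terminate the instance and let the ASG replace it (desired capacity unchanged)
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 --no-should-decrement-desired-capacity

# Azure: delete the unhealthy VMSS instance (scale back out afterwards if capacity is fixed)
az vmss delete-instances --resource-group myResourceGroup --name myScaleSet --instance-ids 3

# GCP: recreate the instance within its managed instance group
gcloud compute instance-groups managed recreate-instances my-mig \
  --instances my-mig-abcd --zone us-central1-a
```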
8. Root Cause Analysis (RCA)
- Review historical metrics before, during, and after the spike
- Download logs, memory dumps, and any core files if a crash occurred
- Interview app owners: was this expected load, a deploy, or an anomaly?
- Document system state (processes, usage, scaling limits) at the time of the incident (see the capture sketch after this list)
- Capture relevant trace IDs or transaction logs for forensic correlation
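To make the system-state documentation step repeatable, a hedged capture sketch using standard Linux tools; run it on the affected host before the instance is replaced. The output paths are placeholders and should be adapted to your forensics process.

```bash
# Evidence-capture sketch; adjust paths and retention to your own process.
TS=$(date +%Y%m%dT%H%M%S)
mkdir -p /tmp/incident-$TS

# Snapshot the hottest processes and overall load
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -20 > /tmp/incident-$TS/top-processes.txt
uptime > /tmp/incident-$TS/load.txt

# Kernel messages and recent service logs
dmesg -T | tail -n 200 > /tmp/incident-$TS/dmesg.txt
journalctl --since "1 hour ago" > /tmp/incident-$TS/journal.txt

# Historical CPU samples, if sysstat is installed
sar -u 2>/dev/null > /tmp/incident-$TS/sar-cpu.txt || true
```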
9. Preventive Actions
- Create guardrails: circuit breakers, queue backpressure, and retry caps
- Use autoscaling policies based on rolling 5-minute CPU averages (example policies follow this list)
- Tag alerts by severity and assign ownership by runbook or Slack channel
- Test performance thresholds during load testing or game days
- Set up synthetic monitoring probes and alerts for degraded UX
- Define clear SLOs for response and resolution based on business impact
- Run quarterly cost reviews to confirm instances remain right-sized and compute is not overprovisioned
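A sketch of CPU-driven autoscaling policies for each provider, illustrating the rolling-average approach above; group names, target values, and windows are placeholders, not recommended settings.

```bash
# Autoscaling sketch only; tune targets and windows to your workload's normal CPU pattern.

# AWS: target-tracking policy that keeps ASG average CPU near 60%
cat > cpu-target.json <<'EOF'
{
  "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
  "TargetValue": 60.0
}
EOF
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-target-60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration file://cpu-target.json

# Azure: autoscale setting plus a rule that adds one instance when 5-minute average CPU exceeds 70%
az monitor autoscale create --resource-group myResourceGroup \
  --resource myScaleSet --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name cpu-autoscale --min-count 2 --max-count 10 --count 2
az monitor autoscale rule create --resource-group myResourceGroup \
  --autoscale-name cpu-autoscale \
  --condition "Percentage CPU > 70 avg 5m" --scale out 1

# GCP: autoscale the managed instance group toward 60% CPU utilization
gcloud compute instance-groups managed set-autoscaling my-mig \
  --zone us-central1-a --min-num-replicas 2 --max-num-replicas 10 \
  --target-cpu-utilization 0.6
```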
10. Diagrams
11. Implementation Checklist
- ☑️ CPU threshold alerts enabled in all cloud providers
- ☑️ On-call staff can access affected VMs within 2 minutes
- ☑️ Metrics dashboards bookmarked for EC2, Azure VMs, and GCE
- ☑️ RCA template available and linked in the internal knowledge base
- ☑️ Application owners notified of past spikes and action plans
- ☑️ Team runbooks updated with recent mitigations or triage improvements