Summary: Multi-cloud response playbook for diagnosing and mitigating unexpected CPU spikes in AWS EC2, Azure VMs, and GCP Compute Engine. Covers metrics, process inspection, scaling strategies, rollback techniques, and prevention. If not handled quickly, high CPU can stall services, degrade API latency, and drive up cost through reactive overprovisioning. This guide equips engineers with proven methods to triage, resolve, and harden infrastructure against repeat incidents.
1. Introduction
High CPU usage is one of the most common infrastructure bottlenecks and is often a signal of cascading problems, from code regressions to scaling misconfigurations. This playbook standardizes how teams should respond across AWS EC2, Azure Virtual Machines, and Google Compute Engine (GCE) environments. By implementing this guidance, organizations can reduce downtime, lower mean time to resolution (MTTR), and give SRE and DevOps teams confidence in their response flow.
2. Purpose
This playbook defines a repeatable approach for diagnosing and resolving high CPU utilization across AWS, Azure, and GCP virtual machines. It aims to improve incident response speed, reduce user impact, and document effective remediation strategies for teams operating in multi-cloud environments.
3. Scope
This guidance applies to workloads deployed on EC2, Azure Virtual Machines, and Google Compute Engine, where elevated CPU load may affect application availability, latency, or cost. It is designed for production and performance environments and assumes infrastructure observability is in place.
4. Definitions
- High CPU: Sustained CPU usage above a defined threshold (typically >80%) that impacts application performance.
- Blast Radius: The impact zone of a resource failure or degradation (users, services, APIs).
- Hot Process: The process on a VM consuming the majority of CPU cycles, potentially causing throttling or stalls.
5. Prerequisites
- Cloud monitoring agents enabled (CloudWatch Agent, Azure Monitor Agent, Ops Agent)
- IAM/Role-based access to view logs, metrics, and restart services
- Terminal or console access to VMs (SSM for AWS, Bastion for Azure, IAP SSH for GCP); example connection commands follow this list
- Knowledge of normal CPU patterns for key services and applications
- Access to a shared team runbook or dashboard for escalation procedures
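The commands below are a minimal connection sketch for the three access paths named above, assuming Session Manager, Bastion, and IAP are already provisioned; every instance ID, resource name, subscription ID, and zone is a placeholder.

```bash
# Connection sketch only; substitute your own instance IDs, resource names, and zones.

# AWS: open a shell via Systems Manager Session Manager (no inbound SSH required)
aws ssm start-session --target i-0123456789abcdef0

# Azure: SSH through Azure Bastion (requires the bastion extension and native client support)
az network bastion ssh \
  --name myBastion \
  --resource-group myResourceGroup \
  --target-resource-id /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --auth-type ssh-key --username azureuser --ssh-key ~/.ssh/id_rsa

# GCP: SSH over an IAP tunnel (works for instances without external IPs)
gcloud compute ssh my-instance --zone us-central1-a --tunnel-through-iap
```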
6. Triage Workflow
- Detect the anomaly (example alert definitions follow this workflow)
- AWS: CloudWatch alarms based on CPUUtilization + SNS notification
- Azure: Azure Monitor alerts on Percentage CPU + email/webhook
- GCP: Cloud Monitoring alerts on CPU usage + Pub/Sub or email notification
- Access the affected host
- AWS: Connect via Session Manager or EC2 Instance Connect
- Azure: Use Bastion or temporary inbound access
- GCP: Use IAP tunnel or gcloud compute ssh
- Profile the system
  top -o %CPU
  htop
  ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -10
Use these tools to detect looping apps, memory starvation, or runaway processes.
- Correlate with telemetry
- Review APM tools (e.g., New Relic, Dynatrace, Datadog) for service-level insights
- Inspect log ingestion tools (e.g., CloudWatch Logs, Azure Log Analytics, GCP Logs Explorer)
- Review any deployments or config changes in the last 15 minutes
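As a concrete starting point for the detection step, the following is a hedged sketch of the CPU alerts described above; the alarm names, thresholds, scopes, SNS topic ARN, action group, and GCP policy file are all placeholders to adapt to your environment.

```bash
# Alert sketch only; names, thresholds, ARNs, and the policy file are placeholders.

# AWS: CloudWatch alarm on CPUUtilization, notifying an SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-i-0123456789abcdef0 \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average --period 300 --evaluation-periods 2 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:cpu-alerts

# Azure: metric alert on Percentage CPU, firing an action group
az monitor metrics alert create \
  --name high-cpu-myVM \
  --resource-group myResourceGroup \
  --scopes /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m --evaluation-frequency 1m \
  --action cpu-alerts-action-group

# GCP: alerting policy from a JSON definition containing the CPU condition and notification channel
gcloud alpha monitoring policies create --policy-from-file=high-cpu-policy.json
```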
7. Mitigation Options
- Kill or restart the offending process (only if stateless or safe to retry)
- Perform vertical scaling (see the resize sketch after this list):
- AWS: Switch to a larger EC2 instance type
- Azure: Resize the VM to a larger size (e.g., Standard_D2s_v3 to Standard_D4s_v3)
- GCP: Use instance recommendations or resize to a machine type with more vCPUs
- Horizontal scaling: Add nodes or increase autoscaling min/max thresholds (see the scale-out sketch after this list)
- Enable load shedding: Implement throttling, connection limits, or failover modes
- Drain or rotate instance: For Auto Scaling groups (ASGs), VM Scale Sets (VMSS), or managed instance groups (MIGs), terminate and replace the host (see the rotation sketch after this list)
- Notify stakeholders: Communicate mitigation plan in incident Slack or ticket system
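The vertical-scaling option can be executed roughly as follows. This is a sketch that assumes a brief maintenance window, since resizing generally requires stopping the instance; all instance names, sizes, and zones are placeholders.

```bash
# Resize sketch only; expect downtime while the instance is stopped.

# AWS: stop, change instance type, start
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
  --instance-type "{\"Value\": \"m5.xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# Azure: resize the VM (Azure restarts or deallocates it if the new size needs different hardware)
az vm resize --resource-group myResourceGroup --name myVM --size Standard_D4s_v3

# GCP: stop, set a larger machine type, start
gcloud compute instances stop my-instance --zone us-central1-a
gcloud compute instances set-machine-type my-instance \
  --zone us-central1-a --machine-type n2-standard-8
gcloud compute instances start my-instance --zone us-central1-a
```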
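For horizontal scaling, a hedged sketch of raising capacity on the respective scaling groups; group names, zones, and counts are placeholders.

```bash
# Scale-out sketch only; verify current capacity and quotas before raising limits.

# AWS: raise min/desired/max on the Auto Scaling group
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --min-size 3 --desired-capacity 5 --max-size 10

# Azure: scale the VM Scale Set to a new instance count
az vmss scale --resource-group myResourceGroup --name myScaleSet --new-capacity 5

# GCP: resize the managed instance group (or raise its autoscaler ceiling)
gcloud compute instance-groups managed resize my-mig --size 5 --zone us-central1-a
```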
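For draining or rotating a single bad host, a sketch of replacing it within its group; instance and group names are placeholders.

```bash
# Rotation sketch only; confirm the group has healthy spare capacity first.

# AWS: terminate the instance and let the ASG replace it (desired capacity unchanged)
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 --no-should-decrement-desired-capacity

# Azure: delete the unhealthy VMSS instance (scale back out afterwards if capacity is fixed)
az vmss delete-instances --resource-group myResourceGroup --name myScaleSet --instance-ids 3

# GCP: recreate the instance within its managed instance group
gcloud compute instance-groups managed recreate-instances my-mig \
  --instances my-mig-abcd --zone us-central1-a
```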
8. Root Cause Analysis (RCA)
- Review historical metrics before, during, and after the spike
- Download logs, memory dumps, and any core files if a crash occurred
- Interview app owners: was this expected load, a deploy, or an anomaly?
- Document system state (processes, usage, scaling limits) at the time of the incident (see the capture sketch after this list)
- Capture relevant trace IDs or transaction logs for forensic correlation
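To make the system-state documentation step repeatable, a hedged capture sketch using standard Linux tools; run it on the affected host before the instance is replaced. The output paths are placeholders and should be adapted to your forensics process.

```bash
# Evidence-capture sketch; adjust paths and retention to your own process.
TS=$(date +%Y%m%dT%H%M%S)
mkdir -p /tmp/incident-$TS

# Snapshot the hottest processes and overall load
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -20 > /tmp/incident-$TS/top-processes.txt
uptime > /tmp/incident-$TS/load.txt

# Kernel messages and recent service logs
dmesg -T | tail -n 200 > /tmp/incident-$TS/dmesg.txt
journalctl --since "1 hour ago" > /tmp/incident-$TS/journal.txt

# Historical CPU samples, if sysstat is installed
sar -u 2>/dev/null > /tmp/incident-$TS/sar-cpu.txt || true
```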
9. Preventive Actions
- Create guardrails: circuit breakers, queue backpressure, and retry caps
- Use autoscaling policies based on rolling 5-minute CPU averages (example policies follow this list)
- Tag alerts by severity and assign ownership by runbook or Slack channel
- Test performance thresholds during load testing or game days
- Set up synthetic monitoring probes and alerts for degraded UX
- Define clear SLOs for response and resolution based on business impact
- Run quarterly cost reviews to confirm instances remain right-sized and compute is not overprovisioned
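A sketch of CPU-driven autoscaling policies for each provider, illustrating the rolling-average approach above; group names, target values, and windows are placeholders, not recommended settings.

```bash
# Autoscaling sketch only; tune targets and windows to your workload's normal CPU pattern.

# AWS: target-tracking policy that keeps ASG average CPU near 60%
cat > cpu-target.json <<'EOF'
{
  "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
  "TargetValue": 60.0
}
EOF
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-target-60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration file://cpu-target.json

# Azure: autoscale setting plus a rule that adds one instance when 5-minute average CPU exceeds 70%
az monitor autoscale create --resource-group myResourceGroup \
  --resource myScaleSet --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name cpu-autoscale --min-count 2 --max-count 10 --count 2
az monitor autoscale rule create --resource-group myResourceGroup \
  --autoscale-name cpu-autoscale \
  --condition "Percentage CPU > 70 avg 5m" --scale out 1

# GCP: autoscale the managed instance group toward 60% CPU utilization
gcloud compute instance-groups managed set-autoscaling my-mig \
  --zone us-central1-a --min-num-replicas 2 --max-num-replicas 10 \
  --target-cpu-utilization 0.6
```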
10. Diagrams
11. Implementation Checklist
- ☑️ CPU threshold alerts enabled in all cloud providers
- ☑️ On-call staff can access affected VMs within 2 minutes
- ☑️ Metrics dashboards bookmarked for EC2, Azure VMs, and GCE
- ☑️ RCA template available and linked in the internal knowledge base
- ☑️ Application owners notified of past spikes and action plans
- ☑️ Team runbooks updated with recent mitigations or triage improvements