✅ ✅ 1. CPU Troubleshooting Framework (Like iostat for CPU)
Use:
vmstat 2 5
or
top
📊 ✅ 2. CPU Metrics Explained (vmstat / top)
us sy id wa st
| Metric | Meaning |
|---|---|
| us | User CPU (app/DB queries) |
| sy | System CPU (kernel/syscalls) |
| id | Idle CPU |
| wa | IO wait |
| st | Steal (VM contention) |
✅ ✅ 3. CPU Baseline Table (Production Standard)
📊 CPU Utilization
| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 |
|---|---|---|---|
| idle (id) | > 40% | 20–40% | < 20% |
| user (us) | < 50% | 50–70% | > 70% |
| system (sy) | < 20% | 20–30% | > 30% |
| iowait (wa) | < 5% | 5–15% | > 20% |
| steal (st) | 0 | 1–5 | > 5 |
📊 Run Queue (Very Important)
r (run queue)
| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 |
|---|---|---|---|
| r vs CPU cores | ≤ cores | 1.5x cores | 2x+ cores |
✅ ✅ 4. CPU Health Decision Matrix
| Condition | Meaning |
|---|---|
| High us | CPU-heavy queries |
| High sy | kernel/system overhead |
| High wa | not CPU → disk issue |
| High r | CPU contention |
| High st | VM issue |
✅ ✅ 5. CPU Scoring Model (Like Disk)
🎯 Formula
Score = 100
- user penalty
- system penalty
- run queue penalty
- iowait penalty
📊 Penalties
✅ User CPU
- <50 → 0
- 50–70 → -15
70 → -30
✅ System CPU
- <20 → 0
- 20–30 → -15
30 → -30
✅ Run Queue
- ≤ cores → 0
- 1–2x cores → -20
2x cores → -40
✅ IO Wait
- <5 → 0
- 5–15 → -20
15 → -40
✅ Final Status
| Score | Status |
|---|---|
| 80–100 | ✅ Healthy |
| 60–80 | ⚠️ Warning |
| 40–60 | 🔶 Degraded |
| <40 | 🚨 Critical |
✅ ✅ 6. CPU Monitoring Script (PRODUCTION READY)
📜 cpu_health.sh
#!/bin/bash
echo "===== CPU Health Check ====="
date
vmstat 2 3 | tail -1 | awk '
{
r=$1
us=$13
sy=$14
id=$15
wa=$16
st=$17
score=100
# user penalty
if (us > 70) score -= 30
else if (us > 50) score -= 15
# system penalty
if (sy > 30) score -= 30
else if (sy > 20) score -= 15
# iowait penalty
if (wa > 15) score -= 40
else if (wa > 5) score -= 20
# status
status="HEALTHY"
if (score < 40) status="CRITICAL"
else if (score < 60) status="DEGRADED"
else if (score < 80) status="WARNING"
printf "RunQ=%d User=%d%% Sys=%d%% Idle=%d%% IOwait=%d%% Status=%s Score=%d\n",r,us,sy,id,wa,status,score
}
'
Identify CPU Bottleneck (Root Cause)✅ Step 1: Check top processes
✅ Step 2: Per process CPU
✅ Step 3: Thread-level (VERY IMPORTANT for DB)
✅ Step 4: Per process breakdown
✅ ✅ 8. CPU Issue Patterns (DBA Mapping)
🔴 Pattern 1: High USER CPU
us > 70%
👉 Cause:
- Complex SQL
- Full scans
- bad execution plans
✅ DB Mapping:
🔴 Pattern 2: High SYSTEM CPU
sy > 30%
👉 Cause:
- excessive context switching
- kernel overhead
- I/O interrupts
✅ Check:
vmstat 1
Look at:
cs (context switches)
in (interrupts)
🔴 Pattern 3: High RUN QUEUE
r >> CPU cores
👉 Cause:
- CPU contention
- too many concurrent queries
🔴 Pattern 4: High IO WAIT
wa > 20%
👉 Cause:
- NOT CPU problem
👉 disk bottleneck (go back to iostat)
🔴 Pattern 5: High STEAL (VM)
st > 5%
👉 Cause:
- host contention (cloud/VM issue)
✅ ✅ 9. CPU Alert Script (Automation)
📜 cpu_alert.sh
#!/bin/bash
HOST=$(hostname)
ALERT=$(vmstat 2 2 | tail -1 | awk '
{
us=$13; sy=$14; wa=$16; st=$17
if (us>70 || sy>30 || wa>15 || st>5)
print "CPU Issue: user="us"% sys="sy"% wa="wa"% st="st"%"
}
')
if [ ! -z "$ALERT" ]; then
echo "$ALERT" | mail -s "CPU Alert on $HOST" your_email@company.com
fi
``✅ ✅ 10. DBA Troubleshooting Flow
🔥 When CPU alert triggers:
1. Check:
2. Identify pattern:
- High us → SQL issue
- High sy → OS/kernel
- High wa → disk issue
3. Identify process:
4. Map to DB:
5. Fix:
- Tune query ✅
- Index ✅
- Reduce parallelism ✅
- Limit sessions ✅
✅ ✅ 11. Golden Rules (CPU)
✅ Healthy
idle > 40%
wa < 5%
r <= cores
🚨 Critical CPU
idle < 10%
us > 70%
r > 2x cores
⚠️ Trick Case (VERY IMPORTANT)
CPU looks high BUT wa is high
👉 NOT CPU issue → DISK issue
🎯 Final Outcome
Now you have:
✅ CPU baseline table
✅ Health scoring model
✅ Monitoring script
✅ Alerting system
✅ DB correlation
✅ Troubleshooting flow