0️⃣ Runbook Objectives
This runbook helps you:
✅ Quickly identify CPU, I/O, memory, or process issues
✅ Correlate OS metrics with database / application symptoms
✅ Avoid random commands during incidents
✅ Reach root cause, not just symptom relief
1️⃣ Incident Intake (ALWAYS FIRST)
Before touching the system, collect:
| Question | Why |
|---|---|
| What is impacted? (DB, app, batch, login) | Scope |
| Since when? | Time correlation |
| All users or subset? | Severity |
| Any recent changes? | Deploy / patch |
| Error messages? | Symptom confirmation |
📌 Do not skip this step. It typically saves 30–40% of the time spent later.
2️⃣ High‑Level System Health Snapshot (30 seconds)
2.1 Uptime & Load
Focus on:
- Load average vs CPU cores
- Sudden spike timeframe
✅ Load average > CPU core count → investigate
✅ High load + low CPU usage → likely I/O wait
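A minimal sketch of the load-versus-cores comparison, assuming a Linux host with `/proc/loadavg` and coreutils `nproc`. The function names `load_status` and `check_load_now` are illustrative only.

```shell
# Compare a load average with the number of CPU cores.
# $1 = load average (e.g. 9.50), $2 = core count
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (load > cores) print "INVESTIGATE: load " load " > " cores " cores"
    else              print "OK: load " load " <= " cores " cores"
  }'
}

# Live check (assumes Linux): 1-minute load from /proc/loadavg, cores from nproc
check_load_now() {
  load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
}
```

Run `uptime` first to see all three load averages (1, 5, 15 minutes), then `check_load_now` for the quick verdict.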
2.2 Disk Space (quick sanity)
🚨 Any filesystem ≥ 90% full → free space immediately
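One way to apply the 90% rule, sketched under the assumption that `df -P` (POSIX output format) is available and mount points contain no spaces. `full_filesystems` is a hypothetical helper name.

```shell
# Print mount points at or above a usage threshold (default 90%).
# Reads `df -P` output on stdin so it also works on saved output.
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)                       # "95%" -> "95"
    if (pct + 0 >= limit + 0) print $6, $5  # mount point, usage
  }'
}
```

Typical use: `df -P | full_filesystems 90`. An empty result means no filesystem has crossed the threshold.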
3️⃣ Real‑Time CPU & Memory View (top)
3.1 CPU Line Interpretation
%Cpu(s): 12 us, 5 sy, 0 ni, 60 id, 23 wa
| Metric | Meaning |
|---|---|
| us | App/SQL CPU |
| sy | Kernel CPU |
| id | Idle |
| wa | I/O wait |
🚨 wa > 15% → jump to I/O section
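A small sketch for pulling the `wa` value out of the CPU summary line in batch mode, assuming procps `top` and the `%Cpu(s):` line layout shown above. `iowait_from_top` is an illustrative name. It keeps the last CPU line it sees, because the first batch iteration is commonly reported to reflect longer-term averages rather than the current interval.

```shell
# Print the I/O wait percentage from top's "%Cpu(s):" line (last one seen).
iowait_from_top() {
  awk '/^%Cpu/ {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^wa/) wa = $(i - 1) + 0   # number just before "wa"
       }
       END { if (wa != "") print wa }'
}
```

Typical use: `top -b -n 2 -d 1 | iowait_from_top`. A result above 15 means moving on to the I/O section.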
3.2 Process Area
- Sort by CPU: Shift + P
- Sort by MEM: Shift + M
Note:
- PID
- %CPU
- %MEM
- COMMAND
👉 Take PID(s) for next step.
4️⃣ Exact Process Diagnosis (ps)
✅ Master Command (use this by default)
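A plausible default, assuming procps `ps` on Linux. The exact column list here is an assumption chosen to cover the indicators used in this section (state, %cpu, etime, wchan); `ps_master` and `state_meaning` are hypothetical helper names.

```shell
# One row per process, busiest first, with the fields this section relies on.
ps_master() {
  ps -eo pid,ppid,user,stat,%cpu,%mem,etime,wchan:32,cmd --sort=-%cpu
}

# Translate the first letter of a STAT value into what it usually signals.
state_meaning() {
  case "$1" in
    R*) echo "running or runnable (check %cpu)" ;;
    D*) echo "uninterruptible sleep (usually blocked on I/O)" ;;
    Z*) echo "zombie (exited, not yet reaped by its parent)" ;;
    S*) echo "interruptible sleep (normal idle state)" ;;
    T*) echo "stopped" ;;
    *)  echo "other" ;;
  esac
}
```

Typical use: `ps_master | head -n 20`, then `state_meaning D+` for any state code that needs decoding.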
What to Look For
| Symptom | Indicator |
|---|---|
| High CPU | %cpu, R state |
| Hung process | D state |
| Long running | High etime |
| I/O wait | wchan = io_schedule |
| Zombie | Z |
🔴 Critical: Find Blocked (I/O) Processes
If the blocked (D state) processes include:
- ora_*
- java
- mysqld
➡️ Storage / filesystem issue almost certain
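A minimal sketch of the blocked-process check, assuming procps `ps` output with the state in the second column. `blocked_procs` is an illustrative name.

```shell
# Keep only processes whose state starts with D (uninterruptible sleep).
# Expects a header line followed by: PID STAT WCHAN COMMAND...
blocked_procs() {
  awk 'NR > 1 && $2 ~ /^D/'
}
```

Typical use: `ps -eo pid,stat,wchan:32,cmd | blocked_procs`. Run it a few times several seconds apart; a PID that stays in the list is the one to chase.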
5️⃣ Kernel Stress View (vmstat)
Key Columns
| Column | Meaning |
|---|---|
| r | Runnable processes |
| b | Blocked processes (uninterruptible sleep, usually I/O) |
| si/so | Swap in / swap out rate |
| wa | I/O wait |
Interpretation Rules
| Observation | Meaning |
|---|---|
| b > 0 | Processes stuck in I/O |
| wa high | Disk latency |
| si/so > 0 | Memory pressure |
| r > CPU cores | CPU contention |
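The rules above can be sketched as a filter over `vmstat` output. This assumes the procps `vmstat` layout with a header row starting `r  b`; columns are looked up by name rather than position. `vmstat_flags` and its labels are hypothetical.

```shell
# Flag each vmstat sample using the interpretation rules.
# $1 = CPU core count (defaults to 1)
vmstat_flags() {
  awk -v cores="${1:-1}" '
    $1 == "r" && $2 == "b" {                  # header row: remember column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $1 !~ /^[0-9]+$/ || !("wa" in col) { next }   # skip banner lines
    {
      r = $(col["r"]); b = $(col["b"]); wa = $(col["wa"])
      si = $(col["si"]); so = $(col["so"])
      msg = ""
      if (r + 0 > cores + 0)          msg = msg " cpu-contention"
      if (b + 0 > 0)                  msg = msg " blocked-io"
      if (si + 0 > 0 || so + 0 > 0)   msg = msg " swapping"
      if (msg == "") msg = " ok"
      print "r=" r, "b=" b, "wa=" wa msg
    }'
}
```

Typical use: `vmstat 1 5 | vmstat_flags "$(nproc)"`. The first sample from `vmstat` reports averages since boot, so judge by the later lines.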
6️⃣ Storage Diagnosis (iostat)
Critical Metrics
| Metric | Bad Threshold |
|---|---|
| %util | > 80% |
| await | > 20 ms |
| await >> svctm | Queueing issue |
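A sketch that applies the %util and await thresholds to `iostat -x` output, assuming the sysstat package. Column names differ between sysstat versions (some print `await`, others only `r_await` / `w_await`, and `svctm` may be absent), so the filter maps columns by header name and uses the largest await-style value it finds. `iostat_flags` is an illustrative name.

```shell
# Print devices with %util > 80 or await > 20 ms.
iostat_flags() {
  awk '
    NF == 0 { split("", col); next }          # blank line ends a report
    $1 ~ /^Device/ {                          # header row: map column names
      split("", col)
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    !("%util" in col) { next }                # ignore the avg-cpu section
    {
      util = $(col["%util"]) + 0
      lat = 0
      n = split("await r_await w_await", names, " ")
      for (k = 1; k <= n; k++)
        if ((names[k] in col) && $(col[names[k]]) + 0 > lat)
          lat = $(col[names[k]]) + 0
      if (util > 80 || lat > 20) print $1, "util=" util, "await=" lat
    }'
}
```

Typical use: `iostat -x 1 3 | iostat_flags`. As with `vmstat`, the first report covers the period since boot, so weigh the later reports more heavily.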
Conclusions
| Pattern | Root Cause |
|---|---|
| High await + D state | Storage latency |
| High util | Disk saturation |
| NFS disks slow | Network / mount issue |
🚨 DBWR / LGWR in D state = immediate escalation
7️⃣ Memory Focused Check
If:
- RSS very high
- Swap active
➡️ Tune memory / restart leaking service
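A minimal sketch of the two memory checks, assuming procps `free` and `ps`. `swap_used_mb` and `top_rss` are hypothetical helper names.

```shell
# Print swap currently in use (MB) from `free -m` output on stdin.
swap_used_mb() {
  awk '$1 == "Swap:" { print $3 }'
}

# List the largest processes by resident memory (header + 10 rows by default).
top_rss() {
  ps -eo pid,rss,%mem,etime,cmd --sort=-rss | head -n "${1:-11}"
}
```

Typical use: `free -m | swap_used_mb` followed by `top_rss`. Swap in use alone is not proof of pressure; pair it with non-zero si/so in `vmstat` before restarting anything.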
8️⃣ Oracle / Database‑Specific Quick Checks
8.1 Oracle Processes
8.2 Dangerous Signs
| Process | Issue |
|---|---|
| ora_dbw* in D | Datafile I/O |
| ora_lgwr in D | Redo disk |
| Many ora_w* in D | Parallel I/O stall |
➡️ Do NOT bounce DB blindly
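A sketch for sections 8.1 and 8.2 together, assuming procps `ps` and that Oracle background processes appear with names beginning `ora_`. The labels are copied from the Dangerous Signs table; `oracle_blocked` is an illustrative name.

```shell
# From `ps -eo pid,stat,wchan:32,cmd` output, list Oracle processes in D state
# and attach the likely cause from the Dangerous Signs table.
oracle_blocked() {
  awk 'NR > 1 && $2 ~ /^D/ && $4 ~ /^ora_/ {
    why = "other background process"
    if ($4 ~ /^ora_dbw/)  why = "datafile I/O"
    if ($4 ~ /^ora_lgwr/) why = "redo disk"
    if ($4 ~ /^ora_w/)    why = "parallel I/O stall"
    print $1, $4, why
  }'
}
```

Typical use: `ps -eo pid,stat,wchan:32,cmd | oracle_blocked`. To list every Oracle process regardless of state, `ps -eo pid,stat,etime,cmd | grep '[o]ra_'` is a common form.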
9️⃣ Decision Matrix (Very Important)
| Observation | Action |
|---|---|
| High CPU, no D | Tune app/SQL |
| High wa + D | Storage escalation |
| Z processes | Restart parent |
| Swap active | Add memory / reduce usage |
| Disk full | Cleanup immediately |
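The matrix can be expressed as a small function. This is a sketch only: the numeric cut-offs are assumptions borrowed from other sections of this runbook (us > 80%, wa > 15%, disk ≥ 90%), inputs must be whole numbers, and `decide` is a hypothetical name.

```shell
# Print the suggested action(s) for a set of observations.
# $1 = us%  $2 = wa%  $3 = D-state count  $4 = zombie count
# $5 = swap in+out rate  $6 = highest filesystem use%
decide() {
  if [ "$1" -gt 80 ] && [ "$3" -eq 0 ]; then echo "Tune app/SQL"; fi
  if [ "$2" -gt 15 ] && [ "$3" -gt 0 ]; then echo "Storage escalation"; fi
  if [ "$4" -gt 0 ];  then echo "Restart parent"; fi
  if [ "$5" -gt 0 ];  then echo "Add memory / reduce usage"; fi
  if [ "$6" -ge 90 ]; then echo "Cleanup immediately"; fi
  return 0
}
```

For example, `decide 10 23 2 0 0 50` describes low CPU, high I/O wait, and two blocked processes, which maps to storage escalation.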
🔔 Escalation Triggers
Escalate to Storage / Infra when:
- D state persists > 5 minutes
- await > 50 ms
- DB background processes blocked
Escalate to App/DB Team when:
- CPU us > 80%
- Single PID dominating CPU
- SQL identified as hot spot
✅ One‑Glance Incident Command Set
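One possible command set, gathered into a single log file. This is a sketch assuming a Linux host with procps and, optionally, sysstat; the command options, the default log path, and the names `capture` and `incident_snapshot` are all assumptions rather than a fixed standard.

```shell
# Append one command's output to a log, with a heading; never abort on failure.
# $1 = log file, remaining arguments = command to run
capture() {
  out=$1; shift
  printf '\n### %s\n' "$*" >> "$out"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >> "$out" 2>&1 || echo "(exit status $?)" >> "$out"
  else
    echo "(not installed)" >> "$out"
  fi
}

# Collect the whole triage sequence in the order used by this runbook.
incident_snapshot() {
  log="${1:-/tmp/incident_$(date +%Y%m%d_%H%M%S).log}"
  capture "$log" uptime
  capture "$log" df -h
  capture "$log" top -b -n 1
  capture "$log" ps -eo pid,ppid,user,stat,%cpu,%mem,etime,wchan:32,cmd --sort=-%cpu
  capture "$log" vmstat 1 5
  capture "$log" iostat -x 1 3
  capture "$log" free -m
  echo "Snapshot written to $log"
}
```

Run `incident_snapshot` once at the start of the incident and again after any change, so the two logs can be compared.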
🧠 Golden Rule (Remember This)
High load does NOT always mean a CPU problem.
D state + wa = storage until proven otherwise.