Monday, April 27, 2026

Production Server/Database/Application Troubleshooting Runbook for CPU, Memory, I/O, and Kernel Issues

 

0️⃣ Runbook Objectives

This runbook helps you:

✅ Quickly identify CPU, I/O, memory, or process issues
✅ Correlate OS metrics with database / application symptoms
✅ Avoid random commands during incidents
✅ Reach root cause, not just symptom relief


1️⃣ Incident Intake (ALWAYS FIRST)

Before touching the system, collect:

Question                                  | Why
What is impacted? (DB, app, batch, login) | Scope
Since when?                               | Time correlation
All users or subset?                      | Severity
Any recent changes?                       | Deploy / patch
Error messages?                           | Symptom confirmation

📌 Do not skip this step. It typically saves 30–40% of the troubleshooting time later.


2️⃣ High‑Level System Health Snapshot (30 seconds)

2.1 Uptime & Load

uptime

Focus on:

  • Load average vs CPU cores
  • Sudden spike timeframe

✅ Load average > CPU core count → investigate
✅ High load + low CPU usage → likely I/O wait
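
The load-versus-cores comparison can be scripted. This is a minimal sketch, assuming a Linux host where /proc/loadavg and nproc are available; load_status is a hypothetical helper name, not a standard tool.

```shell
# load_status LOAD CORES
# Prints "investigate" when the load average exceeds the core count, else "ok".
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (load + 0 > cores + 0) print "investigate"; else print "ok"
  }'
}

# Example on a live Linux host (commented out so the block has no side effects):
# read -r load1 _ < /proc/loadavg
# load_status "$load1" "$(nproc)"
```

The helper only encodes the first rule; a high load with low CPU usage still needs the I/O checks later in this runbook.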


2.2 Disk Space (quick sanity)

df -h

🚨 Any filesystem ≥ 90% → fix immediately
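
To list only the filesystems that cross the 90% line, the df output can be filtered. A rough sketch, assuming POSIX-format output (df -P) and mount points without spaces; full_filesystems is a hypothetical helper name.

```shell
# full_filesystems [THRESHOLD]
# Reads `df -P` output on stdin and prints mount points at or above
# THRESHOLD percent used (default 90, matching the rule above).
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)              # "93%" -> "93"
    if (pct + 0 >= limit + 0) print $6, pct "%"
  }'
}

# Usage on a live host:
# df -P | full_filesystems 90
```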


3️⃣ Real‑Time CPU & Memory View (top)

top

3.1 CPU Line Interpretation

%Cpu(s): 12 us, 5 sy, 0 ni, 60 id, 23 wa

Metric | Meaning
us     | App/SQL CPU
sy     | Kernel CPU
id     | Idle
wa     | I/O wait

🚨 wa > 15% → jump to I/O section
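
For scripted checks, the wa value can be pulled out of top's summary line. A small sketch, assuming the procps-style "%Cpu(s)" line shown above and a locale that uses a dot as the decimal separator; iowait_from_top is a hypothetical helper name.

```shell
# iowait_from_top
# Reads top's "%Cpu(s)" summary line on stdin and prints the wa value.
iowait_from_top() {
  awk '/Cpu\(s\)/ {
    gsub(/,/, " ")                    # "23 wa," -> "23 wa "
    for (i = 2; i <= NF; i++)
      if ($i == "wa") { print $(i - 1); exit }
  }'
}

# Usage on a live host (batch mode, one iteration):
# top -bn1 | iowait_from_top
```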


3.2 Process Area

  • Sort by CPU: Shift + P
  • Sort by MEM: Shift + M

Record for each suspect process:

  • PID
  • %CPU
  • %MEM
  • COMMAND

👉 Take PID(s) for next step.


4️⃣ Exact Process Diagnosis (ps)

✅ Master Command (use this by default)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

What to Look For

Symptom      | Indicator
High CPU     | %cpu, R state
Hung process | D state
Long running | High etime
I/O wait     | wchan = io_schedule
Zombie       | Z state

🔴 Critical: Find Blocked (I/O) Processes

ps -eo pid,stat,wchan,etime,comm | awk '$2 ~ /D/'

If the list includes database or application processes such as:

  • ora_*
  • java
  • mysqld

➡️ A storage / filesystem issue is almost certain
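
When many processes are blocked, a per-command count makes the pattern easier to read. A minimal sketch, assuming the five-column ps output used above and command names without spaces; blocked_by_command is a hypothetical helper name.

```shell
# blocked_by_command
# Reads `ps -eo pid,stat,wchan,etime,comm` output on stdin and prints
# "COUNT COMMAND" for processes whose state starts with D, busiest first.
blocked_by_command() {
  awk 'NR > 1 && $2 ~ /^D/ { n[$5]++ }
       END { for (c in n) print n[c], c }' | sort -k1,1nr -k2
}

# Usage on a live host:
# ps -eo pid,stat,wchan,etime,comm | blocked_by_command
```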


5️⃣ Kernel Stress View (vmstat)

vmstat 1 5

Key Columns

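Column | Meaning
r      | Runnable processes (running or waiting for CPU)
b      | Processes blocked in uninterruptible sleep (usually I/O)
si/so  | Memory swapped in / out per second
wa     | Percentage of CPU time spent waiting for I/O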

Interpretation Rules

Observation   | Meaning
b > 0         | Processes stuck in I/O
wa high       | Disk latency
si/so > 0     | Memory pressure
r > CPU cores | CPU contention
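
The interpretation rules can be applied to vmstat output automatically. A sketch under stated assumptions: columns are located by header name, the first sample is skipped because vmstat reports averages since boot there, and the wa > 15 cutoff is borrowed from the top section rather than from this table. vmstat_flags is a hypothetical helper name.

```shell
# vmstat_flags CORES
# Reads `vmstat 1 N` output on stdin and prints one line per sample
# that breaks a rule from the table above.
vmstat_flags() {
  awk -v cores="$1" '
    $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row
    !("r" in col) || $1 !~ /^[0-9]+$/ { next }                  # skip banner / non-data
    ++n == 1 { next }                                           # first sample = since-boot average
    {
      msg = ""
      if ($(col["r"]) + 0 > cores + 0) msg = msg " cpu-contention"
      if ($(col["b"]) + 0 > 0)         msg = msg " blocked-io"
      if ($(col["wa"]) + 0 > 15)       msg = msg " io-wait"
      if ($(col["si"]) + 0 > 0 || $(col["so"]) + 0 > 0) msg = msg " swapping"
      if (msg != "") print "sample " n ":" msg
    }
  '
}

# Usage on a live Linux host:
# vmstat 1 5 | vmstat_flags "$(nproc)"
```

A single flagged sample is weak evidence; the same flag repeating across samples is what matters.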

6️⃣ Storage Diagnosis (iostat)

iostat -xz 1 5

Critical Metrics

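Metric         | Bad Threshold
%util          | > 80%
await          | > 20 ms
await >> svctm | Queueing issue

Note: recent sysstat releases split await into r_await / w_await and may not print svctm; apply the same thresholds to whichever latency columns are shown.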

Conclusions

Pattern              | Root Cause
High await + D state | Storage latency
High util            | Disk saturation
NFS disks slow       | Network / mount issue

🚨 DBWR / LGWR in D state = immediate escalation
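
The %util and await thresholds can be checked against iostat output with a filter. A sketch that assumes the extended report layout (a header row starting with "Device", blank line after each table) and looks columns up by name, since column sets differ between sysstat versions; iostat_flags is a hypothetical helper name. The first report covers the period since boot, so later reports are more telling.

```shell
# iostat_flags
# Reads `iostat -x` output on stdin and prints one line per device row that
# breaches the thresholds above: %util > 80 or await > 20 ms.
iostat_flags() {
  awk '
    # Return the named column as a number, or 0 if this sysstat lacks it.
    function val(name) { return (name in col) ? $(col[name]) + 0 : 0 }
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; in_dev = 1; next }
    NF == 0        { in_dev = 0; next }      # blank line ends a device table
    in_dev {
      util = val("%util")
      lat = val("await")
      if (val("r_await") > lat) lat = val("r_await")
      if (val("w_await") > lat) lat = val("w_await")
      if (util > 80) print $1, "saturated", "%util=" util
      if (lat > 20)  print $1, "slow", "await=" lat "ms"
    }
  '
}

# Usage on a live host:
# iostat -xz 1 5 | iostat_flags
```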


7️⃣ Memory Focused Check

ps -eo pid,stat,rss,vsz,%mem,comm --sort=-rss | head

If:

  • RSS very high
  • Swap active

➡️ Tune memory / restart leaking service
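
For services that run as many processes, totalling RSS per command name can show the real consumer. A rough sketch, assuming ps reports RSS in KiB and command names contain no spaces; rss_by_command is a hypothetical helper name. Shared memory is counted once per process, so totals for databases with large shared segments will be overstated.

```shell
# rss_by_command
# Reads `ps -eo rss,comm` output on stdin (RSS in KiB) and prints
# "MiB COMMAND" totals per command name, largest first.
rss_by_command() {
  awk 'NR > 1 { kib[$2] += $1 }
       END { for (c in kib) printf "%d %s\n", kib[c] / 1024, c }' | sort -k1,1nr -k2
}

# Usage on a live host:
# ps -eo rss,comm | rss_by_command | head
```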


8️⃣ Oracle / Database‑Specific Quick Checks

8.1 Oracle Processes

ps -eo pid,stat,%cpu,wchan,etime,comm | grep ora_

8.2 Dangerous Signs

Process          | Issue
ora_dbw* in D    | Datafile I/O
ora_lgwr in D    | Redo disk
Many ora_w* in D | Parallel I/O stall

➡️ Do NOT bounce DB blindly


9️⃣ Decision Matrix (Very Important)

Observation    | Action
High CPU, no D | Tune app/SQL
High wa + D    | Storage escalation
Z processes    | Restart parent
Swap active    | Add memory / reduce usage
Disk full      | Clean up immediately
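
The matrix can be encoded as a small lookup for use in scripts. This is only a sketch: triage_action is a hypothetical helper, and the numeric cutoffs (wa > 15, us > 80, disk ≥ 90) are assumptions borrowed from other sections of this runbook, since the matrix itself gives no numbers.

```shell
# triage_action US WA D_COUNT Z_COUNT SWAP_ACTIVE DISK_PCT
# Prints every matching row of the decision matrix, one action per line.
triage_action() {
  awk -v us="$1" -v wa="$2" -v d="$3" -v z="$4" -v swap="$5" -v disk="$6" 'BEGIN {
    if (disk + 0 >= 90)            print "Disk full: clean up immediately"
    if (wa + 0 > 15 && d + 0 > 0)  print "High wa + D state: escalate to storage"
    if (us + 0 > 80 && d + 0 == 0) print "High CPU, no D state: tune app/SQL"
    if (z + 0 > 0)                 print "Zombies present: restart parent"
    if (swap + 0 > 0)              print "Swap active: add memory / reduce usage"
  }'
}

# Example: 10% user CPU, 30% wa, 4 D-state processes, no zombies, no swap, 50% disk
# triage_action 10 30 4 0 0 50
```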

🔔 Escalation Triggers

Escalate to Storage / Infra when:

  • D state persists > 5 minutes
  • await > 50 ms
  • DB background processes blocked
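
The etime column shows process age, not time spent in D state, so persistence has to be checked by sampling. A minimal sketch that compares two ps snapshots taken some minutes apart; persistent_d_state and the snapshot file paths are hypothetical. A process could leave and re-enter D state between samples, so treat a match as a strong hint rather than proof.

```shell
# persistent_d_state SNAP1 SNAP2
# Each file holds `ps -eo pid,stat,comm` output taken at different times.
# Prints "PID COMMAND" for processes in D state in both snapshots.
persistent_d_state() {
  awk 'NR == FNR { if ($2 ~ /^D/) seen[$1] = 1; next }
       $2 ~ /^D/ && ($1 in seen) { print $1, $3 }' "$1" "$2"
}

# Usage on a live host (5 minutes between samples):
# ps -eo pid,stat,comm > /tmp/snap1; sleep 300; ps -eo pid,stat,comm > /tmp/snap2
# persistent_d_state /tmp/snap1 /tmp/snap2
```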

Escalate to App/DB Team when:

  • CPU us > 80%
  • Single PID dominating CPU
  • SQL identified as hot spot

✅ One‑Glance Incident Command Set

uptime
top
ps -eo pid,stat,%cpu,%mem,wchan,comm --sort=-%cpu
vmstat 1 5
iostat -xz 1 5

🧠 Golden Rule (Remember This)

High load does NOT always mean CPU problem.
D state + wa = storage until proven otherwise.
