ORACLE DATABASE PROBLEM AND SOLUTIONS: Production Server/Database/Application Troubleshooting Runbook for CPU, Memory, I/O, and Kernel Issues

Monday, April 27, 2026

0️⃣ Runbook Objectives

This runbook helps you:

✅ Quickly identify CPU, I/O, memory, or process issues
✅ Correlate OS metrics with database / application symptoms
✅ Avoid random commands during incidents
✅ Reach root cause, not just symptom relief

1️⃣ Incident Intake (ALWAYS FIRST)

Before touching the system, collect:

Question	Why
What is impacted? (DB, app, batch, login)	Scope
Since when?	Time correlation
All users or subset?	Severity
Any recent changes?	Deploy / patch
Error messages?	Symptom confirmation

📌 Do not skip this step. It can save 30–40% of the diagnosis time later.

2️⃣ High‑Level System Health Snapshot (30 seconds)

2.1 Uptime & Load

uptime

Focus on:

Load average vs CPU cores
Sudden spike timeframe

✅ Load > CPU count → investigate
✅ High load + low CPU usage → likely I/O wait (on Linux, load also counts processes blocked in uninterruptible sleep)
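
The load-versus-cores comparison can be scripted so nobody does the arithmetic by eye during an incident. A minimal sketch, assuming a Linux host with /proc/loadavg and nproc; `load_status` is a hypothetical helper name, not a standard tool:

```shell
# load_status LOAD CORES
# Prints INVESTIGATE when the load average exceeds the core count, else OK.
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (load + 0 > cores + 0) print "INVESTIGATE"; else print "OK"
  }'
}

# Usage on a live host (1-minute load average vs. online cores):
#   load_status "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

The comparison is done in awk because load averages are fractional and plain shell arithmetic only handles integers.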

2.2 Disk Space (quick sanity)

df -h

🚨 Any filesystem ≥ 90% → fix immediately
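
To list only the filesystems that cross the threshold, the `df` output can be filtered. A minimal sketch, assuming POSIX-format output (`df -P`) and mount points without spaces; `full_filesystems` is a hypothetical helper name:

```shell
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints "<mount> <use%>" for every
# filesystem at or above THRESHOLD percent.
full_filesystems() {
  awk -v max="$1" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)                 # "95%" -> "95"
    if (pct + 0 >= max + 0) print $6, $5
  }'
}

# Usage: df -P | full_filesystems 90
```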

3️⃣ Real‑Time CPU & Memory View (`top`)

top

3.1 CPU Line Interpretation

%Cpu(s): 12 us, 5 sy, 0 ni, 60 id, 23 wa

Metric	Meaning
`us`	User-space CPU (application / SQL)
`sy`	Kernel (system) CPU
`ni`	CPU spent on niced (reprioritized) user processes
`id`	Idle
`wa`	Waiting on I/O

🚨 wa > 15% → jump to I/O section
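
The `wa` value can be pulled out of batch-mode `top` for scripting or logging. A minimal sketch, assuming the procps `top` line format shown above and a C locale (decimal points, not commas); `cpu_wa` is a hypothetical helper name:

```shell
# cpu_wa: reads `top -bn1` output on stdin and prints the I/O wait
# percentage from the first %Cpu(s) line.
cpu_wa() {
  awk '/^%Cpu/ {
    gsub(/,/, " ")                    # top may glue fields: "0.0 ni,100.0 id"
    for (i = 2; i <= NF; i++)
      if ($i == "wa") { print $(i - 1); exit }
  }'
}

# Usage: LC_ALL=C top -bn1 | cpu_wa
```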

3.2 Process Area

Sort by CPU: Shift + P
Sort by MEM: Shift + M

Note:

PID
%CPU
%MEM
COMMAND

👉 Take PID(s) for next step.

4️⃣ Exact Process Diagnosis (`ps`)

✅ Master Command (use this by default)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

What to Look For

Symptom	Indicator
High CPU	`%cpu`, `R` state
Hung process	`D` state
Long running	High `etime`
I/O wait	`wchan` = io_schedule
Zombie	`Z`

🔴 Critical: Find Blocked (I/O) Processes

ps -eo pid,stat,wchan,etime,comm | awk '$2 ~ /^D/'

If you see:

ora_*
java
mysqld

➡️ Storage / filesystem issue is very likely if the D state persists (a brief D state during normal I/O is expected)
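
A single snapshot also catches processes that are only momentarily in D state. Sampling several times separates those from truly stuck ones. A minimal sketch, assuming procps `ps`; `persistent_d` is a hypothetical helper name and the sample count and interval are arbitrary:

```shell
# persistent_d MIN
# Reads repeated `ps -eo pid=,stat=,comm=` samples on stdin and prints
# "<pid> <comm> <count>" for processes seen in D state at least MIN times.
persistent_d() {
  awk -v min="$1" '
    $2 ~ /^D/ { seen[$1 " " $3]++ }
    END { for (k in seen) if (seen[k] >= min + 0) print k, seen[k] }
  ' | sort -n
}

# Usage (5 samples, 2 seconds apart; report PIDs blocked in all 5):
#   for i in 1 2 3 4 5; do ps -eo pid=,stat=,comm=; sleep 2; done | persistent_d 5
```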

5️⃣ Kernel Stress View (`vmstat`)

vmstat 1 5

Key Columns

Column	Meaning
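`r`	Runnable processes (running or waiting for CPU)
`b`	Processes in uninterruptible sleep (usually blocked on I/O)
`si/so`	Swap-in / swap-out rate (activity, not total swap used)
`wa`	% of CPU time spent waiting on I/O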

Interpretation Rules

Observation	Meaning
`b > 0` (sustained)	Processes stuck in I/O
`wa high`	Disk latency
`si/so > 0` (sustained)	Memory pressure
`r > CPU cores`	CPU contention
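
The rules above can be applied to each `vmstat` sample automatically. A minimal sketch; `vmstat_flags` and its flag names are hypothetical, and the 15% `wa` cutoff is borrowed from the `top` section as an assumption. Columns are located by header name because newer procps versions add columns:

```shell
# vmstat_flags CORES
# Reads `vmstat 1 N` output on stdin and prints "<sample#> <flags...>" for
# each sample that trips a rule. The first sample is skipped because it
# reports averages since boot.
vmstat_flags() {
  awk -v cores="$1" '
    $1 == "r" && $2 == "b" {                 # header: map names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $1 ~ /^[0-9]+$/ && ("wa" in col) {
      n++
      if (n == 1) next
      flags = ""
      if ($col["r"] + 0 > cores + 0)                flags = flags " CPU_CONTENTION"
      if ($col["b"] + 0 > 0)                        flags = flags " BLOCKED_IO"
      if ($col["si"] + 0 > 0 || $col["so"] + 0 > 0) flags = flags " SWAPPING"
      if ($col["wa"] + 0 > 15)                      flags = flags " IO_WAIT"
      if (flags != "") print n flags
    }'
}

# Usage: vmstat 1 5 | vmstat_flags "$(nproc)"
```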

6️⃣ Storage Diagnosis (`iostat`)

iostat -xz 1 5

Critical Metrics

Metric	Bad Threshold
`%util`	> 80%
`await` (`r_await` / `w_await` on newer sysstat)	> 20 ms
`await >> svctm` (or a high `aqu-sz` where `svctm` is no longer reported)	Queueing issue
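
Column names in `iostat -x` differ between sysstat versions, so a filter is safer when it finds columns by header name. A minimal sketch under that assumption; `iostat_flags` is a hypothetical helper name, the thresholds are the ones from the table, and a C locale is assumed:

```shell
# iostat_flags: reads `iostat -xz` output on stdin and prints
# "<device> util=<%util> await=<worst await>" for devices over threshold.
iostat_flags() {
  awk -v util_max=80 -v await_max=20 '
    NF == 0 { in_dev = 0; next }             # blank line ends a device table
    $1 ~ /^Device/ {                         # header: map names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      in_dev = 1
      next
    }
    in_dev {
      util = $col["%util"] + 0
      aw = 0                                 # worst of await / r_await / w_await
      if (("await" in col)   && ($col["await"] + 0 > aw))   aw = $col["await"] + 0
      if (("r_await" in col) && ($col["r_await"] + 0 > aw)) aw = $col["r_await"] + 0
      if (("w_await" in col) && ($col["w_await"] + 0 > aw)) aw = $col["w_await"] + 0
      if (util > util_max || aw > await_max) print $1, "util=" util, "await=" aw
    }'
}

# Usage: LC_ALL=C iostat -xz 1 5 | iostat_flags
```

Note that the first report `iostat` prints covers the period since boot, so hits from that report reflect long-term averages rather than the current incident.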

Conclusions

Pattern	Root Cause
High await + D state	Storage latency
High util	Disk saturation
NFS disks slow	Network / mount issue

🚨 DBWR / LGWR persistently in D state = immediate escalation

7️⃣ Memory Focused Check

ps -eo pid,stat,rss,vsz,%mem,comm --sort=-rss | head

If:

RSS very high
Swap active

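➡️ Tune memory / restart the leaking service. For Oracle, per-process RSS can include shared SGA pages, so do not sum RSS across ora_* processes.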
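
"Swap active" can be quantified from /proc/meminfo. A minimal sketch, assuming the Linux `SwapTotal` / `SwapFree` fields; `swap_used_kb` is a hypothetical helper name:

```shell
# swap_used_kb: reads /proc/meminfo-style text on stdin and prints the
# amount of swap currently in use, in kB.
swap_used_kb() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END           { print total - free }'
}

# Usage: swap_used_kb < /proc/meminfo
```

A non-zero result only shows that pages were swapped out at some point. Ongoing swapping is what the `si/so` columns of `vmstat` show.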

8️⃣ Oracle / Database‑Specific Quick Checks

8.1 Oracle Processes

ps -eo pid,stat,%cpu,wchan,etime,comm | grep ora_

8.2 Dangerous Signs

Process	Issue
ora_dbw* in D	Datafile I/O
ora_lgwr in D	Redo disk
Many ora_p* in D	Parallel query I/O stall

➡️ Do NOT bounce the DB blindly: processes in D state ignore signals until their I/O returns, so a restart can hang as well
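
The DBWn and LGWR rows of the table can be turned into a small filter that labels blocked Oracle background processes. A minimal sketch, assuming the usual ora_<name>_<SID> process naming; `ora_d_report` is a hypothetical helper name and any other ora_* process is given a generic label:

```shell
# ora_d_report: reads `ps -eo pid=,stat=,comm=` output on stdin and prints
# "<pid> <comm> <likely issue>" for Oracle processes in D state.
ora_d_report() {
  awk '$2 ~ /^D/ && $3 ~ /^ora_/ {
    issue = "other Oracle process blocked"
    if ($3 ~ /^ora_dbw/)  issue = "datafile I/O"
    if ($3 ~ /^ora_lgwr/) issue = "redo disk"
    print $1, $3, issue
  }'
}

# Usage: ps -eo pid=,stat=,comm= | ora_d_report
```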

9️⃣ Decision Matrix (Very Important)

Observation	Action
High CPU, no D	Tune app/SQL
High wa + D	Storage escalation
Z processes	Fix or restart the parent (zombies cannot be killed directly)
Swap active	Add memory / reduce usage
Disk full	Cleanup immediately
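
For the zombie row, the process to act on is the parent, not the zombie itself. A minimal sketch that groups zombies by parent PID, assuming procps `ps`; `zombie_parents` is a hypothetical helper name:

```shell
# zombie_parents: reads `ps -eo pid=,ppid=,stat=,comm=` output on stdin and
# prints "<ppid> <zombie count>" for every parent with unreaped children.
zombie_parents() {
  awk '$3 ~ /^Z/ { z[$2]++ }
       END       { for (p in z) print p, z[p] }' | sort -n
}

# Usage: ps -eo pid=,ppid=,stat=,comm= | zombie_parents
```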

🔔 Escalation Triggers

Escalate to Storage / Infra when:

D state persists > 5 minutes
await > 50 ms
DB background processes blocked

Escalate to App/DB Team when:

CPU us > 80%
Single PID dominating CPU
SQL identified as hot spot

✅ One‑Glance Incident Command Set