Monday, April 27, 2026

Production Server/Database/Application Troubleshooting Runbook for Issues like CPU, Memory, I/O, Kernel

 

0️⃣ Runbook Objectives

This runbook helps you:

✅ Quickly identify CPU, I/O, memory, or process issues
✅ Correlate OS metrics with database / application symptoms
✅ Avoid random commands during incidents
✅ Reach root cause, not just symptom relief


1️⃣ Incident Intake (ALWAYS FIRST)

Before touching the system, collect:

| Question | Why |
|---|---|
| What is impacted? (DB, app, batch, login) | Scope |
| Since when? | Time correlation |
| All users or subset? | Severity |
| Any recent changes? | Deploy / patch |
| Error messages? | Symptom confirmation |

📌 Do not skip this step. It saves 30–40% time later.


2️⃣ High‑Level System Health Snapshot (30 seconds)

2.1 Uptime & Load

uptime

Focus on:

  • Load average vs CPU cores
  • Sudden spike timeframe

✅ Load > CPU core count → investigate
✅ High load + low CPU usage → likely I/O wait
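
The load-versus-cores rule above can be sketched as a small helper. This is a minimal illustration, not part of the original runbook: the function name `classify_load` is hypothetical, and it only encodes the "load > CPU count" rule.

```shell
# classify_load LOAD CORES
# Prints "investigate" when the load average exceeds the core count,
# otherwise "ok". Only the simple rule of thumb from the text is applied.
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "investigate"; else print "ok"
    }'
}

# Usage on a live Linux host (first field of /proc/loadavg is the 1-minute load):
#   classify_load "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A result of "investigate" says nothing about the cause yet; the CPU and I/O wait checks in the following sections decide that.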


2.2 Disk Space (quick sanity)

df -h

🚨 Any filesystem ≥ 90% → fix immediately
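
To avoid scanning df output by eye, the 90% rule can be applied with a short filter. This is a sketch under assumptions: the helper name `full_filesystems` is made up for illustration, it expects the POSIX layout produced by `df -P`, and mount points containing spaces are not handled.

```shell
# full_filesystems [THRESHOLD]
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem at or above THRESHOLD percent (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)            # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P | full_filesystems 90
```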


3️⃣ Real‑Time CPU & Memory View (top)

top

3.1 CPU Line Interpretation

%Cpu(s): 12 us, 5 sy, 0 ni, 60 id, 23 wa

| Metric | Meaning |
|---|---|
| us | App/SQL CPU |
| sy | Kernel CPU |
| id | Idle |
| wa | I/O wait |

🚨 wa > 15% → jump to I/O section
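
For scripted checks, a single field can be pulled out of the top header line. The following is a rough sketch: `cpu_field` is a hypothetical helper, and it assumes the "value label, value label" layout shown above (some top versions print the first value without a leading space, which this does not handle).

```shell
# cpu_field NAME
# Reads top header output on stdin and prints the value that precedes
# the label NAME (us, sy, ni, id, wa, ...) on the %Cpu(s) line.
cpu_field() {
    awk -v want="$1" '/^%Cpu/ {
        gsub(/,/, " ")               # turn "12 us, 5 sy" into "12 us  5 sy"
        for (i = 2; i < NF; i++)
            if ($(i + 1) == want) print $i
    }'
}

# Usage (batch mode, one iteration):
#   top -bn1 | cpu_field wa
```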


3.2 Process Area

  • Sort by CPU: Shift + P
  • Sort by MEM: Shift + M

Note:

  • PID
  • %CPU
  • %MEM
  • COMMAND

👉 Take PID(s) for next step.


4️⃣ Exact Process Diagnosis (ps)

✅ Master Command (use this by default)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

What to Look For

| Symptom | Indicator |
|---|---|
| High CPU | %cpu, R state |
| Hung process | D state |
| Long running | High etime |
| I/O wait | wchan = io_schedule |
| Zombie | Z |

🔴 Critical: Find Blocked (I/O) Processes

ps -eo pid,stat,wchan,etime,comm | awk '$2 ~ /D/'

If you see:

  • ora_*
  • java
  • mysqld

➡️ Storage / filesystem issue almost certain
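
When many processes are blocked, a per-command count shows at a glance whether database processes dominate. A minimal sketch, assuming the column order of the command above; the name `blocked_by_command` is hypothetical, and command names containing spaces are reduced to their last word.

```shell
# blocked_by_command
# Reads `ps -eo pid,stat,wchan,etime,comm` output on stdin and prints
# "count command" for processes whose STAT starts with D, busiest first.
blocked_by_command() {
    awk 'NR > 1 && $2 ~ /^D/ { n[$NF]++ }
         END { for (c in n) print n[c], c }' | sort -k1,1nr -k2,2
}

# Usage:
#   ps -eo pid,stat,wchan,etime,comm | blocked_by_command
```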


5️⃣ Kernel Stress View (vmstat)

vmstat 1 5

Key Columns

| Column | Meaning |
|---|---|
| r | Runnable processes |
| b | Blocked (I/O wait) |
| si/so | Swap-in / swap-out activity |
| wa | I/O wait |

Interpretation Rules

| Observation | Meaning |
|---|---|
| b > 0 | Processes stuck in I/O |
| wa high | Disk latency |
| si/so > 0 | Memory pressure |
| r > CPU cores | CPU contention |
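
The interpretation rules above can be applied mechanically to vmstat output. This is a sketch, not a definitive implementation: `vmstat_flags` and the flag names are invented for illustration, the 15% wa threshold is borrowed from the top section, and the first data row is skipped because vmstat reports averages since boot there.

```shell
# vmstat_flags CORES
# Reads `vmstat 1 N` output on stdin and prints one flag per rule that fired.
# Column positions are looked up from the header row by name.
vmstat_flags() {
    awk -v cores="$1" '
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }   # skip banner / non-data lines
        ++n == 1 { next }                            # since-boot averages
        $(col["b"]) > 0                   { blocked = 1 }
        $(col["si"]) + $(col["so"]) > 0   { swap = 1 }
        $(col["r"]) > cores               { cpu = 1 }
        $(col["wa"]) > 15                 { iowait = 1 }
        END {
            if (blocked) print "blocked-on-io"
            if (iowait)  print "high-iowait"
            if (swap)    print "memory-pressure"
            if (cpu)     print "cpu-contention"
        }'
}

# Usage:
#   vmstat 1 5 | vmstat_flags "$(nproc)"
```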

6️⃣ Storage Diagnosis (iostat)

iostat -xz 1 5

Critical Metrics

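| Metric | Bad Threshold |
|---|---|
| %util | > 80% |
| await | > 20 ms |
| await >> svctm | Queueing issue |

Note: newer sysstat releases split await into r_await / w_await and have deprecated or dropped svctm. Where those columns are missing, apply the latency threshold to r_await and w_await and use aqu-sz to judge queueing.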

Conclusions

| Pattern | Root Cause |
|---|---|
| High await + D state | Storage latency |
| High util | Disk saturation |
| NFS disks slow | Network / mount issue |

🚨 DBWR / LGWR in D state = immediate escalation


7️⃣ Memory Focused Check

ps -eo pid,stat,rss,vsz,%mem,comm --sort=-rss | head

If:

  • RSS very high
  • Swap active

➡️ Tune memory / restart leaking service


8️⃣ Oracle / Database‑Specific Quick Checks

8.1 Oracle Processes

ps -eo pid,stat,%cpu,wchan,etime,comm | grep ora_

8.2 Dangerous Signs

| Process | Issue |
|---|---|
| ora_dbw* in D | Datafile I/O |
| ora_lgwr in D | Redo disk |
| Many ora_p* / ora_w* in D | Parallel or space-management worker I/O stall |

➡️ Do NOT bounce DB blindly


9️⃣ Decision Matrix (Very Important)

| Observation | Action |
|---|---|
| High CPU, no D | Tune app/SQL |
| High wa + D | Storage escalation |
| Z processes | Restart parent |
| Swap active | Add memory / reduce usage |
| Disk full | Cleanup immediately |

🔔 Escalation Triggers

Escalate to Storage / Infra when:

  • D state persists > 5 minutes
  • await > 50 ms
  • DB background processes blocked
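
Whether a D state "persists" can be checked by comparing two ps snapshots taken some minutes apart. The sketch below is illustrative only: `persistent_blocked` and the /tmp file names are assumptions, and a PID that was reused between snapshots would be reported incorrectly.

```shell
# persistent_blocked FILE_OLD FILE_NEW
# Each file holds `ps -eo pid,stat,comm` output. Prints "pid command" for
# every PID that is in D state in both snapshots.
persistent_blocked() {
    awk 'FNR > 1 && $2 ~ /^D/ {
             if (FILENAME == ARGV[1]) seen[$1] = 1       # first snapshot
             else if ($1 in seen) print $1, $3           # still blocked later
         }' "$1" "$2"
}

# Usage (5-minute gap, matching the escalation trigger above):
#   ps -eo pid,stat,comm > /tmp/ps.1; sleep 300; ps -eo pid,stat,comm > /tmp/ps.2
#   persistent_blocked /tmp/ps.1 /tmp/ps.2
```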

Escalate to App/DB Team when:

  • CPU us > 80%
  • Single PID dominating CPU
  • SQL identified as hot spot

✅ One‑Glance Incident Command Set

uptime
top
ps -eo pid,stat,%cpu,%mem,wchan,comm --sort=-%cpu
vmstat 1 5
iostat -xz 1 5

🧠 Golden Rule (Remember This)

High load does NOT always mean CPU problem.
D state + wa = storage until proven otherwise.

Step-by-step troubleshooting of performance issues: Linux, Oracle - CPU, Memory, I/O

 

🔍 Correlation of ps, top, iostat, and vmstat

Mental Model (Very Important)

| Tool | Answers the Question |
|---|---|
| ps | Which exact process is responsible? |
| top | Is the problem CPU or memory pressure right now? |
| iostat | Is storage slow or saturated? |
| vmstat | Is the kernel under memory / run‑queue / I/O stress? |

👉 Never use only one tool.
Real root cause comes from correlating outputs.


1️⃣ top — Real‑Time CPU & Memory Pressure

Best usage

top

(or top -o %CPU on newer systems)

What to focus on (top header)

%Cpu(s): 85.2 us, 10.1 sy,  0.0 ni,  2.0 id,  2.5 wa

| Field | Meaning |
|---|---|
| us | User CPU (app / DB code) |
| sy | Kernel CPU |
| id | Idle CPU |
| wa | I/O wait (very important) |

Interpretation

  • ✅ High us → application or SQL CPU
  • ✅ High sy → kernel, system calls, networking
  • 🚨 High wa → storage problem, not CPU

Process section (lower part of the top screen)

PID   USER  %CPU  %MEM  COMMAND
2456  oracle 180.2 12.3 ora_dbw0

Now you jump to ps.


2️⃣ ps — Identify the Exact Culprit

Correlate with:

ps -eo pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu | head

Key correlation

| top shows | ps confirms |
|---|---|
| High CPU PID | %cpu, etime |
| Hung process | STAT = D |
| Storage wait | wchan = io_schedule |

🚨 Example:

2456  D  io_schedule  ora_dbw0

👉 This tells you:
DBWR is blocked on disk I/O

Now you must check storage.


3️⃣ iostat — Storage Bottleneck Detection

Best command

iostat -xz 1 5

Critical columns

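| Column | Meaning |
|---|---|
| %util | Disk busy time |
| await | Avg I/O latency (ms); shown as r_await / w_await on newer sysstat |
| svctm | Disk service time (deprecated, absent on newer sysstat) |
| r/s, w/s | Read / write requests per second |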

Interpretation Rules (Golden)

| Symptom | Meaning |
|---|---|
| %util > 80% | Disk saturated |
| await > 20 ms | Storage slow |
| await >> svctm | Queueing problem |
| High writes + DBWR stuck | Redo / data disk issue |

🚨 Example:

sda  %util=99.8  await=120ms

✅ Confirms ps + top → storage root cause


4️⃣ vmstat — Kernel Stress & Memory I/O

Best command

vmstat 1 5

Key columns

r  b   swpd   free   buff  cache  si so   bi bo   in cs us sy id wa

Important fields explained

| Column | Meaning |
|---|---|
| r | Run queue (CPU demand) |
| b | Blocked processes (I/O) |
| si/so | Swap in/out |
| bi/bo | Block I/O |
| wa | I/O wait (kernel view) |

Correlation logic

| vmstat shows | Combined meaning |
|---|---|
| b > 0 | Processes stuck in I/O |
| wa high | CPU waiting for disk |
| r > CPU cores | CPU contention |
| si/so > 0 | Memory pressure |

🚨 Example:

r=1 b=6 wa=40

👉 Matches:

  • ps → many D
  • top → high I/O wait
  • iostat → high disk latency

🎯 Root cause confirmed: storage


5️⃣ End‑to‑End Correlation Scenarios


✅ Scenario A: High Load Average

Observations

  • uptime → load = 20
  • top → CPU idle
  • vmstat → b=10, wa=35
  • ps → many D state
  • iostat → high await

Conclusion
Load is from I/O wait, not CPU
👉 Storage team issue


✅ Scenario B: CPU Spike

Observations

  • top → %us=90
  • vmstat → r > CPU cores
  • ps → process in R state
  • iostat → normal

Conclusion
Pure CPU problem
👉 Tune SQL / app / threads


✅ Scenario C: Hung Oracle Instance

Observations

  • ps → ora_dbw0, ora_lgwr in D
  • vmstat → b > 5
  • iostat → redo disk latency
  • top → high wa

Conclusion
Redo or data disk I/O stall
👉 SAN / ASM / NFS issue


6️⃣ Golden Troubleshooting Workflow (Memorize This)


symptom →
top →
ps →
vmstat →
iostat →
root cause

One‑liner sequence

top
ps -eo pid,stat,%cpu,wchan,comm | awk '$2 ~ /^D/'
vmstat 1 5
iostat -xz 1 5


✅ Final Cheat Sheet

| Tool | Best for |
|---|---|
| top | Live CPU/memory |
| ps | Exact process & state |
| vmstat | Kernel & wait queues |
| iostat | Disk latency & saturation |

🎯 Never trust a single tool
Real diagnosis = correlation

Troubleshooting CPU - I/O: Best Linux ps Command Arguments for Troubleshooting - One‑Command “Master View” (Highly Recommended)

 

1️⃣ One‑Command “Master View” (Highly Recommended)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,lstart,wchan,comm --sort=-%cpu

🔍 What it shows (and why it matters)

| Field | Why it’s important |
|---|---|
| user | Who owns the process |
| pid | Process ID |
| ppid | Parent process (helps detect orphans) |
| stat | Process state (R/S/D/Z/T) |
| %cpu | CPU consumption |
| %mem | Memory usage |
| etime | How long the process has been running |
| lstart | Exact start time |
| wchan | Kernel wait channel (I/O diagnosis) |
| comm | Executable name |
| --sort=-%cpu | Top CPU consumers first |

This is your best single snapshot for general troubleshooting


2️⃣ CPU Troubleshooting (High CPU / Run Queues)

ps -eo pid,ppid,stat,psr,pri,ni,%cpu,time,comm --sort=-%cpu | head -20

Key columns

| Column | Meaning |
|---|---|
| psr | Which CPU core it’s running on |
| pri | Kernel priority |
| ni | Nice value |
| time | Total CPU time consumed |

✅ Use when:

  • Load average is high
  • CPU is saturated
  • Performance complaints

3️⃣ I/O Troubleshooting (MOST CRITICAL)

🔥 Identify blocked processes (D state)

ps -eo pid,stat,wchan,%cpu,etime,comm | awk '$2 ~ /D/'

Why this is powerful

| Field | Purpose |
|---|---|
| D | Uninterruptible sleep (I/O wait) |
| wchan | What kernel function it’s stuck on |
| etime | How long the process has been running (upper bound on how long it has been blocked) |

Common wchan values and meaning

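| wchan | Meaning |
|---|---|
| io_schedule | Disk I/O wait |
| wait_on_page_bit | Memory/disk interaction |
| nfs_wait* (for example nfs_wait_bit_killable) | NFS hang |
| blk_mq_get_tag | Storage queue congestion |

Exact symbol names vary by kernel version, so treat these as patterns rather than exact matches.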

🚨 If Oracle or DB processes appear here → storage issue almost guaranteed


4️⃣ Memory & Leak Detection

ps -eo pid,ppid,stat,rss,vsz,%mem,comm --sort=-rss | head -20

Key fields

| Field | Meaning |
|---|---|
| rss | Resident (physical) memory in KiB |
| vsz | Virtual memory size in KiB |
| %mem | Share of physical RAM in use |

✅ Use when:

  • System is swapping
  • OOM killer events
  • Slow performance despite low CPU

5️⃣ Full Command, Arguments & Environment

ps -eo pid,stat,%cpu,%mem,cmd --sort=-%cpu

Why this matters:

  • cmd shows complete arguments
  • Crucial for:
    • Java tuning
    • Oracle startup flags
    • Application misconfiguration

6️⃣ Zombie Process Detection

ps -eo pid,ppid,stat,etime,comm | awk '$3 ~ /Z/'

Why care?

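  • Zombies indicate a parent process that is not reaping its children (a bug or a hang)
  • Each zombie holds a process-table entry, so large numbers can exhaust PID space
  • A zombie cannot be killed directly; fix or restart the parent instead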
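
Since the fix sits with the parent, it helps to group zombies by PPID. A minimal sketch, assuming the column order of the command above; `zombie_parents` is a hypothetical name.

```shell
# zombie_parents
# Reads `ps -eo pid,ppid,stat,etime,comm` output on stdin and prints
# "zombie_count parent_pid", parents with the most zombies first.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { n[$2]++ }
         END { for (p in n) print n[p], p }' | sort -k1,1nr -k2,2n
}

# Usage:
#   ps -eo pid,ppid,stat,etime,comm | zombie_parents
# Then inspect a reported parent, e.g.: ps -o pid,stat,etime,cmd -p <ppid>
```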

7️⃣ Oracle / Database‑Focused View (DBA Favorite)

ps -eo pid,stat,%cpu,%mem,etime,wchan,comm | grep ora_

✅ Detects:

  • DBWR / LGWR I/O stalls
  • Parallel worker hangs
  • Backup‑related blockages

8️⃣ Thread‑Level Analysis (Advanced CPU Debugging)

ps -eLo pid,lwp,stat,%cpu,psr,comm --sort=-%cpu

Use when:

  • Java or Oracle shows high CPU
  • Need hot thread detection
  • Correlating with perf / jstack
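
For the jstack correlation, the LWP value reported by ps is decimal, while HotSpot thread dumps list the native thread ID in hexadecimal as nid=0x... A small conversion helper (the name `lwp_to_nid` is an assumption for illustration) bridges the two:

```shell
# lwp_to_nid LWP
# Converts a decimal LWP (thread ID from `ps -eLo ... lwp ...`) to the
# hexadecimal form used in the nid= field of a jstack thread dump.
lwp_to_nid() {
    printf '0x%x\n' "$1"
}

# Usage (dump file name is a placeholder):
#   grep "nid=$(lwp_to_nid 12345) " thread_dump.txt
```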

9️⃣ Parent‑Child Relationship Analysis

ps -eo pid,ppid,stat,etime,comm --forest

✅ Great for:

  • Detecting fork storms
  • Tracing hung parent processes
  • Understanding service trees

10️⃣ Minimal “Health Check” Command (Quick & Safe)

ps -eo pid,stat,%cpu,%mem,etime,comm --sort=-%cpu | head -15

✅ Safe for production
✅ Quick triage
✅ Covers 80% of issues


🔑 What to Focus On (Cheat Sheet)

| Symptom | Look at |
|---|---|
| High load | %cpu, R state |
| Stuck system | D state, wchan |
| Slowness | %cpu, %mem, etime |
| Hung DB | ora_* + D |
| Memory issues | rss, %mem |
| Defunct processes | Z |

✅ Final Recommendation (What to Remember)

If you remember only ONE command, make it this:

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

This single command gives:

✅ CPU
✅ I/O
✅ Memory
✅ State
✅ Ownership
✅ Runtime
✅ Kernel wait reason

Troubleshoot storage I/O performance issue -- Linux, Oracle


End‑to‑end explanation of the command:

    ps -eo pid,stat,comm | grep D


This is a process inspection command used heavily by Linux, Unix, and database administrators for system and performance troubleshooting.


1️⃣ What is ps?

ps stands for Process Status.
It reports information about currently running processes on a Linux system.

Think of it as a snapshot of processes at the moment you run the command.

📌 Unlike top or htop, ps:

  • Is not interactive
  • Shows a point‑in‑time view
  • Is ideal for scripting and diagnostics

2️⃣ Command Breakdown

ps -eo pid,stat,comm

Let’s split it into parts:


🔹 ps

Invokes the process status utility.


🔹 -e option (select processes)

-e

Means:
Show all processes running on the system

Without -e, ps would only show processes tied to the current terminal (TTY).

Equivalent options:

ps -e
ps -A

Both mean “every process”.


🔹 -o option (custom output format)

-o pid,stat,comm

Means:
Choose which columns to display

Instead of default columns, you explicitly request:

| Field | Meaning |
|---|---|
| pid | Process ID |
| stat | Process state |
| comm | Command name (executable) |

This is extremely useful for focused troubleshooting.


3️⃣ Output Columns (Explained in Depth)

🔸 PID — Process ID

Example:

24567
  • Unique identifier for a process
  • Assigned by the Linux kernel
  • Required to manage or inspect processes

Used in commands like:

kill 24567
strace -p 24567
cat /proc/24567/status

📌 Notes:

  • PID 1 is always the init/systemd process
  • PIDs are reused after processes exit

🔸 STAT — Process State (most important field)

The STAT column shows:

  1. Main execution state
  2. Additional flags

Primary states

| Code | Meaning |
|---|---|
| R | Running or runnable (on CPU or ready) |
| S | Sleeping (waiting for event) |
| D | Uninterruptible sleep (I/O wait) |
| T | Stopped (signal or debugger) |
| Z | Zombie (dead, not cleaned up) |
| I | Idle kernel thread (newer kernels) |

👉 The first letter is the core state.


Modifier flags (can appear after the main letter)

| Flag | Meaning |
|---|---|
| s | Session leader |
| l | Multithreaded (uses threads) |
| + | Foreground process |
| < | High priority |
| N | Low priority |

STAT examples explained

Ss
  • S → sleeping
  • s → session leader
    ✅ Normal background service
Ssl+
  • Sleeping
  • Session leader
  • Multithreaded
  • Foreground task
    ✅ Common for DB or Java processes
D

🚨 Critical

  • Process waiting on kernel I/O
  • Cannot be killed while blocked (even kill -9 only takes effect once the I/O returns)
  • Usually due to:
    • Disk I/O
    • NFS
    • SAN / ASM
    • Kernel storage issue



🔸 COMMAND — Executable Name

Example:

oracle
sshd
ora_w001
  • Shows only the binary name
  • Does NOT include command‑line arguments

For full command line:

ps -eo pid,stat,cmd

📌 Oracle example:

ora_w001

Means:

  • ora_ → Oracle background process
  • w001 → Space management worker (Wnnn); parallel query workers appear as ora_pNNN

4️⃣ Sample Output and Interpretation

PID STAT COMMAND
1 Ss systemd
1023 Ssl oracle
2045 D ora_dbw0

How to read this:

  • systemd → sleeping session leader (normal)
  • oracle → sleeping, multithreaded (normal)
  • ora_dbw0 → D state (problem)
    → Indicates disk or ASM issue

5️⃣ Why this command is widely used

✅ Lightweight and fast

  • No interactive overhead
  • Safe on production systems

✅ Perfect for troubleshooting

  • Detects:
    • Hung processes
    • Storage stalls
    • Zombie accumulation
    • Oracle background issues

✅ Script‑friendly

Used inside:

  • Shell scripts
  • Health checks
  • Cron jobs

6️⃣ Common Enhancements

Show only blocked (D) processes

ps -eo pid,stat,comm | awk '$2 ~ /D/'


Sort by process state

ps -eo pid,stat,comm --sort=stat

Add user and CPU usage

ps -eo user,pid,stat,%cpu,%mem,comm


7️⃣ Practical Use Case (Oracle / DB servers)

DBAs frequently use:

ps -eo pid,stat,comm | grep ora_

To detect:

  • Stuck background workers
  • DBWR/LGWR waiting on disk
  • Parallel query stalls

If many ora_* processes show D: 🚨 Storage team must be involved immediately


✅ Final Summary

| Component | Purpose |
|---|---|
| ps | Show process snapshot |
| -e | Include all processes |
| -o | Customize output |
| pid | Process identifier |
| stat | Execution + wait state |
| comm | Executable name |

🎯 Key troubleshooting signal

  • R, S → Normal
  • D → I/O or kernel problem
  • Z → Parent process issue



Oracle Database Disk Storage Slowness Troubleshooting (RHEL) - I/O issue

 

Command:
ps -eo pid,stat,comm | grep D

Meaning

  • ps -e → show all processes
  • -o pid,stat,comm → display:
    • pid → process ID
    • stat → process state
    • comm → command name
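  • grep D → keep every line that contains an uppercase D anywhere (including the PID header line and command names); use awk '$2 ~ /^D/' instead to match only the STAT column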

What D means

D = Uninterruptible sleep
This usually means the process is:

  • Waiting on I/O
  • Typically stuck on disk, NFS, SAN, or kernel I/O
  • Cannot be killed (kill -9 won’t work) until the I/O returns

This is often serious on production systems.


iostat -xz 1 5


Ss

This shows the Oracle process state at the time of capture:

  • S = sleeping
  • s = session leader (a modifier flag, not a second sleep state)

So the process was waiting (idle or blocked), not crashing at that exact moment.



1. Typical Symptoms (What triggers investigation)

  • High load average on DB server
  • User complaints: slow queries, commits, batch delays
  • AWR shows:
    • db file sequential read
    • db file scattered read
    • log file sync
    • log file parallel write
  • OS metrics show high IO wait (%wa)
  • RMAN / backups running slow

2. Step‑1: Validate System Load & CPU Wait

✅ Identify load average vs cores

uptime
nproc

Interpretation

  • Load ≈ number of CPU cores → OK
  • Load >> cores + high IO wait → likely disk bottleneck

✅ Check CPU & IO wait

top

or (better)

vmstat 1 10

Look for:

  • wa (IO wait) consistently > 15–20%
  • Low id combined with high wa: the CPUs are not doing work, they are waiting on blocked I/O

Example

r  b   swpd   free  buff cache   si so bi bo   in   cs us sy id wa st
8  12     0   812M  122M  18G     0  0  45 620  900 1200 10  6 40 44 0

➡️ High b and wa = blocked on disk


3. Step‑2: Disk Latency at OS Level (Most Important)

✅ iostat – PRIMARY disk latency tool

iostat -xm 1 10

Key columns:

| Metric | Meaning | Problem Threshold |
|---|---|---|
| r_await | Read latency | > 20 ms (OLTP), > 50 ms (DW) |
| w_await | Write latency | > 10–15 ms |
| await | Avg IO latency | > 20 ms |
| %util | Disk busy | > 80–90% sustained |
| aqu-sz | Avg queue size | Growing steadily = queueing |

Example (Bad)

Device:  r/s   w/s  r_await  w_await  await  aqu-sz %util
sdb      420   350   48.12    32.22    40.01   18.3   97.4

➡️ Storage saturation confirmed
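
Because iostat column positions differ between sysstat versions, a filter that looks columns up by header name is more robust than fixed field numbers. The following is a sketch under assumptions: `slow_devices` is a hypothetical helper, and the defaults (20 ms, 80%) are taken from the thresholds in the table above.

```shell
# slow_devices [AWAIT_MS] [UTIL_PCT]
# Reads `iostat -x` output on stdin and prints devices whose r_await or
# w_await exceeds AWAIT_MS, or whose %util exceeds UTIL_PCT.
slow_devices() {
    awk -v max_await="${1:-20}" -v max_util="${2:-80}" '
        NF == 0 { split("", col); next }            # blank line ends a report block
        $1 ~ /^Device/ {                            # header row: map name -> position
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("%util" in col) { next }                  # avg-cpu lines and banners
        {
            ra = ("r_await" in col) ? $(col["r_await"]) : 0
            wa = ("w_await" in col) ? $(col["w_await"]) : 0
            ut = $(col["%util"])
            if (ra + 0 > max_await || wa + 0 > max_await || ut + 0 > max_util)
                print $1, "r_await=" ra, "w_await=" wa, "util=" ut
        }'
}

# Usage:
#   iostat -xm 1 5 | slow_devices 20 80
```

Run against the example above, only sdb would be reported, since both its latency and its utilisation are over the limits.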


4. Step‑3: Identify Which Filesystems / Disks

✅ Map disks → mount points

df -hT
lsblk -f

✅ Per‑device IO usage (map devices back to filesystems with the commands above)

iostat -xm 1 10 | grep -E "sd|nvme"

Check:

  • Datafiles disk
  • Redo log disk
  • FRA disk
  • Temp disk

5. Step‑4: Per‑Process Confirmation (Oracle vs others)

✅ pidstat – correlate Oracle background processes

pidstat -d 1 10 | grep ora_

Key offenders:

  • ora_dbw* → datafile writes
  • ora_lgwr → redo log writes
  • ora_ckpt
  • RMAN channels

High KB/s + delays = database IO bottleneck


6. Step‑5: Advanced Disk & Queue Observation

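✅ sar (live sampling, or historical if sysstat data collection is enabled)

sar -d 1 5

The command above samples live. For history, read a saved daily file with sar -d -f <file>; on RHEL these files are typically kept under /var/log/sa/.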

✅ IO pressure (RHEL 8+)

cat /proc/pressure/io

If avg10 and avg60 > 10–20 → sustained storage pressure
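
The pressure file has two lines ("some" and "full"), each with avg10, avg60, avg300 and total fields. A short parser makes the values usable in scripts. This is a sketch: `io_pressure_some` is a hypothetical name, and it reads only the "some" line.

```shell
# io_pressure_some [FIELD]
# Reads the content of /proc/pressure/io on stdin and prints the requested
# field (avg10 by default; also avg60, avg300, total) from the "some" line.
io_pressure_some() {
    awk -v want="${1:-avg10}" '$1 == "some" {
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")       # each field looks like key=value
            if (kv[1] == want) print kv[2]
        }
    }'
}

# Usage:
#   io_pressure_some avg60 < /proc/pressure/io
```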


7. Step‑6: Oracle Database Wait Event Validation

✅ Top waits (Instance level)

SELECT event, total_waits, time_waited/100 AS time_waited_sec
FROM v$system_event
WHERE event LIKE 'db file%'
OR event LIKE 'log file%'
ORDER BY time_waited DESC;


✅ Real‑time waits (active sessions)

SELECT sid, event, wait_time, seconds_in_wait
FROM v$session
WHERE wait_class = 'User I/O'
ORDER BY seconds_in_wait DESC;


8. Step‑7: File Type & Latency inside Oracle

✅ File-level IO latency

SELECT df.name,
fs.phyrds,
fs.phywrts,
fs.readtim/100 AS read_sec,
fs.writetim/100 AS write_sec
FROM v$datafile df, v$filestat fs
WHERE df.file# = fs.file#
ORDER BY fs.readtim DESC;

✅ Tablespace hotspot

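SELECT tablespace_name,
SUM(CASE WHEN statistic_name = 'physical reads' THEN value ELSE 0 END) reads,
SUM(CASE WHEN statistic_name = 'physical writes' THEN value ELSE 0 END) writes
FROM v$segment_statistics
WHERE statistic_name IN ('physical reads', 'physical writes')
GROUP BY tablespace_name
ORDER BY reads DESC;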

9. Step‑8: Redo Log Latency (Very Common OLTP Issue)

✅ LGWR wait

SELECT event, total_waits, time_waited/100 AS time_waited_sec
FROM v$system_event
WHERE event IN ('log file sync','log file parallel write');

Interpretation

  • log file sync waits high → commit delayed
  • log file parallel write high → redo disk slow

✅ Validate redo disks with:

iostat -xm 1 10 <redo_disk>


10. Step‑9: ASM (If Applicable)

✅ ASM disk stats

SELECT name, total_mb, free_mb, read_errs, write_errs
FROM v$asm_disk;

✅ ASM IO latency

SELECT dg.name, fs.*
FROM v$asm_diskgroup dg, v$asm_disk_iostat fs
WHERE dg.group_number = fs.group_number;


11. Correlation Checklist (OS ⇄ Oracle)

Disk problem confirmed if ALL match

  • High %wa in vmstat
  • High await in iostat
  • High db file* or log file* waits in Oracle
  • %util near 100% on affected disks
  • Load average high but CPU idle present

12. Common Root Causes

| Cause | How it Appears |
|---|---|
| Storage array saturation | High await + util |
| Poor redo disk | log file sync waits |
| Temp spills | direct path read temp / direct path write temp |
| RMAN / backup | DBWn writes spike |
| Thin provisioning | Latency spikes under load |
| Too many LUNs on same backend | Random latency |

13. Immediate Mitigations

✅ Short‑term:

  • Pause backups / RMAN
  • Kill runaway sessions
  • Reduce parallelism
  • Move redo logs to faster disks

✅ Medium‑term:

  • Separate redo, data, temp
  • Increase redo log size
  • Add disks / IOPS
  • ASM rebalance / re‑stripe

✅ Long‑term:

  • Storage tiering (NVMe for redo)
  • Oracle I/O calibration (ORION)
  • Capacity & growth planning

14. One‑Command Quick Triage Bundle

uptime
vmstat 1 5
iostat -xm 1 5
pidstat -d 1 5 | grep ora_

Then Oracle

SELECT event, time_waited/100 AS wait_sec
FROM v$system_event
WHERE wait_class <> 'Idle'
ORDER BY time_waited DESC;



15. Key Rule of Thumb (Production Oracle)

| IO Type | Acceptable Latency |
|---|---|
| Redo writes | < 5 ms |
| OLTP reads | < 10–15 ms |
| Mixed workload | < 20 ms |
| Anything > 30 ms | Problem |
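
The rule of thumb can be encoded as a quick classifier for use in health-check scripts. This is only an illustration: the function name, the type keywords, and the middle "investigate" band (between the acceptable limit and 30 ms) are assumptions, and the OLTP read limit uses the upper end of the 10–15 ms range.

```shell
# classify_latency TYPE MS      (TYPE: redo | oltp_read | mixed)
# Prints "acceptable", "investigate" or "problem" for a measured latency.
classify_latency() {
    awk -v type="$1" -v ms="$2" 'BEGIN {
        limit["redo"] = 5; limit["oltp_read"] = 15; limit["mixed"] = 20
        if (!(type in limit)) { print "unknown-type"; exit 1 }
        if (ms + 0 > 30)               print "problem"
        else if (ms + 0 < limit[type]) print "acceptable"
        else                           print "investigate"
    }'
}

# Usage:
#   classify_latency redo 7
```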
