Wednesday, April 29, 2026

Oracle Database CPU Performance Troubleshooting

Oracle Database CPU Performance Troubleshooting (RHEL)

1. Typical CPU Performance Symptoms

High CPU utilization (near 100%)
Load average increases rapidly
Slow query response while disk latency is normal
AWR shows:
- DB CPU
- CPU time
- resmgr:cpu quantum
OS %idle very low and %wa low (CPU, not IO)

2. Step‑1: Confirm CPU Pressure at OS Level

✅ Check load vs CPU cores

uptime

nproc

lscpu | egrep 'CPU$s$|Core|Socket|Thread'

Interpretation

Load average ≤ CPU cores → system OK
Load average >> CPU cores → CPU contention

Example:

load average: 48.21, 45.12, 40.90
CPU cores: 24

➡️ CPU saturation confirmed

3. Step‑2: Identify CPU Utilization & Wait

✅ top (quick overview)Focus on top line:

top

%Cpu(s): 92.4 us, 3.1 sy, 0.2 ni, 2.0 id, 1.5 wa

CPU‑bound system

%us + %sy high
%id near 0
%wa low

✅ vmstat – verify run queue

vmstat 1 10

Key columns:

Column	Meaning
`r`	Runnable processes
`us`	User CPU
`sy`	System CPU
`id`	Idle
`wa`	IO wait

If r > CPU cores consistently → CPU contention

4. Step‑3: Identify Top CPU Consumers

✅ Per‑process CPU usage

ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -20

Look for:

Oracle foreground sessions
Oracle background processes (ora_*)
Non‑DB processes stealing CPU

✅ Real‑time per‑process view

top -H -p $(pgrep -d',' -f ora_)

This shows Oracle threads consuming CPU.

5. Step‑4: Per‑CPU & Core Imbalance

✅ mpstat (very important on NUMA)

mpstat -P ALL 1 5

Look for:

Some CPUs at 100%
Others mostly idle

➡️ Indicates:

NUMA imbalance
CPU affinity or pinning problem

6. Step‑5: Context Switching & Syscall Overhead

✅ pidstat – CPU & context switches

pidstat -u -w 1 5

High:

cswch/s (context switches)
nvcswch/s

➡️ Too many active sessions or excessive parsing

7. Step‑6: Oracle Side – Is Database the CPU Consumer?

✅ CPU usage inside Oracle

SELECT stat_name, value

FROM v$sysstat

WHERE stat_name IN ('CPU used by this session','DB CPU');

Check ratio:

DB CPU close to total DB time → CPU bottleneck
IO waits dominant → not CPU issue

8. Step‑7: Identify Oracle Sessions Using CPU

✅ Active CPU sessions

SELECT s.sid, s.serial#, s.username,

s.program, s.status,

p.spid OS_PID

FROM v$session s, v$process p

WHERE s.paddr = p.addr

AND s.status = 'ACTIVE'

AND s.wait_class = 'CPU'

ORDER BY s.last_call_et DESC;

🔁 Correlate SPID with OS:

top -p <spid>

9. Step‑8: SQL Responsible for CPU Usage

✅ Top CPU SQL

SELECT sql_id,

cpu_time/1000000 cpu_sec,

executions,

cpu_time/executions avg_cpu

FROM v$sql

ORDER BY cpu_time DESC

FETCH FIRST 10 ROWS ONLY;

Red flags:

High avg_cpu
Low executions but high total CPU
Cartesian joins
Missing indexes

10. Step‑9: Run Queue vs CPU Throttling

✅ Check cgroups / VM CPU limits

cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us

cat /sys/fs/cgroup/cpu/cpu.cfs_period_us

Quota < total cores → CPU throttling

11. Step‑10: Oracle Resource Manager (Very Common)

✅ Check RM activation

SHOW PARAMETER resource_manager_plan;

✅ CPU waits due to RM

SELECT event, total_waits, time_waited/100 time_sec

FROM v$system_event

WHERE event LIKE 'resmgr%';

If high:

Sessions are being CPU throttled intentionally

12. Common CPU Root Causes

Cause	Symptom
Bad execution plan	Single SQL consumes most CPU
Missing index	Full scans
High parse rate	Hard parsing
Too many sessions	Context switching
CPU oversubscription	VM / container CPU cap
NUMA imbalance	Some CPUs overloaded

13. Immediate Mitigation Actions

✅ Short‑term:

Kill runaway sessions
Reduce parallelism
Disable unnecessary jobs
Temporarily increase CPU shares

✅ Medium‑term:

SQL tuning (indexes, hints)
Bind variables
Increase cursor cache
Enable Result Cache (where valid)

✅ Long‑term:

SQL baselines
CPU capacity increase
NUMA pinning
Resource Manager redesign

14. One‑Shot CPU Triage Command Set

uptime

vmstat 1 5

mpstat -P ALL 1 5

top -H -p $(pgrep -d',' -f ora_)

Oracle:

SELECT sql_id, cpu_time/1000000 cpu_sec

FROM v$sql

ORDER BY cpu_time DESC FETCH FIRST 5 ROWS ONLY;

15. CPU vs IO – Golden Rule

Observation	Root Cause
High `%us`, low `%wa`	CPU bottleneck
High load, CPUs idle	IO bottleneck
High `DB CPU` in AWR	SQL tuning required
RM waits	Resource controls

Monday, April 27, 2026

Production Server/Database/Application troubleshooting Runbook for Issue like CPU, Memory, I/o , Kernel

0️⃣ Runbook Objectives

This runbook helps you:

✅ Quickly identify CPU, I/O, memory, or process issues
✅ Correlate OS metrics with database / application symptoms
✅ Avoid random commands during incidents
✅ Reach root cause, not just symptom relief

1️⃣ Incident Intake (ALWAYS FIRST)

Before touching the system, collect:

Question	Why
What is impacted? (DB, app, batch, login)	Scope
Since when?	Time correlation
All users or subset?	Severity
Any recent changes?	Deploy / patch
Error messages?	Symptom confirmation

📌 Do not skip this step. It saves 30–40% time later.

2️⃣ High‑Level System Health Snapshot (30 seconds)

2.1 Uptime & Load

uptime

Focus on:

Load average vs CPU cores
Sudden spike timeframe

✅ Load > CPU count → investigate
✅ Load + low CPU → likely I/O wait

2.2 Disk Space (quick sanity)

df -h

🚨 Any filesystem ≥ 90% → fix immediately

3️⃣ Real‑Time CPU & Memory View (`top`)

top

3.1 CPU Line Interpretation

%Cpu(s): 12 us, 5 sy, 0 ni, 60 id, 23 wa

Metric	Meaning
`us`	App/SQL CPU
`sy`	Kernel CPU
`id`	Idle
`wa`	I/O wait

🚨 wa > 15% → jump to I/O section

3.2 Process Area

Sort by CPU: Shift + P
Sort by MEM: Shift + M

Note:

PID
%CPU
%MEM
COMMAND

👉 Take PID(s) for next step.

4️⃣ Exact Process Diagnosis (`ps`)

✅ Master Command (use this by default)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

What to Look For

Symptom	Indicator
High CPU	`%cpu`, `R` state
Hung process	`D` state
Long running	High `etime`
I/O wait	`wchan` = io_schedule
Zombie	`Z`

🔴 Critical: Find Blocked (I/O) Processes

ps -eo pid,stat,wchan,etime,comm | awk '$2 ~ /D/'

If you see:

ora_*
java
mysqld

➡️ Storage / filesystem issue almost certain

5️⃣ Kernel Stress View (`vmstat`)

vmstat 1 5

Key Columns

Column	Meaning
`r`	Runnable processes
`b`	Blocked (I/O wait)
`si/so`	Swap usage
`wa`	I/O wait

Interpretation Rules

Observation	Meaning
`b > 0`	Processes stuck in I/O
`wa high`	Disk latency
`si/so > 0`	Memory pressure
`r > CPU cores`	CPU contention

6️⃣ Storage Diagnosis (`iostat`)

iostat -xz 1 5

Critical Metrics

Metric	Bad Threshold
`%util`	> 80%
`await`	> 20 ms
`await >> svctm`	Queueing issue

Conclusions

Pattern	Root Cause
High await + D state	Storage latency
High util	Disk saturation
NFS disks slow	Network / mount issue

🚨 DBWR / LGWR in D state = immediate escalation

7️⃣ Memory Focused Check

ps -eo pid,stat,rss,vsz,%mem,comm --sort=-rss | head

If:

RSS very high
Swap active

➡️ Tune memory / restart leaking service

8️⃣ Oracle / Database‑Specific Quick Checks

8.1 Oracle Processes

ps -eo pid,stat,%cpu,wchan,etime,comm | grep ora_

8.2 Dangerous Signs

Process	Issue
ora_dbw* in D	Datafile I/O
ora_lgwr in D	Redo disk
Many ora_w* in D	Parallel I/O stall

➡️ Do NOT bounce DB blindly

9️⃣ Decision Matrix (Very Important)

Observation	Action
High CPU, no D	Tune app/SQL
High wa + D	Storage escalation
Z processes	Restart parent
Swap active	Add memory / reduce usage
Disk full	Cleanup immediately

🔔 Escalation Triggers

Escalate to Storage / Infra when:

D state persists > 5 minutes
await > 50 ms
DB background processes blocked

Escalate to App/DB Team when:

CPU us > 80%
Single PID dominating CPU
SQL identified as hot spot

✅ One‑Glance Incident Command Set

uptime

top

ps -eo pid,stat,%cpu,%mem,wchan,comm --sort=-%cpu

vmstat 1 5

iostat -xz 1 5

🧠 Golden Rule (Remember This)

High load does NOT always mean CPU problem.
D state + wa = storage until proven otherwise.

Step by step troubleshoot performance issue Linux , Oracle - CPU , Memory, I/O

🔍 Correlation of `ps`, `top`, `iostat`, and `vmstat`

Mental Model (Very Important)

Tool	Answers the Question
ps	Which exact process is responsible?
top	Is the problem CPU or memory pressure right now?
iostat	Is storage slow or saturated?
vmstat	Is the kernel under memory / run‑queue / I/O stress?

👉 Never use only one tool.
Real root cause comes from correlating outputs.

1️⃣ `top` — Real‑Time CPU & Memory Pressure

Best usage

top

(or top -o %CPU on newer systems)

What to focus on (top header)

%Cpu(s): 85.2 us, 10.1 sy,  0.0 ni,  2.0 id,  2.5 wa

Field	Meaning
`us`	User CPU (app / DB code)
`sy`	Kernel CPU
`id`	Idle CPU
`wa`	I/O wait (very important)

Interpretation

✅ High us → application or SQL CPU
✅ High sy → kernel, system calls, networking
🚨 High wa → storage problem, not CPU

Process section (top half)

PID   USER  %CPU  %MEM  COMMAND
2456  oracle 180.2 12.3 ora_dbw0

Now you jump to ps.

2️⃣ `ps` — Identify the Exact Culprit

Correlate with:

ps -eo pid,ppid,stat,%cpu,%mem,etime,wchan,comm | sort -k5 -nr | head

Key correlation

`top` shows	`ps` confirms
High CPU PID	`%cpu`, `etime`
Hung process	`STAT = D`
Storage wait	`wchan = io_schedule`

🚨 Example:

2456  D  io_schedule  ora_dbw0

👉 This tells you:
DBWR is blocked on disk I/O

Now you must check storage.

3️⃣ `iostat` — Storage Bottleneck Detection

Best command

iostat -xz 1 5

Critical columns

Column	Meaning
`%util`	Disk busy time
`await`	Avg I/O latency (ms)
`svctm`	Disk service time
`r/s w/s`	Read/write rate

Interpretation Rules (Golden)

Symptom	Meaning
`%util > 80%`	Disk saturated
`await > 20 ms`	Storage slow
`await >> svctm`	Queueing problem
High writes + DBWR stuck	Redo / data disk issue

🚨 Example:

sda  %util=99.8  await=120ms

✅ Confirms ps + top → storage root cause

4️⃣ `vmstat` — Kernel Stress & Memory I/O

Best command

vmstat 1 5

Key columns

r  b   swpd   free   buff  cache  si so   bi bo   in cs us sy id wa

Important fields explained

Column	Meaning
`r`	Run queue (CPU demand)
`b`	Blocked processes (I/O)
`si/so`	Swap in/out
`bi/bo`	Block I/O
`wa`	I/O wait (kernel view)

Correlation logic

vmstat shows	Combined meaning
`b > 0`	Processes stuck in I/O
`wa high`	CPU waiting for disk
`r > CPU cores`	CPU contention
`si/so > 0`	Memory pressure

🚨 Example:

r=1 b=6 wa=40

👉 Matches:

ps → many D
top → high I/O wait
iostat → high disk latency

🎯 Root cause confirmed: storage

5️⃣ End‑to‑End Correlation Scenarios

✅ Scenario A: High Load Average

Observations

uptime → load = 20
top → CPU idle
vmstat → b=10, wa=35
ps → many D state
iostat → high await

✅ Conclusion
Load is from I/O wait, not CPU
👉 Storage team issue

✅ Scenario B: CPU Spike

Observations

top → %us=90
vmstat → r > CPU cores
ps → process in R state
iostat → normal

✅ Conclusion
Pure CPU problem
👉 Tune SQL / app / threads

✅ Scenario C: Hung Oracle Instance

Observations

ps → ora_dbw0, ora_lgwr in D
vmstat → b > 5
iostat → redo disk latency
top → high wa

✅ Conclusion
Redo or data disk I/O stall
👉 SAN / ASM / NFS issue

6️⃣ Golden Troubleshooting Workflow (Memorize This)

symptom →

top →

ps →

vmstat →

iostat →

root cause

One‑liner sequence

top

ps -eo pid,stat,%cpu,wchan,comm | grep D

vmstat 1 5

iostat -xz 1 5

✅ Final Cheat Sheet

Tool	Best for
`top`	Live CPU/memory
`ps`	Exact process & state
`vmstat`	Kernel & wait queues
`iostat`	Disk latency & saturation

🎯 Never trust a single tool
Real diagnosis = correlation

Troubleshooting CPU - I/O : Best linux ps Command Arguments for Troubleshooting - One‑Command “Master View” (Highly Recommended)

One‑Command “Master View” (Highly Recommended)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,lstart,wchan,comm --sort=-%cpu

🔍 What it shows (and why it matters)

Field	Why it’s important
`user`	Who owns the process
`pid`	Process ID
`ppid`	Parent process (helps detect orphans)
`stat`	Process state (R/S/D/Z/T)
`%cpu`	CPU consumption
`%mem`	Memory usage
`etime`	How long the process has been running
`lstart`	Exact start time
`wchan`	Kernel wait channel (I/O diagnosis)
`comm`	Executable name
`--sort=-%cpu`	Top CPU consumers first

✅ This is your best single snapshot for general troubleshooting

2️⃣ CPU Troubleshooting (High CPU / Run Queues)

ps -eo pid,ppid,stat,psr,pri,ni,%cpu,time,comm --sort=-%cpu | head -20

Key columns

Column	Meaning
`psr`	Which CPU core it’s running on
`pri`	Kernel priority
`ni`	Nice value
`time`	Total CPU time consumed

✅ Use when:

Load average is high
CPU is saturated
Performance complaints

3️⃣ I/O Troubleshooting (MOST CRITICAL)

🔥 Identify blocked processes (`D` state)

ps -eo pid,stat,wchan,%cpu,etime,comm | awk '$2 ~ /D/'

Why this is powerful

Field	Purpose
`D`	Uninterruptible sleep (I/O wait)
`wchan`	What kernel function it’s stuck on
`etime`	How long it has been blocked

Common `wchan` values and meaning

wchan	Meaning
`io_schedule`	Disk I/O wait
`wait_on_page_bit`	Memory/disk interaction
`nfs_wait`	NFS hang
`blk_mq_get_tag`	Storage queue congestion

🚨 If Oracle or DB processes appear here → storage issue almost guaranteed

4️⃣ Memory & Leak Detection

ps -eo pid,ppid,stat,rss,vsz,%mem,comm --sort=-rss | head -20

Key fields

Field	Meaning
`rss`	Real memory in KB
`vsz`	Virtual memory
`%mem`	RAM usage

✅ Use when:

System is swapping
OOM killer events
Slow performance despite low CPU

5️⃣ Full Command, Arguments & Environment

ps -eo pid,stat,%cpu,%mem,cmd --sort=-%cpu

Why this matters:

cmd shows complete arguments
Crucial for:
- Java tuning
- Oracle startup flags
- Application misconfiguration

6️⃣ Zombie Process Detection

ps -eo pid,ppid,stat,etime,comm | awk '$3 ~ /Z/'

Why care?

Zombies indicate parent process bug
Can exhaust PID space
Need parent restart (not kill)

7️⃣ Oracle / Database‑Focused View (DBA Favorite)

ps -eo pid,stat,%cpu,%mem,etime,wchan,comm | grep ora_

✅ Detects:

DBWR / LGWR I/O stalls
Parallel worker hangs
Backup‑related blockages

8️⃣ Thread‑Level Analysis (Advanced CPU Debugging)

ps -eLo pid,lwp,stat,%cpu,psr,comm --sort=-%cpu

Use when:

Java or Oracle shows high CPU
Need hot thread detection
Correlating with perf / jstack

9️⃣ Parent‑Child Relationship Analysis

ps -eo pid,ppid,stat,etime,comm --forest

✅ Great for:

Detecting fork storms
Tracing hung parent processes
Understanding service trees

10️⃣ Minimal “Health Check” Command (Quick & Safe)

ps -eo pid,stat,%cpu,%mem,etime,comm --sort=-%cpu | head -15

✅ Safe for production
✅ Quick triage
✅ Covers 80% of issues

🔑 What to Focus On (Cheat Sheet)

Symptom	Look at
High load	`%cpu`, `R` state
Stuck system	`D` state, `wchan`
Slowness	`%cpu`, `%mem`, `etime`
Hung DB	`ora_*` + `D`
Memory issues	`rss`, `%mem`
Defunct processes	`Z`

✅ Final Recommendation (What to Remember)

If you remember only ONE command, make it this:

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

This single command gives: ✅ CPU
✅ I/O
✅ Memory
✅ State
✅ Ownership
✅ Runtime
✅ Kernel wait reason

Troubleshoot storage I/O performance issue -- Linux , Oracle

Explain end‑to‑end explanation of the command :

ps -eo pid,stat,comm | grep D

This is a process inspection command used heavily by Linux, Unix, and database administrators for system and performance troubleshooting.

1️⃣ What is `ps`?

ps stands for Process Status.
It reports information about currently running processes on a Linux system.

Think of it as a snapshot of processes at the moment you run the command.

📌 Unlike top or htop, ps:

Is not interactive
Shows a point‑in‑time view
Is ideal for scripting and diagnostics

2️⃣ Command Breakdown

ps -eo pid,stat,comm

Let’s split it into parts:

🔹 `ps`

Invokes the process status utility.

🔹 `-e` option (select processes)

-e

Means:
✅ Show all processes running on the system

Without -e, ps would only show processes tied to the current terminal (TTY).

Equivalent options:

ps -e

ps -A

All mean “every process”.

🔹 `-o` option (custom output format)

-o pid,stat,comm

Means:
✅ Choose which columns to display

Instead of default columns, you explicitly request:

Field	Meaning
`pid`	Process ID
`stat`	Process state
`comm`	Command name (executable)

This is extremely useful for focused troubleshooting.

3️⃣ Output Columns (Explained in Depth)

🔸 PID — Process ID

Example:

Unique identifier for a process
Assigned by the Linux kernel
Required to manage or inspect processes

Used in commands like:

kill 24567

strace -p 24567

cat /proc/24567/status

📌 Notes:

PID 1 is always the init/systemd process
PIDs are reused after processes exit

🔸 STAT — Process State (most important field)

The STAT column shows:

Main execution state
Additional flags

Primary states

Code	Meaning
`R`	Running or runnable (on CPU or ready)
`S`	Sleeping (waiting for event)
`D`	Uninterruptible sleep (I/O wait)
`T`	Stopped (signal or debugger)
`Z`	Zombie (dead, not cleaned up)
`I`	Idle kernel thread (newer kernels)

👉 The first letter is the core state.

Modifier flags (can appear after the main letter)

Flag	Meaning
`s`	Session leader
`l`	Multithreaded (uses threads)
`+`	Foreground process
`<`	High priority
`N`	Low priority

STAT examples explained

Ss

S → sleeping
s → session leader
✅ Normal background service

Ssl+

Sleeping
Session leader
Multithreaded
Foreground task
✅ Common for DB or Java processes

🚨 Critical

Process waiting on kernel I/O
Cannot be killed (even kill -9)
Usually due to:

Disk I/O
NFS
SAN / ASM
Kernel storage issue

Examples

Ss

→ Sleeping, session leader

→ Blocked on I/O (disk, NFS, storage). Very important state

Ssl+

→ Sleeping, session leader, multithreaded, foreground job

📌 Critical note
If a process is in D state, it:

Cannot be killed (kill -9 won’t work)
Is usually waiting on disk, SAN, ASM, or NFS
Indicates storage or kernel-level issues

🔸 COMMAND — Executable Name

Example:

oracle
sshd
ora_w00l

Shows only the binary name
Does NOT include command‑line arguments

For full command line:

ps -eo pid,stat,cmd

📌 Oracle example:

ora_w00l

Means:

ora_ → Oracle process
w00l → Parallel/worker process

4️⃣ Sample Output and Interpretation

PID STAT COMMAND

1 Ss systemd

1023 Ssl oracle

2045 D ora_dbw0

How to read this:

systemd → sleeping session leader (normal)
oracle → sleeping, multithreaded (normal)
ora_dbw0 → D state (problem)
→ Indicates disk or ASM issue

5️⃣ Why this command is widely used

✅ Lightweight and fast

No interactive overhead
Safe on production systems

✅ Perfect for troubleshooting

Detects:
- Hung processes
- Storage stalls
- Zombie accumulation
- Oracle background issues

✅ Script‑friendly

Used inside:

Shell scripts
Health checks
Cron jobs

6️⃣ Common Enhancements

Show only blocked (`D`) processes

ps -eo pid,stat,comm | awk '$2 ~ /D/'

Sort by process state

ps -eo pid,stat,comm --sort=stat

Add user and CPU usage

ps -eo user,pid,stat,%cpu,%mem,comm

7️⃣ Practical Use Case (Oracle / DB servers)

DBAs frequently use:

ps -eo pid,stat,comm | grep ora_

To detect:

Stuck background workers
DBWR/LGWR waiting on disk
Parallel query stalls

If many ora_* processes show D: 🚨 Storage team must be involved immediately

✅ Final Summary

Component	Purpose
`ps`	Show process snapshot
`-e`	Include all processes
`-o`	Customize output
`pid`	Process identifier
`stat`	Execution + wait state
`comm`	Executable name

🎯 Key troubleshooting signal

R, S → Normal
D → I/O or kernel problem
Z → Parent process issue

PID → Unique process identifier
STAT → Current state + extra flags (critical for troubleshooting)
COMMAND → Executable name

🎯 For troubleshooting:

R / S → Normal
D → Investigate immediately
Z → Parent process issue

Wednesday, April 29, 2026

Oracle Database CPU Performance Troubleshooting (RHEL)

1. Typical CPU Performance Symptoms

2. Step‑1: Confirm CPU Pressure at OS Level

✅ Check load vs CPU cores

3. Step‑2: Identify CPU Utilization & Wait

✅ top (quick overview)Focus on top line:

✅ vmstat – verify run queue

4. Step‑3: Identify Top CPU Consumers

✅ Per‑process CPU usage

✅ Real‑time per‑process view

5. Step‑4: Per‑CPU & Core Imbalance

✅ mpstat (very important on NUMA)

6. Step‑5: Context Switching & Syscall Overhead

✅ pidstat – CPU & context switches

7. Step‑6: Oracle Side – Is Database the CPU Consumer?

✅ CPU usage inside Oracle

8. Step‑7: Identify Oracle Sessions Using CPU

✅ Active CPU sessions

9. Step‑8: SQL Responsible for CPU Usage

✅ Top CPU SQL

10. Step‑9: Run Queue vs CPU Throttling

✅ Check cgroups / VM CPU limits

11. Step‑10: Oracle Resource Manager (Very Common)

✅ Check RM activation

✅ CPU waits due to RM

12. Common CPU Root Causes

13. Immediate Mitigation Actions

14. One‑Shot CPU Triage Command Set

15. CPU vs IO – Golden Rule

Monday, April 27, 2026

0️⃣ Runbook Objectives

1️⃣ Incident Intake (ALWAYS FIRST)

2️⃣ High‑Level System Health Snapshot (30 seconds)

2.1 Uptime & Load

2.2 Disk Space (quick sanity)

3️⃣ Real‑Time CPU & Memory View (top)

3.1 CPU Line Interpretation

3.2 Process Area

4️⃣ Exact Process Diagnosis (ps)

✅ Master Command (use this by default)

What to Look For

🔴 Critical: Find Blocked (I/O) Processes

5️⃣ Kernel Stress View (vmstat)

Key Columns

Interpretation Rules

6️⃣ Storage Diagnosis (iostat)

Critical Metrics

Conclusions

7️⃣ Memory Focused Check

8️⃣ Oracle / Database‑Specific Quick Checks

8.1 Oracle Processes

8.2 Dangerous Signs

9️⃣ Decision Matrix (Very Important)

🔔 Escalation Triggers

✅ One‑Glance Incident Command Set

🧠 Golden Rule (Remember This)

🔍 Correlation of ps, top, iostat, and vmstat

Mental Model (Very Important)

1️⃣ top — Real‑Time CPU & Memory Pressure

Best usage

What to focus on (top header)

Interpretation

Process section (top half)

2️⃣ ps — Identify the Exact Culprit

Correlate with:

Key correlation

3️⃣ iostat — Storage Bottleneck Detection

Best command

Critical columns

Interpretation Rules (Golden)

4️⃣ vmstat — Kernel Stress & Memory I/O

Best command

Key columns

Important fields explained

Correlation logic

5️⃣ End‑to‑End Correlation Scenarios

✅ Scenario A: High Load Average

Observations

✅ Scenario B: CPU Spike

3️⃣ Real‑Time CPU & Memory View (`top`)

4️⃣ Exact Process Diagnosis (`ps`)

5️⃣ Kernel Stress View (`vmstat`)

6️⃣ Storage Diagnosis (`iostat`)

🔍 Correlation of `ps`, `top`, `iostat`, and `vmstat`

1️⃣ `top` — Real‑Time CPU & Memory Pressure

2️⃣ `ps` — Identify the Exact Culprit

3️⃣ `iostat` — Storage Bottleneck Detection

4️⃣ `vmstat` — Kernel Stress & Memory I/O

🔥 Identify blocked processes (`D` state)

Common `wchan` values and meaning

1️⃣ What is `ps`?

🔹 `ps`

🔹 `-e` option (select processes)

🔹 `-o` option (custom output format)

Show only blocked (`D`) processes