Wednesday, April 29, 2026

Oracle Database CPU Performance Troubleshooting

 

Oracle Database CPU Performance Troubleshooting (RHEL)

1. Typical CPU Performance Symptoms

  • High CPU utilization (near 100%)
  • Load average increases rapidly
  • Slow query response while disk latency is normal
  • AWR shows:
    • DB CPU
    • CPU time
    • resmgr:cpu quantum
  • OS %idle very low and %wa low (CPU, not IO)

2. Step‑1: Confirm CPU Pressure at OS Level

✅ Check load vs CPU cores

uptime
nproc
lscpu | egrep 'CPU\(s\)|Core|Socket|Thread'

Interpretation

  • Load average ≤ CPU cores → system OK
  • Load average >> CPU cores → CPU contention

Example:

load average: 48.21, 45.12, 40.90
CPU cores: 24

➡️ CPU saturation confirmed


3. Step‑2: Identify CPU Utilization & Wait

✅ top (quick overview)Focus on top line:


top
%Cpu(s): 92.4 us, 3.1 sy, 0.2 ni, 2.0 id, 1.5 wa

CPU‑bound system

  • %us + %sy high
  • %id near 0
  • %wa low

✅ vmstat – verify run queue

vmstat 1 10

Key columns:

ColumnMeaning
rRunnable processes
usUser CPU
sySystem CPU
idIdle
waIO wait

If r > CPU cores consistently → CPU contention


4. Step‑3: Identify Top CPU Consumers

✅ Per‑process CPU usage

ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -20

Look for:

  • Oracle foreground sessions
  • Oracle background processes (ora_*)
  • Non‑DB processes stealing CPU

✅ Real‑time per‑process view

top -H -p $(pgrep -d',' -f ora_)

This shows Oracle threads consuming CPU.


5. Step‑4: Per‑CPU & Core Imbalance

✅ mpstat (very important on NUMA)

mpstat -P ALL 1 5

Look for:

  • Some CPUs at 100%
  • Others mostly idle

➡️ Indicates:

  • NUMA imbalance
  • CPU affinity or pinning problem

6. Step‑5: Context Switching & Syscall Overhead

✅ pidstat – CPU & context switches

pidstat -u -w 1 5

High:

  • cswch/s (context switches)
  • nvcswch/s

➡️ Too many active sessions or excessive parsing


7. Step‑6: Oracle Side – Is Database the CPU Consumer?

✅ CPU usage inside Oracle

SELECT stat_name, value
FROM v$sysstat
WHERE stat_name IN ('CPU used by this session','DB CPU');

Check ratio:

  • DB CPU close to total DB time → CPU bottleneck
  • IO waits dominant → not CPU issue

8. Step‑7: Identify Oracle Sessions Using CPU

✅ Active CPU sessions

SELECT s.sid, s.serial#, s.username,
s.program, s.status,
p.spid OS_PID
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND s.status = 'ACTIVE'
AND s.wait_class = 'CPU'
ORDER BY s.last_call_et DESC;

🔁 Correlate SPID with OS:

top -p <spid>


9. Step‑8: SQL Responsible for CPU Usage

✅ Top CPU SQL

SELECT sql_id,
cpu_time/1000000 cpu_sec,
executions,
cpu_time/executions avg_cpu
FROM v$sql
ORDER BY cpu_time DESC
FETCH FIRST 10 ROWS ONLY;

Red flags:

  • High avg_cpu
  • Low executions but high total CPU
  • Cartesian joins
  • Missing indexes

10. Step‑9: Run Queue vs CPU Throttling

✅ Check cgroups / VM CPU limits

cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us

Quota < total cores → CPU throttling


11. Step‑10: Oracle Resource Manager (Very Common)

✅ Check RM activation

SHOW PARAMETER resource_manager_plan;

✅ CPU waits due to RM

SELECT event, total_waits, time_waited/100 time_sec
FROM v$system_event
WHERE event LIKE 'resmgr%';

If high:

  • Sessions are being CPU throttled intentionally

12. Common CPU Root Causes

CauseSymptom
Bad execution planSingle SQL consumes most CPU
Missing indexFull scans
High parse rateHard parsing
Too many sessionsContext switching
CPU oversubscriptionVM / container CPU cap
NUMA imbalanceSome CPUs overloaded

13. Immediate Mitigation Actions

✅ Short‑term:

  • Kill runaway sessions
  • Reduce parallelism
  • Disable unnecessary jobs
  • Temporarily increase CPU shares

✅ Medium‑term:

  • SQL tuning (indexes, hints)
  • Bind variables
  • Increase cursor cache
  • Enable Result Cache (where valid)

✅ Long‑term:

  • SQL baselines
  • CPU capacity increase
  • NUMA pinning
  • Resource Manager redesign

14. One‑Shot CPU Triage Command Set

uptime
vmstat 1 5
mpstat -P ALL 1 5
top -H -p $(pgrep -d',' -f ora_)

Oracle:

SELECT sql_id, cpu_time/1000000 cpu_sec
FROM v$sql
ORDER BY cpu_time DESC FETCH FIRST 5 ROWS ONLY;

15. CPU vs IO – Golden Rule

ObservationRoot Cause
High %us, low %waCPU bottleneck
High load, CPUs idleIO bottleneck
High DB CPU in AWRSQL tuning required
RM waitsResource controls

Monday, April 27, 2026

Production Server/Database/Application troubleshooting Runbook for Issue like CPU, Memory, I/o , Kernel

 

0️⃣ Runbook Objectives

This runbook helps you:

✅ Quickly identify CPU, I/O, memory, or process issues
✅ Correlate OS metrics with database / application symptoms
✅ Avoid random commands during incidents
✅ Reach root cause, not just symptom relief


1️⃣ Incident Intake (ALWAYS FIRST)

Before touching the system, collect:

QuestionWhy
What is impacted? (DB, app, batch, login)Scope
Since when?Time correlation
All users or subset?Severity
Any recent changes?Deploy / patch
Error messages?Symptom confirmation

📌 Do not skip this step. It saves 30–40% time later.


2️⃣ High‑Level System Health Snapshot (30 seconds)

2.1 Uptime & Load

uptime

Focus on:

  • Load average vs CPU cores
  • Sudden spike timeframe

✅ Load > CPU count → investigate
✅ Load + low CPU → likely I/O wait


2.2 Disk Space (quick sanity)

df -h

🚨 Any filesystem ≥ 90% → fix immediately


3️⃣ Real‑Time CPU & Memory View (top)

top

3.1 CPU Line Interpretation

%Cpu(s): 12 us, 5 sy, 0 ni, 60 id, 23 wa
MetricMeaning
usApp/SQL CPU
syKernel CPU
idIdle
waI/O wait

🚨 wa > 15% → jump to I/O section


3.2 Process Area

  • Sort by CPU: Shift + P
  • Sort by MEM: Shift + M

Note:

  • PID
  • %CPU
  • %MEM
  • COMMAND

👉 Take PID(s) for next step.


4️⃣ Exact Process Diagnosis (ps)

✅ Master Command (use this by default)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

What to Look For

SymptomIndicator
High CPU%cpu, R state
Hung processD state
Long runningHigh etime
I/O waitwchan = io_schedule
ZombieZ

🔴 Critical: Find Blocked (I/O) Processes

ps -eo pid,stat,wchan,etime,comm | awk '$2 ~ /D/'

If you see:

  • ora_*
  • java
  • mysqld

➡️ Storage / filesystem issue almost certain


5️⃣ Kernel Stress View (vmstat)

vmstat 1 5

Key Columns

ColumnMeaning
rRunnable processes
bBlocked (I/O wait)
si/soSwap usage
waI/O wait

Interpretation Rules

ObservationMeaning
b > 0Processes stuck in I/O
wa highDisk latency
si/so > 0Memory pressure
r > CPU coresCPU contention

6️⃣ Storage Diagnosis (iostat)

iostat -xz 1 5

Critical Metrics

MetricBad Threshold
%util> 80%
await> 20 ms
await >> svctmQueueing issue

Conclusions

PatternRoot Cause
High await + D stateStorage latency
High utilDisk saturation
NFS disks slowNetwork / mount issue

🚨 DBWR / LGWR in D state = immediate escalation


7️⃣ Memory Focused Check

ps -eo pid,stat,rss,vsz,%mem,comm --sort=-rss | head

If:

  • RSS very high
  • Swap active

➡️ Tune memory / restart leaking service


8️⃣ Oracle / Database‑Specific Quick Checks

8.1 Oracle Processes

ps -eo pid,stat,%cpu,wchan,etime,comm | grep ora_

8.2 Dangerous Signs

ProcessIssue
ora_dbw* in DDatafile I/O
ora_lgwr in DRedo disk
Many ora_w* in DParallel I/O stall

➡️ Do NOT bounce DB blindly


9️⃣ Decision Matrix (Very Important)

ObservationAction
High CPU, no DTune app/SQL
High wa + DStorage escalation
Z processesRestart parent
Swap activeAdd memory / reduce usage
Disk fullCleanup immediately

🔔 Escalation Triggers

Escalate to Storage / Infra when:

  • D state persists > 5 minutes
  • await > 50 ms
  • DB background processes blocked

Escalate to App/DB Team when:

  • CPU us > 80%
  • Single PID dominating CPU
  • SQL identified as hot spot

✅ One‑Glance Incident Command Set

uptime
top
ps -eo pid,stat,%cpu,%mem,wchan,comm --sort=-%cpu
vmstat 1 5
iostat -xz 1 5

🧠 Golden Rule (Remember This)

High load does NOT always mean CPU problem.
D state + wa = storage until proven otherwise.

Step by step troubleshoot performance issue Linux , Oracle - CPU , Memory, I/O

 

🔍 Correlation of ps, top, iostat, and vmstat

Mental Model (Very Important)

ToolAnswers the Question
psWhich exact process is responsible?
topIs the problem CPU or memory pressure right now?
iostatIs storage slow or saturated?
vmstatIs the kernel under memory / run‑queue / I/O stress?

👉 Never use only one tool.
Real root cause comes from correlating outputs.


1️⃣ top — Real‑Time CPU & Memory Pressure

Best usage

top

(or top -o %CPU on newer systems)

What to focus on (top header)

%Cpu(s): 85.2 us, 10.1 sy,  0.0 ni,  2.0 id,  2.5 wa
FieldMeaning
usUser CPU (app / DB code)
syKernel CPU
idIdle CPU
waI/O wait (very important)

Interpretation

  • ✅ High us → application or SQL CPU
  • ✅ High sy → kernel, system calls, networking
  • 🚨 High wastorage problem, not CPU

Process section (top half)

PID   USER  %CPU  %MEM  COMMAND
2456  oracle 180.2 12.3 ora_dbw0

Now you jump to ps.


2️⃣ ps — Identify the Exact Culprit

Correlate with:

ps -eo pid,ppid,stat,%cpu,%mem,etime,wchan,comm | sort -k5 -nr | head

Key correlation

top showsps confirms
High CPU PID%cpu, etime
Hung processSTAT = D
Storage waitwchan = io_schedule

🚨 Example:

2456  D  io_schedule  ora_dbw0

👉 This tells you:
DBWR is blocked on disk I/O

Now you must check storage.


3️⃣ iostat — Storage Bottleneck Detection

Best command

iostat -xz 1 5

Critical columns

ColumnMeaning
%utilDisk busy time
awaitAvg I/O latency (ms)
svctmDisk service time
r/s w/sRead/write rate

Interpretation Rules (Golden)

SymptomMeaning
%util > 80%Disk saturated
await > 20 msStorage slow
await >> svctmQueueing problem
High writes + DBWR stuckRedo / data disk issue

🚨 Example:

sda  %util=99.8  await=120ms

✅ Confirms ps + topstorage root cause


4️⃣ vmstat — Kernel Stress & Memory I/O

Best command

vmstat 1 5

Key columns

r  b   swpd   free   buff  cache  si so   bi bo   in cs us sy id wa

Important fields explained

ColumnMeaning
rRun queue (CPU demand)
bBlocked processes (I/O)
si/soSwap in/out
bi/boBlock I/O
waI/O wait (kernel view)

Correlation logic

vmstat showsCombined meaning
b > 0Processes stuck in I/O
wa highCPU waiting for disk
r > CPU coresCPU contention
si/so > 0Memory pressure

🚨 Example:

r=1 b=6 wa=40

👉 Matches:

  • ps → many D
  • top → high I/O wait
  • iostat → high disk latency

🎯 Root cause confirmed: storage


5️⃣ End‑to‑End Correlation Scenarios


✅ Scenario A: High Load Average

Observations

  • uptime → load = 20
  • top → CPU idle
  • vmstatb=10, wa=35
  • ps → many D state
  • iostat → high await

Conclusion
Load is from I/O wait, not CPU
👉 Storage team issue


✅ Scenario B: CPU Spike

Observations

  • top%us=90
  • vmstatr > CPU cores
  • ps → process in R state
  • iostat → normal

Conclusion
Pure CPU problem
👉 Tune SQL / app / threads


✅ Scenario C: Hung Oracle Instance

Observations

  • psora_dbw0, ora_lgwr in D
  • vmstatb > 5
  • iostat → redo disk latency
  • top → high wa

Conclusion
Redo or data disk I/O stall
👉 SAN / ASM / NFS issue


6️⃣ Golden Troubleshooting Workflow (Memorize This)


symptom →
top →
ps →
vmstat →
iostat →
root cause

One‑liner sequence

top
ps -eo pid,stat,%cpu,wchan,comm | grep D
vmstat 1 5
iostat -xz 1 5


✅ Final Cheat Sheet

ToolBest for
topLive CPU/memory
psExact process & state
vmstatKernel & wait queues
iostatDisk latency & saturation

🎯 Never trust a single tool
Real diagnosis = correlation

Troubleshooting CPU - I/O : Best linux ps Command Arguments for Troubleshooting - One‑Command “Master View” (Highly Recommended)

 

One‑Command “Master View” (Highly Recommended)

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,lstart,wchan,comm --sort=-%cpu

🔍 What it shows (and why it matters)

FieldWhy it’s important
userWho owns the process
pidProcess ID
ppidParent process (helps detect orphans)
statProcess state (R/S/D/Z/T)
%cpuCPU consumption
%memMemory usage
etimeHow long the process has been running
lstartExact start time
wchanKernel wait channel (I/O diagnosis)
commExecutable name
--sort=-%cpuTop CPU consumers first

This is your best single snapshot for general troubleshooting


2️⃣ CPU Troubleshooting (High CPU / Run Queues)

ps -eo pid,ppid,stat,psr,pri,ni,%cpu,time,comm --sort=-%cpu | head -20

Key columns

ColumnMeaning
psrWhich CPU core it’s running on
priKernel priority
niNice value
timeTotal CPU time consumed

✅ Use when:

  • Load average is high
  • CPU is saturated
  • Performance complaints

3️⃣ I/O Troubleshooting (MOST CRITICAL)

🔥 Identify blocked processes (D state)

ps -eo pid,stat,wchan,%cpu,etime,comm | awk '$2 ~ /D/'

Why this is powerful

FieldPurpose
DUninterruptible sleep (I/O wait)
wchanWhat kernel function it’s stuck on
etimeHow long it has been blocked

Common wchan values and meaning

wchanMeaning
io_scheduleDisk I/O wait
wait_on_page_bitMemory/disk interaction
nfs_waitNFS hang
blk_mq_get_tagStorage queue congestion

🚨 If Oracle or DB processes appear here → storage issue almost guaranteed


4️⃣ Memory & Leak Detection

ps -eo pid,ppid,stat,rss,vsz,%mem,comm --sort=-rss | head -20

Key fields

FieldMeaning
rssReal memory in KB
vszVirtual memory
%memRAM usage

✅ Use when:

  • System is swapping
  • OOM killer events
  • Slow performance despite low CPU

5️⃣ Full Command, Arguments & Environment

ps -eo pid,stat,%cpu,%mem,cmd --sort=-%cpu

Why this matters:

  • cmd shows complete arguments
  • Crucial for:
    • Java tuning
    • Oracle startup flags
    • Application misconfiguration

6️⃣ Zombie Process Detection

ps -eo pid,ppid,stat,etime,comm | awk '$3 ~ /Z/'

Why care?

  • Zombies indicate parent process bug
  • Can exhaust PID space
  • Need parent restart (not kill)

7️⃣ Oracle / Database‑Focused View (DBA Favorite)

ps -eo pid,stat,%cpu,%mem,etime,wchan,comm | grep ora_

✅ Detects:

  • DBWR / LGWR I/O stalls
  • Parallel worker hangs
  • Backup‑related blockages

8️⃣ Thread‑Level Analysis (Advanced CPU Debugging)

ps -eLo pid,lwp,stat,%cpu,psr,comm --sort=-%cpu

Use when:

  • Java or Oracle shows high CPU
  • Need hot thread detection
  • Correlating with perf / jstack

9️⃣ Parent‑Child Relationship Analysis

ps -eo pid,ppid,stat,etime,comm --forest

✅ Great for:

  • Detecting fork storms
  • Tracing hung parent processes
  • Understanding service trees

10️⃣ Minimal “Health Check” Command (Quick & Safe)

ps -eo pid,stat,%cpu,%mem,etime,comm --sort=-%cpu | head -15

✅ Safe for production
✅ Quick triage
✅ Covers 80% of issues


🔑 What to Focus On (Cheat Sheet)

SymptomLook at
High load%cpu, R state
Stuck systemD state, wchan
Slowness%cpu, %mem, etime
Hung DBora_* + D
Memory issuesrss, %mem
Defunct processesZ

✅ Final Recommendation (What to Remember)

If you remember only ONE command, make it this:

ps -eo user,pid,ppid,stat,%cpu,%mem,etime,wchan,comm --sort=-%cpu

This single command gives: ✅ CPU
✅ I/O
✅ Memory
✅ State
✅ Ownership
✅ Runtime
✅ Kernel wait reason

Troubleshoot storage I/O performance issue -- Linux , Oracle


Explain end‑to‑end explanation of the command :

    ps -eo pid,stat,comm | grep D


This is a process inspection command used heavily by Linux, Unix, and database administrators for system and performance troubleshooting.


1️⃣ What is ps?

ps stands for Process Status.
It reports information about currently running processes on a Linux system.

Think of it as a snapshot of processes at the moment you run the command.

📌 Unlike top or htop, ps:

  • Is not interactive
  • Shows a point‑in‑time view
  • Is ideal for scripting and diagnostics

2️⃣ Command Breakdown

ps -eo pid,stat,comm

Let’s split it into parts:


🔹 ps

Invokes the process status utility.


🔹 -e option (select processes)

-e

Means:
Show all processes running on the system

Without -e, ps would only show processes tied to the current terminal (TTY).

Equivalent options:

ps -e
ps -A

All mean “every process”.


🔹 -o option (custom output format)

-o pid,stat,comm

Means:
Choose which columns to display

Instead of default columns, you explicitly request:

FieldMeaning
pidProcess ID
statProcess state
commCommand name (executable)

This is extremely useful for focused troubleshooting.


3️⃣ Output Columns (Explained in Depth)

🔸 PID — Process ID

Example:

24567
  • Unique identifier for a process
  • Assigned by the Linux kernel
  • Required to manage or inspect processes

Used in commands like:

kill 24567
strace -p 24567
cat /proc/24567/status

📌 Notes:

  • PID 1 is always the init/systemd process
  • PIDs are reused after processes exit

🔸 STAT — Process State (most important field)

The STAT column shows:

  1. Main execution state
  2. Additional flags

Primary states

CodeMeaning
RRunning or runnable (on CPU or ready)
SSleeping (waiting for event)
DUninterruptible sleep (I/O wait)
TStopped (signal or debugger)
ZZombie (dead, not cleaned up)
IIdle kernel thread (newer kernels)

👉 The first letter is the core state.


Modifier flags (can appear after the main letter)

FlagMeaning
sSession leader
lMultithreaded (uses threads)
+Foreground process
<High priority
NLow priority

STAT examples explained

Ss
  • S → sleeping
  • s → session leader
    ✅ Normal background service
Ssl+
  • Sleeping
  • Session leader
  • Multithreaded
  • Foreground task
    ✅ Common for DB or Java processes
D

🚨 Critical

  • Process waiting on kernel I/O
  • Cannot be killed (even kill -9)
  • Usually due to:
    • Disk I/O
    • NFS
    • SAN / ASM
    • Kernel storage issue

Examples

Ss

→ Sleeping, session leader

D

→ Blocked on I/O (disk, NFS, storage). Very important state

Ssl+

→ Sleeping, session leader, multithreaded, foreground job

📌 Critical note
If a process is in D state, it:

  • Cannot be killed (kill -9 won’t work)
  • Is usually waiting on disk, SAN, ASM, or NFS
  • Indicates storage or kernel-level issues


🔸 COMMAND — Executable Name

Example:

oracle
sshd
ora_w00l
  • Shows only the binary name
  • Does NOT include command‑line arguments

For full command line:

ps -eo pid,stat,cmd

📌 Oracle example:

ora_w00l

Means:

  • ora_ → Oracle process
  • w00l → Parallel/worker process

4️⃣ Sample Output and Interpretation

PID STAT COMMAND
1 Ss systemd
1023 Ssl oracle
2045 D ora_dbw0

How to read this:

  • systemd → sleeping session leader (normal)
  • oracle → sleeping, multithreaded (normal)
  • ora_dbw0D state (problem)
    → Indicates disk or ASM issue

5️⃣ Why this command is widely used

✅ Lightweight and fast

  • No interactive overhead
  • Safe on production systems

✅ Perfect for troubleshooting

  • Detects:
    • Hung processes
    • Storage stalls
    • Zombie accumulation
    • Oracle background issues

✅ Script‑friendly

Used inside:

  • Shell scripts
  • Health checks
  • Cron jobs

6️⃣ Common Enhancements

Show only blocked (D) processes

ps -eo pid,stat,comm | awk '$2 ~ /D/'


Sort by process state

ps -eo pid,stat,comm --sort=stat

Add user and CPU usage

ps -eo user,pid,stat,%cpu,%mem,comm


7️⃣ Practical Use Case (Oracle / DB servers)

DBAs frequently use:

ps -eo pid,stat,comm | grep ora_

To detect:

  • Stuck background workers
  • DBWR/LGWR waiting on disk
  • Parallel query stalls

If many ora_* processes show D: 🚨 Storage team must be involved immediately


✅ Final Summary

ComponentPurpose
psShow process snapshot
-eInclude all processes
-oCustomize output
pidProcess identifier
statExecution + wait state
commExecutable name

🎯 Key troubleshooting signal

  • R, S → Normal
  • DI/O or kernel problem
  • Z → Parent process issue


  • PID → Unique process identifier
  • STAT → Current state + extra flags (critical for troubleshooting)
  • COMMAND → Executable name

🎯 For troubleshooting:

  • R / S → Normal
  • DInvestigate immediately
  • Z → Parent process issue

ACE Associate

  ACE Associate