Tuesday, May 26, 2026

Is CPU issue ? troubleshooting workflow for oracle database performance issue with automation

✅ ✅ 1. CPU Troubleshooting Framework (Like iostat for CPU)

Use:

vmstat 2 5

or

top


📊 ✅ 2. CPU Metrics Explained (vmstat / top)

us sy id wa st
MetricMeaning
usUser CPU (app/DB queries)
sySystem CPU (kernel/syscalls)
idIdle CPU
waIO wait
stSteal (VM contention)

✅ ✅ 3. CPU Baseline Table (Production Standard)

📊 CPU Utilization

MetricGood ✅Warning ⚠️Critical 🚨
idle (id)> 40%20–40%< 20%
user (us)< 50%50–70%> 70%
system (sy)< 20%20–30%> 30%
iowait (wa)< 5%5–15%> 20%
steal (st)01–5> 5

📊 Run Queue (Very Important)

r (run queue)
MetricGood ✅Warning ⚠️Critical 🚨
r vs CPU cores≤ cores1.5x cores2x+ cores

✅ ✅ 4. CPU Health Decision Matrix

ConditionMeaning
High usCPU-heavy queries
High sykernel/system overhead
High wanot CPU → disk issue
High rCPU contention
High stVM issue

✅ ✅ 5. CPU Scoring Model (Like Disk)

🎯 Formula

Score = 100 
- user penalty 
- system penalty 
- run queue penalty 
- iowait penalty

📊 Penalties

✅ User CPU

  • <50 → 0
  • 50–70 → -15
  • 70 → -30


✅ System CPU

  • <20 → 0
  • 20–30 → -15
  • 30 → -30


✅ Run Queue

  • ≤ cores → 0
  • 1–2x cores → -20
  • 2x cores → -40


✅ IO Wait

  • <5 → 0
  • 5–15 → -20
  • 15 → -40


✅ Final Status

ScoreStatus
80–100✅ Healthy
60–80⚠️ Warning
40–60🔶 Degraded
<40🚨 Critical

✅ ✅ 6. CPU Monitoring Script (PRODUCTION READY)

📜 cpu_health.sh


#!/bin/bash

echo "===== CPU Health Check ====="
date

vmstat 2 3 | tail -1 | awk '
{
r=$1
us=$13
sy=$14
id=$15
wa=$16
st=$17

score=100

# user penalty
if (us > 70) score -= 30
else if (us > 50) score -= 15

# system penalty
if (sy > 30) score -= 30
else if (sy > 20) score -= 15

# iowait penalty
if (wa > 15) score -= 40
else if (wa > 5) score -= 20

# status
status="HEALTHY"
if (score < 40) status="CRITICAL"
else if (score < 60) status="DEGRADED"
else if (score < 80) status="WARNING"

printf "RunQ=%d User=%d%% Sys=%d%% Idle=%d%% IOwait=%d%% Status=%s Score=%d\n",r,us,sy,id,wa,status,score
}
'

Identify CPU Bottleneck (Root Cause)


✅ Step 1: Check top processes

top -o %CPU

✅ Step 2: Per process CPU

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head

✅ Step 3: Thread-level (VERY IMPORTANT for DB)

top -H

✅ Step 4: Per process breakdown

pidstat -u 2

✅ ✅ 8. CPU Issue Patterns (DBA Mapping)


🔴 Pattern 1: High USER CPU

us > 70%

👉 Cause:

  • Complex SQL
  • Full scans
  • bad execution plans

✅ DB Mapping:

SELECT sql_id, cpu_time
FROM v$sql
ORDER BY cpu_time DESC FETCH FIRST 10 ROWS;


🔴 Pattern 2: High SYSTEM CPU

sy > 30%

👉 Cause:

  • excessive context switching
  • kernel overhead
  • I/O interrupts

✅ Check:

vmstat 1

Look at:

cs (context switches)
in (interrupts)

🔴 Pattern 3: High RUN QUEUE

r >> CPU cores

👉 Cause:

  • CPU contention
  • too many concurrent queries

🔴 Pattern 4: High IO WAIT

wa > 20%

👉 Cause:

  • NOT CPU problem
    👉 disk bottleneck (go back to iostat)

🔴 Pattern 5: High STEAL (VM)

st > 5%

👉 Cause:

  • host contention (cloud/VM issue)

✅ ✅ 9. CPU Alert Script (Automation)

📜 cpu_alert.sh

#!/bin/bash

HOST=$(hostname)

ALERT=$(vmstat 2 2 | tail -1 | awk '
{
us=$13; sy=$14; wa=$16; st=$17

if (us>70 || sy>30 || wa>15 || st>5)
print "CPU Issue: user="us"% sys="sy"% wa="wa"% st="st"%"
}
')

if [ ! -z "$ALERT" ]; then
echo "$ALERT" | mail -s "CPU Alert on $HOST" your_email@company.com
fi
``

✅ ✅ 10. DBA Troubleshooting Flow


🔥 When CPU alert triggers:

1. Check:

vmstat 1 5

2. Identify pattern:

  • High us → SQL issue
  • High sy → OS/kernel
  • High wa → disk issue

3. Identify process:

top

4. Map to DB:

SELECT sql_id, cpu_time FROM v$sql ORDER BY cpu_time DESC;

5. Fix:

  • Tune query ✅
  • Index ✅
  • Reduce parallelism ✅
  • Limit sessions ✅

✅ ✅ 11. Golden Rules (CPU)

✅ Healthy

idle > 40%
wa < 5%
r <= cores

🚨 Critical CPU

idle < 10%
us > 70%
r > 2x cores

⚠️ Trick Case (VERY IMPORTANT)

CPU looks high BUT wa is high
👉 NOT CPU issue → DISK issue

🎯 Final Outcome

Now you have:

✅ CPU baseline table
✅ Health scoring model
✅ Monitoring script
✅ Alerting system
✅ DB correlation
✅ Troubleshooting flow

Is storage issue ? troubleshooting workflow for oracle database performance issue with automation

 

✅ 1. AUTO EMAIL ALERT SCRIPT (Real Production Use)

This script:

  • Detects disk bottlenecks
  • Sends email alert automatically

📜 Script: disk_alert_mail.sh

#!/bin/bash

HOST=$(hostname)
DATE=$(date)

ALERT=$(iostat -xm 2 2 | awk '
$1 ~ /^(sd|dm)/ {
util=$NF
await=$(NF-3)
queue=$(NF-4)

if (util>95 || await>50 || queue>10) {
printf "%-8s util=%s%% await=%sms queue=%s\n", $1, util, await, queue
}
}
')

if [ ! -z "$ALERT" ]; then
echo "Disk Alert on $HOST at $DATE

Critical disks detected:

$ALERT
" | mail -s "🚨 Disk I/O Alert on $HOST" your_email@company.com
fi


✅ Setup Cron (every 5 min)

crontab -e

Add:

*/5 * * * * /path/disk_alert_mail.sh


✅ ✅ 2. AUTO MAP dm-* → MOUNT → DB FILE (Game-Changer)

This solves your biggest pain: 👉 “Which DB object is causing this?”


📜 Script: disk_to_db_map.sh

#!/bin/bash

echo "==== Disk → Mount → Files ===="

for dev in $(lsblk -dn -o NAME | grep dm-); do
echo ""
echo "Device: $dev"

mount=$(lsblk -n -o MOUNTPOINT /dev/$dev 2>/dev/null)

if [ ! -z "$mount" ]; then
echo "Mount: $mount"
echo "Files:"
lsof +D $mount 2>/dev/null | head -10
else
echo "No mount (maybe ASM/LVM raw)"
fi
done


✅ Output Example

Device: dm-xx
Mount: /u02
Files:
oracle  db  datafile1.dbf
oracle  db  users01.dbf

✅ ✅ 3. CORRELATE iostat + ORACLE (Find exact SQL)


✅ Step 1: Identify heavy disk (from iostat)

Example:

dm-69 → /u02 → USERS tablespace

✅ Step 2: Find SQL causing reads

SELECT sql_id,
executions,
disk_reads,
buffer_gets,
ROUND(disk_reads/executions) avg_reads
FROM v$sql
ORDER BY disk_reads DESC
FETCH FIRST 10 ROWS ONLY;


✅ Step 3: Active session (real-time)

SELECT sid, serial#, sql_id, event
FROM v$session
WHERE wait_class='User I/O';


✅ Step 4: Wait event mapping

Wait EventMeaning
db file scattered readFull table scan 🚨
db file sequential readIndex lookup
direct path readLarge scan
log file syncCommit slow

✅ ✅ 4. ADVANCED ALL-IN-ONE SCRIPT (MOST POWERFUL)

👉 Combines:

  • Disk health
  • Alert detection
  • Mapping

📜 ultimate_disk_analyzer.sh

#!/bin/bash

echo "===== Ultimate Disk Analyzer ====="
date
echo ""

iostat -xm 2 2 | awk '
/Device/ {
printf "%-8s %-6s %-6s %-6s %-10s\n","Dev","Util","Await","Queue","Status"
next
}

$1 ~ /^(sd|dm)/ {
util=$NF
await=$(NF-3)
queue=$(NF-4)

status="OK"

if (util>95 || await>50 || queue>10)
status="CRITICAL"
else if (util>80 || await>20 || queue>5)
status="WARNING"

printf "%-8s %-6.1f %-6.1f %-6.1f %-10s\n",$1,util,await,queue,status
}'


✅ ✅ 5. Real DBA Troubleshooting Flow (What YOU should do)


🔥 Step-by-step when alert triggers:

1. Run:

iostat -xm 2 5

2. Identify:

  • High %util
  • High await
  • High queue

3. Map:

lsblk
df -h

4. Check DB:

SELECT * FROM v$session WHERE event LIKE '%read%';


5. Fix:

  • Query tuning ✅
  • Indexing ✅
  • Move data ✅
  • Balance disks ✅

✅ ✅ 6. PRO TIPS (REAL WORLD)

🔴 When to PANIC

  • await > 100 ms
  • queue > 20
  • util = 100% sustained

🟢 When it's OK

  • util = 100% BUT await < 5
    👉 (fast SSD under load)

🧠 Golden formula:

Problem = High Util + High Await + High Queue

🎯 Final Outcome

You now have:

✅ Auto alert system
✅ Disk → DB mapping
✅ SQL root cause detection
✅ Health scoring
✅ Full troubleshooting workflow

Approach to correlate AWR + iostat to deep drive and troubleshoot oracle database performance issue

 

✅ 1. Objective

👉 Correlate:

  • OS layer → iostat (disk bottleneck)
  • DB layer → AWR (SQL causing I/O)

👉 Goal: Identify which SQL caused the spike on dm- disks*


🧠 ✅ 2. Correlation Logic (Core Concept)

Disk spike (iostat)
      ↓
Time window
      ↓
AWR snapshot
      ↓
Top IO SQL
      ↓
Execution plan
      ↓
Root cause

⏱️ ✅ 3. STEP 1: Identify Spike Time from iostat

Example:

dm-69 → 100% util
await → 97 ms

👉 Note:

  • Exact time window (e.g.)
10:02 AM – 10:10 AM

📊 ✅ 4. STEP 2: Find Matching AWR Snapshot

SELECT snap_id, begin_interval_time, end_interval_time
FROM dba_hist_snapshot
ORDER BY snap_id DESC;

👉 Pick snapshots covering spike:

Snap 101 → 10:00
Snap 102 → 10:10

🔥 ✅ 5. STEP 3: Identify Top SQL by I/O

✅ Query 1: Top Disk Read SQL

SELECT *
FROM (
SELECT
sql_id,
executions_delta,
disk_reads_delta,
buffer_gets_delta,
elapsed_time_delta/1000000 elapsed_sec,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
)
WHERE ROWNUM <= 10;


✅ Query 2: Top Read Throughput

SELECT *
FROM (
SELECT
sql_id,
disk_reads_delta,
buffer_gets_delta,
rows_processed_delta,
elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
)
WHERE ROWNUM <= 10;


✅ Query 3: Full Table Scan Candidates

SELECT sql_id,
disk_reads_delta,
executions_delta,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
AND disk_reads_delta > 100000
ORDER BY disk_reads_delta DESC;


🔍 ✅ 6. STEP 4: Identify Wait Events

SELECT event, total_waits_delta, time_waited_delta/1000 time_waited_ms
FROM dba_hist_system_event
WHERE snap_id BETWEEN :snap1 AND :snap2
AND event LIKE 'db file%'
ORDER BY time_waited_ms DESC;


✅ Interpretation:

Wait EventMeaning
db file scattered readFull table scan 🚨
db file sequential readIndex access
direct path readBig scans (DW)
db file parallel readParallel scans

📈 ✅ 7. STEP 5: Map SQL → Execution Plan

SELECT *
FROM dba_hist_sql_plan
WHERE sql_id = '&sql_id'
ORDER BY id;

✅ Look for:

  • TABLE ACCESS FULL 🚨
  • INDEX RANGE SCAN
  • FULL OUTER JOIN
  • Parallel operations

🔗 ✅ 8. STEP 6: Correlate with Disk Pattern


🔴 Case 1 (Your Earlier Example)

iostat:
avgrq-sz ~ 1024 KB
high rMB/s
high await

👉 AWR shows:

db file scattered read
direct path read

✅ Conclusion: Large full table scans causing disk saturation


🟢 Case 2

small avgrq-sz
high IOPS
low await

👉 AWR:

db file sequential read

✅ Conclusion: Healthy OLTP workload


🧪 ✅ 9. Advanced Correlation Query (BEST ONE)


SELECT
s.sql_id,
t.sql_text,
s.executions_delta,
s.disk_reads_delta,
ROUND(s.disk_reads_delta/DECODE(s.executions_delta,0,1,s.executions_delta)) reads_per_exec,
s.elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat s,
dba_hist_sqltext t
WHERE s.sql_id = t.sql_id
AND s.snap_id BETWEEN :snap1 AND :snap2
ORDER BY s.disk_reads_delta DESC
FETCH FIRST 10 ROWS ONLY;


🎯 ✅ 10. Root Cause Identification Matrix

iostat PatternAWR FindingRoot Cause
Large IO + high latencyscattered readFull table scan
High read MB/sdirect path readLarge query
High write latencylog file syncCommit bottleneck
High queuemany active sessionsConcurrency overload

🚨 ✅ 11. Real Example (Like Your Case)

iostat

%util = 100
await = 97 ms
avgqu-sz = 24
rMB/s = high

AWR

Event: db file scattered read
SQL_ID: abc123xyz
Disk Reads: very high
Plan: TABLE ACCESS FULL

✅ FINAL ROOT CAUSE

👉 A large SQL doing full table scan
👉 Saturating disk -> causing high latency


✅ ✅ 12. Fix Strategy

✅ SQL Level

  • Add indexes
  • Rewrite query
  • Avoid full scans

✅ DB Level

  • Increase buffer cache
  • Enable parallel limits

✅ Storage Level

  • Spread datafiles
  • Use faster disk tier

✅ ✅ 13. Ultimate One-Liner Workflow (Production)

iostat spike → note time
→ find AWR snapshot
→ find top IO SQL
→ check wait events
→ review execution plan
→ fix SQL

Storage Disk Performance Baseline Table to troubleshoot the performance issue

 

Disk Performance Baseline Table (iostat -xm)

📊 1. Latency (Most Important)

MetricGood ✅Warning ⚠️Critical 🚨Notes
await (ms)< 55 – 20> 50Total latency (queue + service)
r_await< 55 – 20> 50Read latency
w_await< 55 – 20> 50Write latency

📊 2. Disk Utilization

MetricGood ✅Warning ⚠️Critical 🚨Notes
%util< 70%70–90%> 90%High alone is OK if latency is low

📊 3. Queue Depth (Pressure Indicator)

MetricGood ✅Warning ⚠️Critical 🚨Notes
avgqu-sz< 11 – 5> 10Queue waiting to be served

📊 4. Service Time vs Wait Time

PatternInterpretation
await ≈ svctm✅ Healthy (no queueing)
await >> svctm🚨 Queue bottleneck

📊 5. Throughput (rMB/s, wMB/s)

For modern systems (SSD / SAN / NVMe)

MetricGood ✅Warning ⚠️Critical 🚨
Read throughput< 70% of max capacity70–90%> 90% sustained
Write throughputSame as aboveSameSame

👉 Absolute value depends on storage type:

  • HDD: ~100–200 MB/s
  • SSD: ~500 MB/s – 2 GB/s
  • NVMe: 2–5+ GB/s

📊 6. IOPS (r/s, w/s)

WorkloadTypical Healthy Range
OLTP (random IO)1K – 50K IOPS
DW / AnalyticsLower IOPS, higher throughput

👉 Key rule:

  • High IOPS + low latency = ✅ good
  • High IOPS + high latency = 🚨 bottleneck

📊 7. IO Size (avgrq-sz)

ValueMeaningHealth
< 32 KBRandom IO (OLTP)
64–256 KBMixed
~512 KB – 1 MBSequential scan⚠️ if causing latency

🎯 ✅ Quick Decision Matrix

ConditionVerdict
High %util + low await (<5ms)✅ Healthy
High %util + high await (>50ms)🚨 Bottleneck
High queue (>10)🚨 Overloaded
Low util + high await⚠️ Storage issue
Large IO + high latency⚠️ Scan / DW workload

📌 ✅ DBA-Focused Interpretation

PatternRoot Cause
High rMB/s + large avgrq-szFull table scans
High r/s + small IOIndex access
High w_awaitLog/write issue
High avgqu-szStorage saturation
High await everywhereStorage slow

🔥 ✅ Golden Rules (Use in Production)

✅ Healthy Disk

%util < 80
await < 10 ms
avgqu-sz < 3

⚠️ Warning Zone

%util > 80
await 10–30 ms
avgqu-sz 3–10

🚨 Critical Disk Bottleneck

%util > 90
await > 50 ms
avgqu-sz > 10
await >> svctm

✅ ✅ Example Applied to Your Earlier Data

DiskVerdict
dm-xx (await ~97 ms, util 100%)🚨 Critical
dm-xxx (queue 40, await 72 ms)🚨 Severe
dm-xxx (await 1.5 ms, util 99%)✅ Healthy

Save as disk_health_score.sh

#!/bin/bash

echo "===== Disk Health Score ====="
date
echo ""

iostat -xm 2 3 | awk '

function score(util, await, queue) {
    s = 100

    # Util penalty
    if (util > 90) s -= 25
    else if (util > 70) s -= 10

    # Await penalty
    if (await > 50) s -= 50
    else if (await > 20) s -= 30
    else if (await > 5) s -= 15

    # Queue penalty
    if (queue > 10) s -= 40
    else if (queue > 5) s -= 20
    else if (queue > 1) s -= 10

    if (s < 0) s = 0
    return s
}

function status(s) {
    if (s >= 80) return "HEALTHY"
    else if (s >= 60) return "WARNING"
    else if (s >= 40) return "DEGRADED"
    else return "CRITICAL"
}

/Device/ {
    printf "%-10s %-6s %-8s %-8s %-6s\n","Device","Util%","Await","Queue","Status"
    next
}

$1 ~ /^(sd|dm)/ {
    util = $NF
    await = $(NF-3)
    queue = $(NF-4)

    s = score(util, await, queue)
    st = status(s)

    printf "%-10s %-6.1f %-8.1f %-8.1f %-6s\n",$1,util,await,queue,st
}
'

chmod +x disk_health_score.sh
./disk_health_score.sh

Sample Output 

Device     Util%  Await    Queue    Status
dm-xx      100.0  97.2     24.3     CRITICAL
dm-xxx     99.9   72.4     40.5     CRITICAL
dm-xx      99.9   80.0     7.2      DEGRADED
dm-xxx     99.4   1.5      11.6     WARNING

Is CPU issue ? troubleshooting workflow for oracle database performance issue with automation

✅ ✅ 1. CPU Troubleshooting Framework (Like iostat for CPU) Use: vmstat 2 5 or top 📊 ✅ 2. CPU Metrics Explained (vmstat / top) us sy id wa s...