Tuesday, May 26, 2026

Approach to correlate AWR + iostat to deep dive and troubleshoot Oracle database performance issues

✅ 1. Objective

👉 Correlate:

OS layer → iostat (disk bottleneck)
DB layer → AWR (SQL causing I/O)

👉 Goal: Identify which SQL caused the spike on the dm-* disks

🧠 ✅ 2. Correlation Logic (Core Concept)

Disk spike (iostat)
      ↓
Time window
      ↓
AWR snapshot
      ↓
Top IO SQL
      ↓
Execution plan
      ↓
Root cause

⏱️ ✅ 3. STEP 1: Identify Spike Time from iostat

Example:

dm-69 → 100% util
await → 97 ms

👉 Note:

Exact time window (e.g.)

10:02 AM – 10:10 AM
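To pin down the window, it helps to log iostat with timestamps and filter the log afterwards. The following is a minimal sketch, not a definitive tool. It assumes sysstat's `-t` flag together with `S_TIME_FORMAT=ISO` (so each report starts with an ISO 8601 timestamp line) and the 14-column extended layout used in this post (await in column 10, %util last). The function name `find_spikes` and the log path are hypothetical.

```shell
#!/bin/bash
# find_spikes: hypothetical helper. Reads timestamped iostat output on stdin
# and prints "timestamp device util await" for every saturated sample.
# Assumptions: timestamp lines start with YYYY-MM-DDT (S_TIME_FORMAT=ISO),
# await is column 10 and %util is the last column.
# Args: max %util (default 90), max await in ms (default 50).
find_spikes() {
  awk -v max_util="${1:-90}" -v max_await="${2:-50}" '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ { ts = $1; next }
    $1 ~ /^(sd|dm)/ && $NF + 0 > max_util && $10 + 0 > max_await {
      print ts, $1, "util=" $NF, "await=" $10
    }
  '
}

# Usage (collect first, then filter):
#   S_TIME_FORMAT=ISO iostat -xmt 2 > /tmp/iostat.log
#   find_spikes 90 50 < /tmp/iostat.log
```

The first and last timestamps printed for a device give the spike window to carry into the AWR step.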

📊 ✅ 4. STEP 2: Find Matching AWR Snapshot

SELECT snap_id, begin_interval_time, end_interval_time
FROM dba_hist_snapshot
ORDER BY snap_id DESC;

👉 Pick snapshots covering spike:

Snap 101 → 10:00
Snap 102 → 10:10
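Picking the snapshots that cover the spike is an interval-overlap check. A rough sketch, assuming the snapshot list has been exported to text as `snap_id begin end` with times written as sortable strings such as `2026-05-26T10:00` (an assumed export format, not an AWR default). `snaps_covering` and the sample rows in the usage comment are hypothetical.

```shell
#!/bin/bash
# snaps_covering: hypothetical helper. stdin has one snapshot per line:
#   snap_id begin_interval_time end_interval_time
# Times must be sortable strings (for example 2026-05-26T10:00).
# Args: spike start, spike end (same format).
# Prints the snap_ids whose interval overlaps the spike window.
snaps_covering() {
  awk -v s="$1" -v e="$2" '
    # Two intervals overlap when each one starts before the other ends.
    # Appending "" forces string comparison.
    ($2 "") < (e "") && ($3 "") > (s "") { print $1 }
  '
}

# Usage with hypothetical rows:
#   printf '%s\n' \
#     '101 2026-05-26T09:50 2026-05-26T10:00' \
#     '102 2026-05-26T10:00 2026-05-26T10:10' |
#     snaps_covering 2026-05-26T10:02 2026-05-26T10:10
```

Each row's interval ends when that snap_id was taken, so a spike between 10:02 and 10:10 falls in the interval bounded by the 10:00 and 10:10 snapshots.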

🔥 ✅ 5. STEP 3: Identify Top SQL by I/O

✅ Query 1: Top Disk Read SQL

SELECT sql_id,
executions_delta,
disk_reads_delta,
buffer_gets_delta,
elapsed_time_delta/1000000 elapsed_sec,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
FETCH FIRST 10 ROWS ONLY;

✅ Query 2: Top Read Throughput

SELECT sql_id,
disk_reads_delta,
buffer_gets_delta,
rows_processed_delta,
elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
FETCH FIRST 10 ROWS ONLY;

✅ Query 3: Full Table Scan Candidates

SELECT sql_id,
disk_reads_delta,
executions_delta,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
AND disk_reads_delta > 100000
ORDER BY disk_reads_delta DESC;

🔍 ✅ 6. STEP 4: Identify Wait Events

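-- dba_hist_system_event stores cumulative counters, so subtract the begin snapshot from the end snapshot
SELECT e.event_name,
e.total_waits - b.total_waits AS waits_delta,
ROUND((e.time_waited_micro - b.time_waited_micro) / 1000) AS time_waited_ms
FROM dba_hist_system_event b
JOIN dba_hist_system_event e
ON e.dbid = b.dbid AND e.instance_number = b.instance_number AND e.event_id = b.event_id
WHERE b.snap_id = :snap1 AND e.snap_id = :snap2
AND (e.event_name LIKE 'db file%' OR e.event_name LIKE 'direct path read%')
ORDER BY time_waited_ms DESC;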

✅ Interpretation:

Wait Event	Meaning
db file scattered read	Full table scan 🚨
db file sequential read	Index access
direct path read	Big scans (DW)
db file parallel read	Batched reads of non-contiguous blocks (e.g. index prefetch), not parallel query

📈 ✅ 7. STEP 5: Map SQL → Execution Plan

SELECT plan_hash_value, id, operation, options, object_name, cost, cardinality
FROM dba_hist_sql_plan
WHERE sql_id = '&sql_id'
ORDER BY plan_hash_value, id;

✅ Look for:

TABLE ACCESS FULL 🚨
INDEX RANGE SCAN
FULL OUTER JOIN
Parallel operations

🔗 ✅ 8. STEP 6: Correlate with Disk Pattern

🔴 Case 1 (Your Earlier Example)

iostat:
avgrq-sz ~ 1024 KB
high rMB/s
high await

👉 AWR shows:

db file scattered read
direct path read

✅ Conclusion: Large full table scans causing disk saturation

🟢 Case 2

small avgrq-sz
high IOPS
low await

👉 AWR:

db file sequential read

✅ Conclusion: Healthy OLTP workload
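The two cases above can be sketched as a rough classifier. This is only an illustration: the function name `classify_io` is hypothetical, the request size is passed in KB, and the cut-offs (512 KB and above as "large", under 32 KB as "small", await above 50 ms as "high", under 5 ms as "low") are the ones used in this post rather than universal constants.

```shell
#!/bin/bash
# classify_io: hypothetical helper.
# Args: average request size in KB, await in ms.
# Thresholds follow this post and should be tuned per storage tier.
classify_io() {
  awk -v size_kb="$1" -v await_ms="$2" 'BEGIN {
    if (size_kb >= 512 && await_ms > 50) {
      print "large scans saturating disk: check scattered/direct path reads"
    } else if (size_kb < 32 && await_ms < 5) {
      print "small random I/O with low latency: healthy OLTP pattern"
    } else {
      print "inconclusive: check AWR wait events and top SQL"
    }
  }'
}

classify_io 1024 97   # Case 1 style input
classify_io 8 1.5     # Case 2 style input
```

Anything that falls between the two patterns is deliberately reported as inconclusive, because the disk pattern alone does not identify the SQL.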

🧪 ✅ 9. Advanced Correlation Query (BEST ONE)

SELECT s.sql_id,
t.sql_text,
s.executions_delta,
s.disk_reads_delta,
ROUND(s.disk_reads_delta/DECODE(s.executions_delta,0,1,s.executions_delta)) reads_per_exec,
s.elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat s,
dba_hist_sqltext t
WHERE s.sql_id = t.sql_id
AND s.dbid = t.dbid
AND s.snap_id BETWEEN :snap1 AND :snap2
ORDER BY s.disk_reads_delta DESC
FETCH FIRST 10 ROWS ONLY;

🎯 ✅ 10. Root Cause Identification Matrix

iostat Pattern	AWR Finding	Root Cause
Large IO + high latency	scattered read	Full table scan
High read MB/s	direct path read	Large query
High write latency	log file sync	Commit bottleneck
High queue	many active sessions	Concurrency overload

🚨 ✅ 11. Real Example (Like Your Case)

iostat

%util = 100
await = 97 ms
avgqu-sz = 24
rMB/s = high

AWR

Event: db file scattered read
SQL_ID: abc123xyz
Disk Reads: very high
Plan: TABLE ACCESS FULL

✅ FINAL ROOT CAUSE

👉 A large SQL doing full table scan
👉 Saturating disk -> causing high latency

✅ ✅ 12. Fix Strategy

✅ SQL Level

Add indexes
Rewrite query
Avoid full scans

✅ DB Level

Increase buffer cache
Limit parallel execution (cap the degree of parallelism)

✅ Storage Level

Spread datafiles
Use faster disk tier

✅ ✅ 13. Ultimate One-Liner Workflow (Production)

iostat spike → note time
→ find AWR snapshot
→ find top IO SQL
→ check wait events
→ review execution plan
→ fix SQL

Storage disk performance baseline table for troubleshooting performance issues

✅ Disk Performance Baseline Table (iostat -xm)

📊 1. Latency (Most Important)

Metric	Good ✅	Warning ⚠️	Critical 🚨	Notes
`await` (ms)	< 5	5 – 50	> 50	Total latency (queue + service)
`r_await` (ms)	< 5	5 – 50	> 50	Read latency
`w_await` (ms)	< 5	5 – 50	> 50	Write latency

📊 2. Disk Utilization

Metric	Good ✅	Warning ⚠️	Critical 🚨	Notes
`%util`	< 70%	70–90%	> 90%	High alone is OK if latency is low

📊 3. Queue Depth (Pressure Indicator)

Metric	Good ✅	Warning ⚠️	Critical 🚨	Notes
`avgqu-sz`	< 1	1 – 10	> 10	Queue waiting to be served

📊 4. Service Time vs Wait Time

Pattern	Interpretation
`await ≈ svctm`	✅ Healthy (no queueing)
`await >> svctm`	🚨 Queue bottleneck

📊 5. Throughput (rMB/s, wMB/s)

For modern systems (SSD / SAN / NVMe)

Metric	Good ✅	Warning ⚠️	Critical 🚨
Read throughput	< 70% of max capacity	70–90%	> 90% sustained
Write throughput	Same as above	Same	Same

👉 Absolute value depends on storage type:

HDD: ~100–200 MB/s
SSD: ~500 MB/s – 2 GB/s
NVMe: 2–5+ GB/s

📊 6. IOPS (r/s, w/s)

Workload	Typical Healthy Range
OLTP (random IO)	1K – 50K IOPS
DW / Analytics	Lower IOPS, higher throughput

👉 Key rule:

High IOPS + low latency = ✅ good
High IOPS + high latency = 🚨 bottleneck

📊 7. IO Size (avgrq-sz)

Value	Meaning	Health
< 32 KB	Random IO (OLTP)	✅
64–256 KB	Mixed	✅
~512 KB – 1 MB	Sequential scan	⚠️ if causing latency

🎯 ✅ Quick Decision Matrix

Condition	Verdict
High %util + low await (<5ms)	✅ Healthy
High %util + high await (>50ms)	🚨 Bottleneck
High queue (>10)	🚨 Overloaded
Low util + high await	⚠️ Storage issue
Large IO + high latency	⚠️ Scan / DW workload
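The first four rows of the matrix can be written as ordered rules. A minimal sketch under stated assumptions: `disk_verdict` is a hypothetical name, "high %util" is taken as above 90, "low util" as below 70, and the order in which the rules are checked is a choice made here, not something the matrix prescribes.

```shell
#!/bin/bash
# disk_verdict: hypothetical helper.
# Args: %util, await in ms, average queue size.
# Rules are evaluated top to bottom; the first match wins.
disk_verdict() {
  awk -v util="$1" -v await_ms="$2" -v queue="$3" 'BEGIN {
    if (util > 90 && await_ms > 50)      { print "BOTTLENECK" }
    else if (queue > 10)                 { print "OVERLOADED" }
    else if (util > 90 && await_ms < 5)  { print "HEALTHY" }
    else if (util < 70 && await_ms > 50) { print "STORAGE_ISSUE" }
    else                                 { print "UNDETERMINED" }
  }'
}

disk_verdict 99 80 20   # prints BOTTLENECK
disk_verdict 95 1 1     # prints HEALTHY
```

Inputs that match no row are reported as UNDETERMINED rather than forced into a verdict.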

📌 ✅ DBA-Focused Interpretation

Pattern	Root Cause
High rMB/s + large avgrq-sz	Full table scans
High r/s + small IO	Index access
High w_await	Log/write issue
High avgqu-sz	Storage saturation
High await everywhere	Storage slow

🔥 ✅ Golden Rules (Use in Production)

✅ Healthy Disk

%util < 80
await < 10 ms
avgqu-sz < 3

⚠️ Warning Zone

%util > 80
await 10–50 ms
avgqu-sz 3–10

🚨 Critical Disk Bottleneck

%util > 90
await > 50 ms
avgqu-sz > 10
await >> svctm

✅ ✅ Example Applied to Your Earlier Data

Disk	Verdict
dm-xx (await ~97 ms, util 100%)	🚨 Critical
dm-xxx (queue 40, await 72 ms)	🚨 Severe
dm-xxx (await 1.5 ms, util 99%)	✅ Healthy

Save as `disk_health_score.sh`

#!/bin/bash
# Scores each sd*/dm-* device from 100 downwards using %util, await and avgqu-sz.
# Column positions assume the 14-column sysstat layout ending in "svctm %util".
echo "===== Disk Health Score ====="
date
echo ""
iostat -xm 2 3 | awk '
function score(util, await, queue,    s) {
  s = 100
  # Util penalty
  if (util > 90) s -= 25
  else if (util > 70) s -= 10
  # Await penalty
  if (await > 50) s -= 50
  else if (await > 20) s -= 30
  else if (await > 5) s -= 15
  # Queue penalty
  if (queue > 10) s -= 40
  else if (queue > 5) s -= 20
  else if (queue > 1) s -= 10
  if (s < 0) s = 0
  return s
}
function status(s) {
  if (s >= 80) return "HEALTHY"
  else if (s >= 60) return "WARNING"
  else if (s >= 40) return "DEGRADED"
  else return "CRITICAL"
}
/Device/ {
  printf "%-10s %-6s %-8s %-8s %-6s\n","Device","Util%","Await","Queue","Status"
}
$1 ~ /^(sd|dm)/ {
  util = $NF        # %util is the last column
  await = $(NF-4)   # await (NF-3 would be r_await)
  queue = $(NF-5)   # avgqu-sz
  s = score(util, await, queue)
  st = status(s)
  printf "%-10s %-6.1f %-8.1f %-8.1f %-6s\n",$1,util,await,queue,st
}'

chmod +x disk_health_score.sh

./disk_health_score.sh

Sample Output

Device     Util%  Await    Queue    Status
dm-xx      100.0  97.2     24.3     CRITICAL
dm-xxx     99.9   72.4     40.5     CRITICAL
dm-xx      99.9   80.0     7.2      CRITICAL
dm-xxx     99.4   1.5      11.6     CRITICAL

Mapping storage disks to mount points to troubleshoot storage-related performance issues

🔗 1. Map `dm-*` → Actual Mount Points (VERY IMPORTANT)

✅ Command:

lsblk -o NAME,KNAME,MOUNTPOINT,SIZE,FSTYPE | grep dm-

✅ If using LVM:

dmsetup ls --tree

✅ Detailed mapping:

lsblk -o NAME,KNAME,PKNAME,MOUNTPOINT | column -t

✅ Correlate with filesystem:

df -h | grep /dev/mapper

👉 Why this matters

dm-62 → logical volume → mount point → DB datafile location
Helps answer:
“Which tablespace is causing this spike?”
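For a single device, the kernel name can also be translated through sysfs, where each device-mapper node exposes its mapper name under `/sys/block/dm-N/dm/name`. A minimal sketch: `dm_to_mapper` is a hypothetical helper, and its optional second argument (an alternate sysfs root) exists only so the function can be exercised without real devices.

```shell
#!/bin/bash
# dm_to_mapper: hypothetical helper. Prints the /dev/mapper path for a kernel
# name such as dm-69 by reading sysfs.
# Args: kernel device name, optional sysfs root (defaults to /sys).
dm_to_mapper() {
  local dev="$1" sysroot="${2:-/sys}"
  local name_file="$sysroot/block/$dev/dm/name"
  if [ ! -r "$name_file" ]; then
    echo "no device-mapper entry for $dev" >&2
    return 1
  fi
  printf '/dev/mapper/%s\n' "$(cat "$name_file")"
}

# Usage: show the filesystem mounted from that device, if there is one
#   df -h "$(dm_to_mapper dm-69)"
```

If the device backs ASM or a raw multipath LUN there may be no mounted filesystem, in which case the mapper name is still the value to match against the storage layout.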

🔗 2. Map Disk → Oracle / DB Files

✅ For Oracle:

SELECT file_name, tablespace_name
FROM dba_data_files
WHERE file_name LIKE '%<mount_point>%';

✅ Check temp / redo:

SELECT name FROM v$tempfile;

SELECT member FROM v$logfile;

👉 Now you can map:

dm-69 → /u02 → USERS tablespace → full scan

📊 3. iostat → DB Wait Event Mapping

iostat Pattern	DB Wait Event	Meaning
High rMB/s + large avgrq-sz	`db file scattered read`	Full table scan
High r/s small IO	`db file sequential read`	Index lookup
High w_await	`log file sync`	Commit latency
High avgqu-sz	`free buffer waits`	DB buffer pressure
High await	Any IO wait	Storage slow

🚨 4. Alert Thresholds (Production Standard)

✅ Disk Health Thresholds

Metric	Warning	Critical
%util	>80%	>90%
await	>20 ms	>50 ms
avgqu-sz	>5	>10
svctm vs await gap	noticeable	large gap

✅ Quick Alert Command

iostat -xm 2 5 | awk '
/Device/ {print; next}
$1 ~ /^(sd|dm)/ && ($NF > 90 || $10 > 50 || $9 > 10) {print}
'

👉 Triggers when:

%util > 90
await > 50 ms
queue > 10

📈 5. Real-Time Monitoring Script (Reusable)

✅ Save as `disk_monitor.sh`

#!/bin/bash
echo "==== Disk Bottleneck Check ===="
iostat -xm 2 3 | awk '
/Device/ {print; next}
$1 ~ /^(sd|dm)/ && ($NF > 90 || $10 > 50 || $9 > 10) {
printf "ALERT: %-8s util=%s%% await=%sms queue=%s\n", $1, $NF, $10, $9
}'

✅ Run:

chmod +x disk_monitor.sh

./disk_monitor.sh

🔍 6. Identify Top IO Consumers

✅ Process level:

iotop -oP

✅ File level:

lsof | grep <device>

✅ Per process IO:

pidstat -d 2

🧠 7. Advanced DBA Analysis Flow

When you see:

%util = 100
await = high
avgqu-sz = high

✅ Follow this sequence:

Identify disk
Map to mount point
Map to DB file
Identify SQL causing IO
Check execution plan

⚡ 8. Quick Root Cause Patterns

🔴 Pattern 1 (Your case earlier)

avgrq-sz ~ 1024 KB
await ~ 80–100 ms

👉 Cause:

Full table scans
Data warehouse queries

🟢 Pattern 2

r/s high + avgrq-sz small + low await

👉 Cause:

OLTP workload (healthy)

🔴 Pattern 3

w_await high

👉 Cause:

Commit issues
Log sync bottleneck

🎯 9. What You Should Do Immediately (From Your Data)

Based on your earlier output:

🚨 Critical disks:

dm-69
dm-275

✅ Action Plan:

Map these disks → mount point
Identify DB objects
Run:

SELECT sql_id, executions, disk_reads
FROM v$sql
ORDER BY disk_reads DESC FETCH FIRST 10 ROWS ONLY;

✅ Final Takeaway

You now have:

✅ Disk → Mount → DB mapping
✅ Alert thresholds
✅ Real-time monitoring script
✅ DB wait correlation
✅ Troubleshooting workflow

iostat Linux command deep dive for troubleshooting performance issues

iostat -xm 2 5 | awk '$1 ~ /^(sd|dm)/ && $NF > 40 {printf "%-10s %s\n",$1,$NF"%"}'

iostat -xm 2 5 | awk '$NF > 40 {print}'

iostat -xm 2 5 | awk '/Device/ {print; next}$1 ~ /^(sd|dm)/ && $NF > 90 {print}'

📌 Header Breakdown (Deep Explanation)

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

✅ 1. Device

Logical or physical disk name
- sdX → physical disks
- dm-X → device mapper (LVM, ASM, multipath)

👉 In your case:

dm-* = logical volumes / DB storage layers

✅ 2. rrqm/s (Read Requests Merged per second)

Number of read requests merged by OS scheduler

Why merging matters:

OS combines adjacent reads to reduce I/O calls

👉 Example:

10 small reads → merged → 1 large read

✅ Interpretation:

High value → efficient sequential I/O
Zero → either random I/O or already optimized

✅ 3. wrqm/s (Write Requests Merged per second)

Same as above but for writes

✅ High value:

Good for sequential writes (e.g., redo logs, batch loads)

✅ 4. r/s (Reads per second)

Number of read I/O operations per second

Interpretation:

High r/s = high IOPS (random access likely)

✅ 5. w/s (Writes per second)

Number of write operations per second

👉 Together with r/s:

Indicates workload type:
- OLTP → high r/s + w/s, small IO
- Analytics → lower r/s but large I/O size

✅ 6. rMB/s (Read throughput in MB/sec)

Total data read per second

✅ 7. wMB/s (Write throughput in MB/sec)

Total data written per second

🔎 Important:

Pattern	Meaning
High r/s + low rMB/s	small random IO
Low r/s + high rMB/s	large sequential IO

✅ 8. avgrq-sz (Average Request Size)

Average size of each I/O request, reported in 512-byte sectors (divide by 2 to get KB); newer sysstat releases report areq-sz in KB instead

Formula:

avgrq-sz = (total sectors read+written) / total I/O ops
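As a worked example of the formula, the request size can be converted to KB and cross-checked against the throughput and IOPS columns. This sketch assumes avgrq-sz is reported in 512-byte sectors, as the sysstat man page documents for this column; the function names `sectors_to_kb` and `avg_request_kb` are hypothetical.

```shell
#!/bin/bash
# sectors_to_kb: convert an avgrq-sz value (512-byte sectors) to KB.
sectors_to_kb() {
  awk -v s="$1" 'BEGIN { printf "%.1f\n", s * 512 / 1024 }'
}

# avg_request_kb: derive the average request size in KB from the throughput
# and IOPS columns. Args: rMB/s wMB/s r/s w/s. Prints 0.0 when there is no I/O.
avg_request_kb() {
  awk -v rmb="$1" -v wmb="$2" -v rs="$3" -v ws="$4" 'BEGIN {
    ops = rs + ws
    kb = 0
    if (ops > 0) kb = (rmb + wmb) * 1024 / ops
    printf "%.1f\n", kb
  }'
}

sectors_to_kb 1024          # prints 512.0
avg_request_kb 450 0 900 0  # prints 512.0
```

When the two numbers roughly agree, the sample is internally consistent; a large mismatch usually means the columns were read from different report intervals.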

Interpretation:

Value	Meaning
< 32 KB	random IO (OLTP)
64–256 KB	mixed
~1024 KB (1MB)	sequential scan

✅ 9. avgqu-sz (Average Queue Length)

Average number of I/O requests outstanding for the device (queued plus in service)

🚨 Critical metric:

Value	Impact
< 1	healthy
1–5	moderate
10+	pressure
20+	severe bottleneck

👉 High value means:

Disk is overloaded
Requests are waiting → latency increase

✅ 10. await (Average Wait Time in ms)

Total time for I/O request:
```
wait time = queue time + service time
```

🚨 Thresholds:

Value	Meaning
< 5 ms	excellent
5–20 ms	acceptable
20–50 ms	warning
> 50 ms	serious issue

👉 This is the most important latency metric

✅ 11. r_await (Read latency)

Avg time for read requests

✅ 12. w_await (Write latency)

Avg time for write requests

Why split matters:

Helps identify:
- read-heavy issues (full scan)
- write bottlenecks (redo / log file sync)

✅ 13. svctm (Service Time)

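Time taken by disk to service request (as estimated by iostat)
Does NOT include queue time
Note: the sysstat man page warns that svctm is unreliable, and newer sysstat releases drop the column, so treat it as a rough indicator only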

Important:

await ≈ svctm + queue delay

Interpretation:

Case	Meaning
await ≈ svctm	no queue bottleneck
await >> svctm	queue contention

👉 This is key for bottleneck detection

✅ 14. %util (Utilization)

Percentage of time disk was busy

🚨 Interpretation:

Value	Meaning
< 60%	safe
60–80%	moderate
80–90%	high
> 90%	saturated

👉 BUT:

Must combine with await + queue

🔥 Important Combined Interpretation

✅ Case 1 (Healthy high usage)

%util = 95%
await = 1 ms
avgqu-sz = 1

✔ Efficient disk

🚨 Case 2 (Bottleneck)

%util = 99%
await = 80 ms
avgqu-sz = 20

❌ Disk saturation + queue buildup

🧠 How You Should Read Header (DBA Cheat Sheet)

Step-by-step analysis:

Check %util
- > 90 → possible saturation
Check avgqu-sz
- High → queue backlog
Check await
- Confirms latency impact
Compare await vs svctm
- Big gap → queue delay
Check avgrq-sz
- Understand workload type

🎯 Why This Matters for You (Database Architect)

This header directly helps identify:

✅ DB Issues Mapping

Metric	DB Problem
High rMB/s + large avgrq-sz	full table scan
High r/s, low size	index lookup
High w_await	commit / redo issues
High avgqu-sz	storage contention
High await	slow queries

✅ Final Summary

r/s, w/s → IOPS
rMB/s, wMB/s → throughput
avgrq-sz → IO size (random vs sequential)
avgqu-sz → pressure indicator 🚨
await → real latency 🚨
%util → saturation signal
