Tuesday, May 26, 2026

Approach to correlate AWR + iostat to deep drive and troubleshoot oracle database performance issue

 

✅ 1. Objective

๐Ÿ‘‰ Correlate:

  • OS layer → iostat (disk bottleneck)
  • DB layer → AWR (SQL causing I/O)

๐Ÿ‘‰ Goal: Identify which SQL caused the spike on dm- disks*


๐Ÿง  ✅ 2. Correlation Logic (Core Concept)

Disk spike (iostat)
      ↓
Time window
      ↓
AWR snapshot
      ↓
Top IO SQL
      ↓
Execution plan
      ↓
Root cause

⏱️ ✅ 3. STEP 1: Identify Spike Time from iostat

Example:

dm-69 → 100% util
await → 97 ms

๐Ÿ‘‰ Note:

  • Exact time window (e.g.)
10:02 AM – 10:10 AM

๐Ÿ“Š ✅ 4. STEP 2: Find Matching AWR Snapshot

SELECT snap_id, begin_interval_time, end_interval_time
FROM dba_hist_snapshot
ORDER BY snap_id DESC;

๐Ÿ‘‰ Pick snapshots covering spike:

Snap 101 → 10:00
Snap 102 → 10:10

๐Ÿ”ฅ ✅ 5. STEP 3: Identify Top SQL by I/O

✅ Query 1: Top Disk Read SQL

SELECT *
FROM (
SELECT
sql_id,
executions_delta,
disk_reads_delta,
buffer_gets_delta,
elapsed_time_delta/1000000 elapsed_sec,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
)
WHERE ROWNUM <= 10;


✅ Query 2: Top Read Throughput

SELECT *
FROM (
SELECT
sql_id,
disk_reads_delta,
buffer_gets_delta,
rows_processed_delta,
elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
)
WHERE ROWNUM <= 10;


✅ Query 3: Full Table Scan Candidates

SELECT sql_id,
disk_reads_delta,
executions_delta,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
AND disk_reads_delta > 100000
ORDER BY disk_reads_delta DESC;


๐Ÿ” ✅ 6. STEP 4: Identify Wait Events

SELECT event, total_waits_delta, time_waited_delta/1000 time_waited_ms
FROM dba_hist_system_event
WHERE snap_id BETWEEN :snap1 AND :snap2
AND event LIKE 'db file%'
ORDER BY time_waited_ms DESC;


✅ Interpretation:

Wait EventMeaning
db file scattered readFull table scan ๐Ÿšจ
db file sequential readIndex access
direct path readBig scans (DW)
db file parallel readParallel scans

๐Ÿ“ˆ ✅ 7. STEP 5: Map SQL → Execution Plan

SELECT *
FROM dba_hist_sql_plan
WHERE sql_id = '&sql_id'
ORDER BY id;

✅ Look for:

  • TABLE ACCESS FULL ๐Ÿšจ
  • INDEX RANGE SCAN
  • FULL OUTER JOIN
  • Parallel operations

๐Ÿ”— ✅ 8. STEP 6: Correlate with Disk Pattern


๐Ÿ”ด Case 1 (Your Earlier Example)

iostat:
avgrq-sz ~ 1024 KB
high rMB/s
high await

๐Ÿ‘‰ AWR shows:

db file scattered read
direct path read

✅ Conclusion: Large full table scans causing disk saturation


๐ŸŸข Case 2

small avgrq-sz
high IOPS
low await

๐Ÿ‘‰ AWR:

db file sequential read

✅ Conclusion: Healthy OLTP workload


๐Ÿงช ✅ 9. Advanced Correlation Query (BEST ONE)


SELECT
s.sql_id,
t.sql_text,
s.executions_delta,
s.disk_reads_delta,
ROUND(s.disk_reads_delta/DECODE(s.executions_delta,0,1,s.executions_delta)) reads_per_exec,
s.elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat s,
dba_hist_sqltext t
WHERE s.sql_id = t.sql_id
AND s.snap_id BETWEEN :snap1 AND :snap2
ORDER BY s.disk_reads_delta DESC
FETCH FIRST 10 ROWS ONLY;


๐ŸŽฏ ✅ 10. Root Cause Identification Matrix

iostat PatternAWR FindingRoot Cause
Large IO + high latencyscattered readFull table scan
High read MB/sdirect path readLarge query
High write latencylog file syncCommit bottleneck
High queuemany active sessionsConcurrency overload

๐Ÿšจ ✅ 11. Real Example (Like Your Case)

iostat

%util = 100
await = 97 ms
avgqu-sz = 24
rMB/s = high

AWR

Event: db file scattered read
SQL_ID: abc123xyz
Disk Reads: very high
Plan: TABLE ACCESS FULL

✅ FINAL ROOT CAUSE

๐Ÿ‘‰ A large SQL doing full table scan
๐Ÿ‘‰ Saturating disk -> causing high latency


✅ ✅ 12. Fix Strategy

✅ SQL Level

  • Add indexes
  • Rewrite query
  • Avoid full scans

✅ DB Level

  • Increase buffer cache
  • Enable parallel limits

✅ Storage Level

  • Spread datafiles
  • Use faster disk tier

✅ ✅ 13. Ultimate One-Liner Workflow (Production)

iostat spike → note time
→ find AWR snapshot
→ find top IO SQL
→ check wait events
→ review execution plan
→ fix SQL

No comments:

Post a Comment

Is CPU issue ? troubleshooting workflow for oracle database performance issue with automation

✅ ✅ 1. CPU Troubleshooting Framework (Like iostat for CPU) Use: vmstat 2 5 or top ๐Ÿ“Š ✅ 2. CPU Metrics Explained (vmstat / top) us sy id wa s...