✅ 1. Objective
๐ Correlate:
- OS layer →
iostat(disk bottleneck) - DB layer → AWR (SQL causing I/O)
๐ Goal: Identify which SQL caused the spike on dm- disks*
๐ง ✅ 2. Correlation Logic (Core Concept)
Disk spike (iostat)
↓
Time window
↓
AWR snapshot
↓
Top IO SQL
↓
Execution plan
↓
Root cause
⏱️ ✅ 3. STEP 1: Identify Spike Time from iostat
Example:
dm-69 → 100% util
await → 97 ms
๐ Note:
- Exact time window (e.g.)
10:02 AM – 10:10 AM
๐ ✅ 4. STEP 2: Find Matching AWR Snapshot
SELECT snap_id, begin_interval_time, end_interval_time
FROM dba_hist_snapshot
ORDER BY snap_id DESC;
๐ Pick snapshots covering spike:
Snap 101 → 10:00
Snap 102 → 10:10
๐ฅ ✅ 5. STEP 3: Identify Top SQL by I/O
✅ Query 1: Top Disk Read SQL
SELECT *
FROM (
SELECT
sql_id,
executions_delta,
disk_reads_delta,
buffer_gets_delta,
elapsed_time_delta/1000000 elapsed_sec,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
)
WHERE ROWNUM <= 10;
✅ Query 2: Top Read Throughput
SELECT *
FROM (
SELECT
sql_id,
disk_reads_delta,
buffer_gets_delta,
rows_processed_delta,
elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
ORDER BY disk_reads_delta DESC
)
WHERE ROWNUM <= 10;
✅ Query 3: Full Table Scan Candidates
SELECT sql_id,
disk_reads_delta,
executions_delta,
ROUND(disk_reads_delta/DECODE(executions_delta,0,1,executions_delta)) reads_per_exec
FROM dba_hist_sqlstat
WHERE snap_id BETWEEN :snap1 AND :snap2
AND disk_reads_delta > 100000
ORDER BY disk_reads_delta DESC;
๐ ✅ 6. STEP 4: Identify Wait Events
SELECT event, total_waits_delta, time_waited_delta/1000 time_waited_ms
FROM dba_hist_system_event
WHERE snap_id BETWEEN :snap1 AND :snap2
AND event LIKE 'db file%'
ORDER BY time_waited_ms DESC;
✅ Interpretation:
| Wait Event | Meaning |
|---|---|
| db file scattered read | Full table scan ๐จ |
| db file sequential read | Index access |
| direct path read | Big scans (DW) |
| db file parallel read | Parallel scans |
๐ ✅ 7. STEP 5: Map SQL → Execution Plan
SELECT *
FROM dba_hist_sql_plan
WHERE sql_id = '&sql_id'
ORDER BY id;
✅ Look for:
TABLE ACCESS FULL๐จINDEX RANGE SCANFULL OUTER JOIN- Parallel operations
๐ ✅ 8. STEP 6: Correlate with Disk Pattern
๐ด Case 1 (Your Earlier Example)
iostat:
avgrq-sz ~ 1024 KB
high rMB/s
high await
๐ AWR shows:
db file scattered read
direct path read
✅ Conclusion: Large full table scans causing disk saturation
๐ข Case 2
small avgrq-sz
high IOPS
low await
๐ AWR:
db file sequential read
✅ Conclusion: Healthy OLTP workload
๐งช ✅ 9. Advanced Correlation Query (BEST ONE)
SELECT
s.sql_id,
t.sql_text,
s.executions_delta,
s.disk_reads_delta,
ROUND(s.disk_reads_delta/DECODE(s.executions_delta,0,1,s.executions_delta)) reads_per_exec,
s.elapsed_time_delta/1000000 elapsed_sec
FROM dba_hist_sqlstat s,
dba_hist_sqltext t
WHERE s.sql_id = t.sql_id
AND s.snap_id BETWEEN :snap1 AND :snap2
ORDER BY s.disk_reads_delta DESC
FETCH FIRST 10 ROWS ONLY;
๐ฏ ✅ 10. Root Cause Identification Matrix
| iostat Pattern | AWR Finding | Root Cause |
|---|---|---|
| Large IO + high latency | scattered read | Full table scan |
| High read MB/s | direct path read | Large query |
| High write latency | log file sync | Commit bottleneck |
| High queue | many active sessions | Concurrency overload |
๐จ ✅ 11. Real Example (Like Your Case)
iostat
%util = 100
await = 97 ms
avgqu-sz = 24
rMB/s = high
AWR
Event: db file scattered read
SQL_ID: abc123xyz
Disk Reads: very high
Plan: TABLE ACCESS FULL
✅ FINAL ROOT CAUSE
๐ A large SQL doing full table scan
๐ Saturating disk -> causing high latency
✅ ✅ 12. Fix Strategy
✅ SQL Level
- Add indexes
- Rewrite query
- Avoid full scans
✅ DB Level
- Increase buffer cache
- Enable parallel limits
✅ Storage Level
- Spread datafiles
- Use faster disk tier
✅ ✅ 13. Ultimate One-Liner Workflow (Production)
iostat spike → note time
→ find AWR snapshot
→ find top IO SQL
→ check wait events
→ review execution plan
→ fix SQL
No comments:
Post a Comment