Monday, April 27, 2026

Oracle Database Disk Storage Slowness Troubleshooting (RHEL) - I/O issue

 

Command:
ps -eo pid,stat,comm | grep D

Meaning

  • ps -e → show all processes
  • -o pid,stat,comm → display:
    • pid → process ID
    • stat → process state
    • comm → command name
  • grep D → keep lines containing a capital D (note: this also matches command names that contain a D, so always verify the STAT column in the output)

What D means

D = Uninterruptible sleep
This usually means the process is:

  • Waiting on I/O
  • Typically stuck on disk, NFS, SAN, or kernel I/O
  • Cannot be killed (kill -9 won’t work) until the I/O returns

This is often serious on production systems.
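Because `grep D` can over-report, a safer sketch anchors the match to the STAT column with awk. The sample output below is embedded for illustration (the PIDs and process names are made up); on a live system pipe the real `ps` output in as shown in the last comment:

```shell
# Keep only rows whose STAT column starts with "D" (uninterruptible sleep).
# NR > 1 skips the header line.
awk 'NR > 1 && $2 ~ /^D/ { print $1, $2, $3 }' <<'EOF'
  PID STAT COMMAND
 2101 Ss   ora_pmon_ORCL
 2188 D    ora_dbw0_ORCL
 2190 Ds   ora_lgwr_ORCL
 3542 S    sshd
EOF
# prints: 2188 D ora_dbw0_ORCL
#         2190 Ds ora_lgwr_ORCL

# On a live system:
#   ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
```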


iostat -xz 1 5


STAT "Ss"

This shows the Oracle process state at the time of capture:

  • S = interruptible sleep (waiting for an event to complete)
  • s = session leader (the lowercase flag describes a process attribute, not a second sleep state)

So the process was waiting (idle or blocked), not crashing at that exact moment.



1. Typical Symptoms (What triggers investigation)

  • High load average on DB server
  • User complaints: slow queries, commits, batch delays
  • AWR shows:
    • db file sequential read
    • db file scattered read
    • log file sync
    • log file parallel write
  • OS metrics show high IO wait (%wa)
  • RMAN / backups running slow

2. Step‑1: Validate System Load & CPU Wait

✅ Identify load average vs cores

uptime
nproc

Interpretation

  • Load ≈ number of CPU cores → OK
  • Load >> cores + high IO wait → likely disk bottleneck
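The load-vs-cores comparison above can be scripted. A minimal sketch, assuming a Linux host with /proc/loadavg and coreutils nproc; the 2×-cores cutoff is the rough heuristic from the text, not a hard rule:

```shell
# Compare the 1-minute load average against the CPU core count.
cores=$(nproc)
load1=$(awk '{ print $1 }' /proc/loadavg)
echo "load1=${load1} cores=${cores}"

awk -v l="$load1" -v c="$cores" 'BEGIN {
    if (l > 2 * c)   print "VERDICT: load >> cores - check IO wait next"
    else if (l > c)  print "VERDICT: load above core count - keep watching"
    else             print "VERDICT: load within core count - OK"
}'
```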

✅ Check CPU & IO wait

top

or (better)

vmstat 1 10

Look for:

  • %wa (IO wait) consistently > 15–20%
  • b column (processes blocked on IO) staying high even while %id shows idle CPU

Example

 r  b swpd   free   buff    cache si so bi  bo  in   cs us sy id wa st
 8 12    0 831488 124928 18874368  0  0 45 620 900 1200 10  6 40 44  0

➡️ High b and wa = blocked on disk
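Averaging the wa column over a capture makes the "consistently > 15–20%" test concrete. A sketch using awk on an embedded sample; in practice pipe `vmstat 1 10` in (wa is field 16 in standard procps vmstat output, and the first two lines are headers):

```shell
# Average IO wait across all sample rows of a vmstat capture.
avg_wa=$(awk 'NR > 2 { sum += $16; n++ } END { printf "%.0f", sum / n }' <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8 12      0 831488 124928 18874368  0    0    45   620  900 1200 10  6 40 44  0
 9 11      0 830112 124928 18874368  0    0    52   700  950 1300 11  7 38 44  0
EOF
)
echo "average wa: ${avg_wa}%"   # prints: average wa: 44%  (sustained > 15-20% suggests a disk bottleneck)
```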


3. Step‑2: Disk Latency at OS Level (Most Important)

✅ iostat – PRIMARY disk latency tool

iostat -xm 1 10

Key columns:

Metric    Meaning           Problem Threshold
r_await   Read latency      > 20 ms (OLTP), > 50 ms (DW)
w_await   Write latency     > 10–15 ms
await     Avg IO latency    > 20 ms
%util     Disk busy         > 80–90% sustained
aqu-sz    Avg queue size    Growing steadily = queueing

Example (Bad)

Device:  r/s   w/s  r_await  w_await  await  aqu-sz %util
sdb      420   350   48.12    32.22    40.01   18.3   97.4

➡️ Storage saturation confirmed
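Those thresholds can be applied mechanically. A sketch that flags any device breaching await > 20 ms or %util > 90, run against an embedded sample; the field positions match the simplified listing above, so adjust the numbers for your sysstat version's full `iostat -xm` column layout:

```shell
# Flag devices whose average latency or utilisation crosses the thresholds.
# $6 = await, $8 = %util in this simplified capture.
awk 'NR > 1 && ($6 > 20 || $8 > 90) {
    print $1, "SATURATED (await=" $6 "ms, util=" $8 "%)"
}' <<'EOF'
Device  r/s  w/s  r_await  w_await  await  aqu-sz  %util
sdb     420  350    48.12    32.22  40.01    18.3   97.4
sda      12    8     1.20     0.80   1.00     0.1    4.2
EOF
# prints: sdb SATURATED (await=40.01ms, util=97.4%)
```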


4. Step‑3: Identify Which Filesystems / Disks

✅ Map disks → mount points

df -hT
lsblk -f

✅ Per‑device IO usage

iostat -xm 1 10 | grep -E "sd|nvme"

Check:

  • Datafiles disk
  • Redo log disk
  • FRA disk
  • Temp disk
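To tie iostat device numbers back to these four roles, resolve which filesystem backs each Oracle location. The paths below are placeholders, not a standard layout; substitute your own mount points:

```shell
# Map each Oracle file location to its backing filesystem/device,
# so per-device iostat figures can be attributed to data, redo, FRA, or temp.
for dir in /u01/oradata /u02/redo /u03/fra /u04/temp; do
    [ -d "$dir" ] || continue                                # skip paths that do not exist on this host
    dev=$(df --output=source "$dir" 2>/dev/null | tail -n 1) # backing source per df
    echo "$dir -> $dev"
done
```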

5. Step‑4: Per‑Process Confirmation (Oracle vs others)

✅ pidstat – correlate Oracle background processes

pidstat -d 1 10 | grep ora_

Key offenders:

  • ora_dbw* → datafile writes
  • ora_lgwr → redo log writes
  • ora_ckpt
  • RMAN channels

High KB/s + delays = database IO bottleneck
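Ranking the ora_* processes by write rate makes the offender obvious. A sketch over an embedded pidstat-style sample (fields: time, UID, PID, kB_rd/s, kB_wr/s, kB_ccwr/s, iodelay, Command; the PIDs and rates are invented):

```shell
# Print kB_wr/s and command name for ora_* processes, highest writer first.
awk '$NF ~ /^ora_/ { print $5, $NF }' <<'EOF' | sort -rn
10:00:01  54  12001   0.00  18432.00   0.00   12  ora_dbw0
10:00:01  54  12005   0.00  22528.00   0.00   25  ora_lgwr
10:00:01  54  12010 512.00    128.00   0.00    1  ora_ckpt
EOF
# prints: 22528.00 ora_lgwr
#         18432.00 ora_dbw0
#         128.00 ora_ckpt
```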


6. Step‑5: Advanced Disk & Queue Observation

✅ sar (historical if available)

sar -d 1 5

✅ IO pressure (RHEL 8+)

cat /proc/pressure/io

If avg10 and avg60 > 10–20 → sustained storage pressure
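Parsing the PSI figures turns this into a yes/no check. A sketch against a sample line (on a live RHEL 8+ host, read /proc/pressure/io directly; the "some" line tracks the share of time at least one task stalled on IO):

```shell
# Extract avg10 from the "some" line of /proc/pressure/io and classify it.
psi_line='some avg10=23.41 avg60=18.05 avg300=9.92 total=123456789'
avg10=$(echo "$psi_line" | grep -o 'avg10=[0-9.]*' | cut -d= -f2)
echo "avg10=${avg10}"
awk -v a="$avg10" 'BEGIN { print ((a > 10) ? "sustained IO pressure" : "IO pressure OK") }'
# prints: avg10=23.41
#         sustained IO pressure
```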


7. Step‑6: Oracle Database Wait Event Validation

✅ Top waits (Instance level)

SELECT event, total_waits, time_waited/100 AS time_waited_sec
FROM v$system_event
WHERE event LIKE 'db file%'
OR event LIKE 'log file%'
ORDER BY time_waited DESC;


✅ Real‑time waits (active sessions)

SELECT sid, event, wait_time, seconds_in_wait
FROM v$session
WHERE wait_class = 'User I/O'
ORDER BY seconds_in_wait DESC;


8. Step‑7: File Type & Latency inside Oracle

✅ File-level IO latency

SELECT df.name,
       fs.phyrds,
       fs.phywrts,
       fs.readtim/100  AS read_sec,
       fs.writetim/100 AS write_sec
FROM v$datafile df, v$filestat fs
WHERE df.file# = fs.file#
ORDER BY fs.readtim DESC;

(READTIM/WRITETIM are in hundredths of a second, so divide by 100, as with time_waited above.)

✅ Tablespace hotspot

(v$segment_statistics stores name/value pairs, so the read and write statistics must be pivoted rather than summed as columns.)

SELECT tablespace_name,
       SUM(CASE WHEN statistic_name = 'physical reads'  THEN value ELSE 0 END) reads,
       SUM(CASE WHEN statistic_name = 'physical writes' THEN value ELSE 0 END) writes
FROM v$segment_statistics
WHERE statistic_name IN ('physical reads', 'physical writes')
GROUP BY tablespace_name
ORDER BY reads DESC;

9. Step‑8: Redo Log Latency (Very Common OLTP Issue)

✅ LGWR wait

SELECT event, total_waits, time_waited/100 AS time_waited_sec
FROM v$system_event
WHERE event IN ('log file sync','log file parallel write');

Interpretation

  • log file sync waits high → commit delayed
  • log file parallel write high → redo disk slow

✅ Validate redo disks with:

iostat -xm <redo_disk> 1 10

(With sysstat's iostat the device name comes before the interval and count.)


10. Step‑9: ASM (If Applicable)

✅ ASM disk stats

SELECT name, total_mb, free_mb, read_errs, write_errs
FROM v$asm_disk;

✅ ASM IO latency

SELECT dg.name, fs.*
FROM v$asm_diskgroup dg, v$asm_disk_iostat fs
WHERE dg.group_number = fs.group_number;


11. Correlation Checklist (OS ⇄ Oracle)

Disk problem confirmed if ALL match

  • High %wa in vmstat
  • High await in iostat
  • High db file* or log file* waits in Oracle
  • %util near 100% on affected disks
  • Load average high but CPU idle present

12. Common Root Causes

Cause                           How it Appears
Storage array saturation        High await + %util
Poor redo disk                  log file sync waits
Temp spills                     direct path read/write temp waits
RMAN / backup                   Large sequential IO, %util spike
Thin provisioning               Latency spikes under load
Too many LUNs on same backend   Random latency

13. Immediate Mitigations

✅ Short‑term:

  • Pause backups / RMAN
  • Kill runaway sessions
  • Reduce parallelism
  • Move redo logs to faster disks

✅ Medium‑term:

  • Separate redo, data, temp
  • Increase redo log size
  • Add disks / IOPS
  • ASM rebalance / re‑stripe

✅ Long‑term:

  • Storage tiering (NVMe for redo)
  • Oracle I/O calibration (ORION)
  • Capacity & growth planning

14. One‑Command Quick Triage Bundle

uptime
vmstat 1 5
iostat -xm 1 5
pidstat -d 1 5 | grep ora_

Then Oracle

SELECT event, time_waited/100 AS wait_sec
FROM v$system_event
ORDER BY time_waited DESC;



15. Key Rule of Thumb (Production Oracle)

IO Type            Acceptable Latency
Redo writes        < 5 ms
OLTP reads         < 10–15 ms
Mixed workload     < 20 ms
Anything > 30 ms   Problem
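These thresholds can be encoded as a small classifier for scripted alert checks. A sketch; the redo/oltp/mixed type names are this example's own convention, not Oracle terminology:

```shell
# Classify a measured average latency (ms) against the rule-of-thumb table.
classify_latency() {
    awk -v ms="$1" -v type="$2" 'BEGIN {
        limit = (type == "redo") ? 5 : ((type == "oltp") ? 15 : 20)
        if (ms > 30)          print "PROBLEM"
        else if (ms > limit)  print "degraded for " type
        else                  print "acceptable for " type
    }'
}

classify_latency 48 oltp   # prints: PROBLEM
classify_latency 3  redo   # prints: acceptable for redo
```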

