Monday, April 27, 2026

Oracle Database Disk Storage Slowness Troubleshooting (RHEL) - I/O issue

 

Command:
ps -eo pid,stat,comm | grep D

Meaning

  • ps -e → show all processes
  • -o pid,stat,comm → display:
    • pid → process ID
    • stat → process state
    • comm → command name
  • grep D → keep lines containing a capital D (note: this also matches command names that contain a D, so always verify the STAT column in the output)

What D means

D = Uninterruptible sleep
This usually means the process is:

  • Waiting on I/O
  • Typically stuck on disk, NFS, SAN, or kernel I/O
  • Cannot be killed (kill -9 won’t work) until the I/O returns

This is often serious on production systems.
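Because `grep D` can over-report, a safer sketch anchors the match to the STAT column with awk. The sample output below is embedded for illustration (the PIDs and process names are made up); on a live system pipe the real `ps` output in as shown in the last comment:

```shell
# Keep only rows whose STAT column starts with "D" (uninterruptible sleep).
# NR > 1 skips the header line.
awk 'NR > 1 && $2 ~ /^D/ { print $1, $2, $3 }' <<'EOF'
  PID STAT COMMAND
 2101 Ss   ora_pmon_ORCL
 2188 D    ora_dbw0_ORCL
 2190 Ds   ora_lgwr_ORCL
 3542 S    sshd
EOF
# prints: 2188 D ora_dbw0_ORCL
#         2190 Ds ora_lgwr_ORCL

# On a live system:
#   ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
```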


iostat -xz 1 5


STAT "Ss"

This shows the Oracle process state at the time of capture:

  • S = interruptible sleep (waiting for an event to complete)
  • s = session leader (the lowercase flag describes a process attribute, not a second sleep state)

So the process was waiting (idle or blocked), not crashing at that exact moment.



1. Typical Symptoms (What triggers investigation)

  • High load average on DB server
  • User complaints: slow queries, commits, batch delays
  • AWR shows:
    • db file sequential read
    • db file scattered read
    • log file sync
    • log file parallel write
  • OS metrics show high IO wait (%wa)
  • RMAN / backups running slow

2. Step‑1: Validate System Load & CPU Wait

✅ Identify load average vs cores

uptime
nproc

Interpretation

  • Load ≈ number of CPU cores → OK
  • Load >> cores + high IO wait → likely disk bottleneck
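The load-vs-cores comparison above can be scripted. A minimal sketch, assuming a Linux host with /proc/loadavg and coreutils nproc; the 2×-cores cutoff is the rough heuristic from the text, not a hard rule:

```shell
# Compare the 1-minute load average against the CPU core count.
cores=$(nproc)
load1=$(awk '{ print $1 }' /proc/loadavg)
echo "load1=${load1} cores=${cores}"

awk -v l="$load1" -v c="$cores" 'BEGIN {
    if (l > 2 * c)   print "VERDICT: load >> cores - check IO wait next"
    else if (l > c)  print "VERDICT: load above core count - keep watching"
    else             print "VERDICT: load within core count - OK"
}'
```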

✅ Check CPU & IO wait

top

or (better)

vmstat 1 10

Look for:

  • %wa (IO wait) consistently > 15–20%
  • b column (processes blocked on IO) staying high even while %id shows idle CPU

Example

 r  b swpd   free   buff    cache si so bi  bo  in   cs us sy id wa st
 8 12    0 831488 124928 18874368  0  0 45 620 900 1200 10  6 40 44  0

➡️ High b and wa = blocked on disk
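Averaging the wa column over a capture makes the "consistently > 15–20%" test concrete. A sketch using awk on an embedded sample; in practice pipe `vmstat 1 10` in (wa is field 16 in standard procps vmstat output, and the first two lines are headers):

```shell
# Average IO wait across all sample rows of a vmstat capture.
avg_wa=$(awk 'NR > 2 { sum += $16; n++ } END { printf "%.0f", sum / n }' <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8 12      0 831488 124928 18874368  0    0    45   620  900 1200 10  6 40 44  0
 9 11      0 830112 124928 18874368  0    0    52   700  950 1300 11  7 38 44  0
EOF
)
echo "average wa: ${avg_wa}%"   # prints: average wa: 44%  (sustained > 15-20% suggests a disk bottleneck)
```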


3. Step‑2: Disk Latency at OS Level (Most Important)

✅ iostat – PRIMARY disk latency tool

iostat -xm 1 10

Key columns:

Metric    Meaning           Problem Threshold
r_await   Read latency      > 20 ms (OLTP), > 50 ms (DW)
w_await   Write latency     > 10–15 ms
await     Avg IO latency    > 20 ms
%util     Disk busy         > 80–90% sustained
aqu-sz    Avg queue size    Growing steadily = queueing

Example (Bad)

Device:  r/s   w/s  r_await  w_await  await  aqu-sz %util
sdb      420   350   48.12    32.22    40.01   18.3   97.4

➡️ Storage saturation confirmed
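Those thresholds can be applied mechanically. A sketch that flags any device breaching await > 20 ms or %util > 90, run against an embedded sample; the field positions match the simplified listing above, so adjust the numbers for your sysstat version's full `iostat -xm` column layout:

```shell
# Flag devices whose average latency or utilisation crosses the thresholds.
# $6 = await, $8 = %util in this simplified capture.
awk 'NR > 1 && ($6 > 20 || $8 > 90) {
    print $1, "SATURATED (await=" $6 "ms, util=" $8 "%)"
}' <<'EOF'
Device  r/s  w/s  r_await  w_await  await  aqu-sz  %util
sdb     420  350    48.12    32.22  40.01    18.3   97.4
sda      12    8     1.20     0.80   1.00     0.1    4.2
EOF
# prints: sdb SATURATED (await=40.01ms, util=97.4%)
```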


4. Step‑3: Identify Which Filesystems / Disks

✅ Map disks → mount points

df -hT
lsblk -f

✅ Per‑device IO usage

iostat -xm 1 10 | grep -E "sd|nvme"

Check:

  • Datafiles disk
  • Redo log disk
  • FRA disk
  • Temp disk
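To tie iostat device numbers back to these four roles, resolve which filesystem backs each Oracle location. The paths below are placeholders, not a standard layout; substitute your own mount points:

```shell
# Map each Oracle file location to its backing filesystem/device,
# so per-device iostat figures can be attributed to data, redo, FRA, or temp.
for dir in /u01/oradata /u02/redo /u03/fra /u04/temp; do
    [ -d "$dir" ] || continue                                # skip paths that do not exist on this host
    dev=$(df --output=source "$dir" 2>/dev/null | tail -n 1) # backing source per df
    echo "$dir -> $dev"
done
```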

5. Step‑4: Per‑Process Confirmation (Oracle vs others)

✅ pidstat – correlate Oracle background processes

pidstat -d 1 10 | grep ora_

Key offenders:

  • ora_dbw* → datafile writes
  • ora_lgwr → redo log writes
  • ora_ckpt
  • RMAN channels

High KB/s + delays = database IO bottleneck
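Ranking the ora_* processes by write rate makes the offender obvious. A sketch over an embedded pidstat-style sample (fields: time, UID, PID, kB_rd/s, kB_wr/s, kB_ccwr/s, iodelay, Command; the PIDs and rates are invented):

```shell
# Print kB_wr/s and command name for ora_* processes, highest writer first.
awk '$NF ~ /^ora_/ { print $5, $NF }' <<'EOF' | sort -rn
10:00:01  54  12001   0.00  18432.00   0.00   12  ora_dbw0
10:00:01  54  12005   0.00  22528.00   0.00   25  ora_lgwr
10:00:01  54  12010 512.00    128.00   0.00    1  ora_ckpt
EOF
# prints: 22528.00 ora_lgwr
#         18432.00 ora_dbw0
#         128.00 ora_ckpt
```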


6. Step‑5: Advanced Disk & Queue Observation

✅ sar (historical if available)

sar -d 1 5

✅ IO pressure (RHEL 8+)

cat /proc/pressure/io

If avg10 and avg60 > 10–20 → sustained storage pressure
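Parsing the PSI figures turns this into a yes/no check. A sketch against a sample line (on a live RHEL 8+ host, read /proc/pressure/io directly; the "some" line tracks the share of time at least one task stalled on IO):

```shell
# Extract avg10 from the "some" line of /proc/pressure/io and classify it.
psi_line='some avg10=23.41 avg60=18.05 avg300=9.92 total=123456789'
avg10=$(echo "$psi_line" | grep -o 'avg10=[0-9.]*' | cut -d= -f2)
echo "avg10=${avg10}"
awk -v a="$avg10" 'BEGIN { print ((a > 10) ? "sustained IO pressure" : "IO pressure OK") }'
# prints: avg10=23.41
#         sustained IO pressure
```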


7. Step‑6: Oracle Database Wait Event Validation

✅ Top waits (Instance level)

SELECT event, total_waits, time_waited/100 AS time_waited_sec
FROM v$system_event
WHERE event LIKE 'db file%'
OR event LIKE 'log file%'
ORDER BY time_waited DESC;


✅ Real‑time waits (active sessions)

SELECT sid, event, wait_time, seconds_in_wait
FROM v$session
WHERE wait_class = 'User I/O'
ORDER BY seconds_in_wait DESC;


8. Step‑7: File Type & Latency inside Oracle

✅ File-level IO latency

SELECT df.name,
       fs.phyrds,
       fs.phywrts,
       fs.readtim/100  AS read_sec,
       fs.writetim/100 AS write_sec
FROM v$datafile df, v$filestat fs
WHERE df.file# = fs.file#
ORDER BY fs.readtim DESC;

(READTIM/WRITETIM are in hundredths of a second, so divide by 100, as with time_waited above.)

✅ Tablespace hotspot

(v$segment_statistics stores name/value pairs, so the read and write statistics must be pivoted rather than summed as columns.)

SELECT tablespace_name,
       SUM(CASE WHEN statistic_name = 'physical reads'  THEN value ELSE 0 END) reads,
       SUM(CASE WHEN statistic_name = 'physical writes' THEN value ELSE 0 END) writes
FROM v$segment_statistics
WHERE statistic_name IN ('physical reads', 'physical writes')
GROUP BY tablespace_name
ORDER BY reads DESC;

9. Step‑8: Redo Log Latency (Very Common OLTP Issue)

✅ LGWR wait

SELECT event, total_waits, time_waited/100 AS time_waited_sec
FROM v$system_event
WHERE event IN ('log file sync','log file parallel write');

Interpretation

  • log file sync waits high → commit delayed
  • log file parallel write high → redo disk slow

✅ Validate redo disks with:

iostat -xm <redo_disk> 1 10

(With sysstat's iostat the device name comes before the interval and count.)


10. Step‑9: ASM (If Applicable)

✅ ASM disk stats

SELECT name, total_mb, free_mb, read_errs, write_errs
FROM v$asm_disk;

✅ ASM IO latency

SELECT dg.name, fs.*
FROM v$asm_diskgroup dg, v$asm_disk_iostat fs
WHERE dg.group_number = fs.group_number;


11. Correlation Checklist (OS ⇄ Oracle)

Disk problem confirmed if ALL match

  • High %wa in vmstat
  • High await in iostat
  • High db file* or log file* waits in Oracle
  • %util near 100% on affected disks
  • Load average high but CPU idle present

12. Common Root Causes

Cause                           How it Appears
Storage array saturation        High await + %util
Poor redo disk                  log file sync waits
Temp spills                     direct path read/write temp waits
RMAN / backup                   Large sequential IO, %util spike
Thin provisioning               Latency spikes under load
Too many LUNs on same backend   Random latency

13. Immediate Mitigations

✅ Short‑term:

  • Pause backups / RMAN
  • Kill runaway sessions
  • Reduce parallelism
  • Move redo logs to faster disks

✅ Medium‑term:

  • Separate redo, data, temp
  • Increase redo log size
  • Add disks / IOPS
  • ASM rebalance / re‑stripe

✅ Long‑term:

  • Storage tiering (NVMe for redo)
  • Oracle I/O calibration (ORION)
  • Capacity & growth planning

14. One‑Command Quick Triage Bundle

uptime
vmstat 1 5
iostat -xm 1 5
pidstat -d 1 5 | grep ora_

Then Oracle

SELECT event, time_waited/100 AS wait_sec
FROM v$system_event
ORDER BY time_waited DESC;



15. Key Rule of Thumb (Production Oracle)

IO Type            Acceptable Latency
Redo writes        < 5 ms
OLTP reads         < 10–15 ms
Mixed workload     < 20 ms
Anything > 30 ms   Problem
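These thresholds can be encoded as a small classifier for scripted alert checks. A sketch; the redo/oltp/mixed type names are this example's own convention, not Oracle terminology:

```shell
# Classify a measured average latency (ms) against the rule-of-thumb table.
classify_latency() {
    awk -v ms="$1" -v type="$2" 'BEGIN {
        limit = (type == "redo") ? 5 : ((type == "oltp") ? 15 : 20)
        if (ms > 30)          print "PROBLEM"
        else if (ms > limit)  print "degraded for " type
        else                  print "acceptable for " type
    }'
}

classify_latency 48 oltp   # prints: PROBLEM
classify_latency 3  redo   # prints: acceptable for redo
```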

