Oracle Database Disk Storage Slowness Troubleshooting (RHEL)
ps -eo pid,stat,comm | grep D
Meaning
ps -e→ show all processes-o pid,stat,comm→ display:pid→ process IDstat→ process statecomm→ command name
grep D→ filter processes whose STAT column containsD
What D means
D = Uninterruptible sleep
This usually means the process is:
- Waiting on I/O
- Typically stuck on disk, NFS, SAN, or kernel I/O
- Cannot be killed (
kill -9won’t work) until the I/O returns
This is often serious on production systems.
iostat -xz 1 5
Ss
This shows the Oracle process state at the time of capture:
S= sleepings= secondary sleep state
So the process was waiting (idle or blocked), not crashing at that exact moment.
1. Typical Symptoms (What triggers investigation)
- High load average on DB server
- User complaints: slow queries, commits, batch delays
- AWR shows:
db file sequential readdb file scattered readlog file synclog file parallel write
- OS metrics show high IO wait (
%wa) - RMAN / backups running slow
2. Step‑1: Validate System Load & CPU Wait
✅ Identify load average vs cores
Interpretation
- Load ≈ number of CPU cores → OK
- Load >> cores + high IO wait → likely disk bottleneck
✅ Check CPU & IO wait
or (better)
vmstat 1 10
Look for:
%wa(IO wait) consistently > 15–20%- Low
%idwhile CPUs are idle but blocked
Example
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 12 0 812M 122M 18G 0 0 45 620 900 1200 10 6 40 44 0
➡️ High b and wa = blocked on disk
3. Step‑2: Disk Latency at OS Level (Most Important)
✅ iostat – PRIMARY disk latency tool
Key columns:
| Metric | Meaning | Problem Threshold |
|---|---|---|
r_await | Read latency | > 20 ms (OLTP), > 50 ms (DW) |
w_await | Write latency | > 10–15 ms |
await | Avg IO latency | > 20 ms |
%util | Disk busy | > 80–90% sustained |
aqu-sz | Avg queue size | Growing steadily = queueing |
Example (Bad)
Device: r/s w/s r_await w_await await aqu-sz %util
sdb 420 350 48.12 32.22 40.01 18.3 97.4
➡️ Storage saturation confirmed
4. Step‑3: Identify Which Filesystems / Disks
✅ Map disks → mount points
✅ Per‑filesystem IO usage
Check:
- Datafiles disk
- Redo log disk
- FRA disk
- Temp disk
5. Step‑4: Per‑Process Confirmation (Oracle vs others)
✅ pidstat – correlate Oracle background processes
Key offenders:
ora_dbw*→ datafile writesora_lgwr→ redo log writesora_ckpt- RMAN channels
High KB/s + delays = database IO bottleneck
6. Step‑5: Advanced Disk & Queue Observation
✅ sar (historical if available)
✅ IO pressure (RHEL 8+)
If avg10 and avg60 > 10–20 → sustained storage pressure
7. Step‑6: Oracle Database Wait Event Validation
✅ Top waits (Instance level)
✅ Real‑time waits (active sessions)
8. Step‑7: File Type & Latency inside Oracle
✅ File-level IO latency
✅ Tablespace hotspot
9. Step‑8: Redo Log Latency (Very Common OLTP Issue)
✅ LGWR wait
Interpretation
log file syncwaits high → commit delayedlog file parallel writehigh → redo disk slow
✅ Validate redo disks with:
iostat -xm 1 10 <redo_disk>
10. Step‑9: ASM (If Applicable)
✅ ASM disk stats
✅ ASM IO latency
11. Correlation Checklist (OS ⇄ Oracle)
Disk problem confirmed if ALL match
- High
%wainvmstat - High
awaitiniostat - High
db file*orlog file*waits in Oracle %utilnear 100% on affected disks- Load average high but CPU idle present
12. Common Root Causes
| Cause | How it Appears |
|---|---|
| Storage array saturation | High await + util |
| Poor redo disk | log file sync waits |
| Temp spills | db file scattered read |
| RMAN / backup | DBWn writes spike |
| Thin provisioning | Latency spikes under load |
| Too many LUNs on same backend | Random latency |
13. Immediate Mitigations
✅ Short‑term:
- Pause backups / RMAN
- Kill runaway sessions
- Reduce parallelism
- Move redo logs to faster disks
✅ Medium‑term:
- Separate redo, data, temp
- Increase redo log size
- Add disks / IOPS
- ASM rebalance / re‑stripe
✅ Long‑term:
- Storage tiering (NVMe for redo)
- Oracle I/O calibration (ORION)
- Capacity & growth planning
14. One‑Command Quick Triage Bundle
Then Oracle
15. Key Rule of Thumb (Production Oracle)
| IO Type | Acceptable Latency |
|---|---|
| Redo writes | < 5 ms |
| OLTP reads | < 10–15 ms |
| Mixed workload | < 20 ms |
| Anything > 30 ms | Problem |
No comments:
Post a Comment