Tuesday, May 26, 2026

Storage Disk Performance Baseline Table for Troubleshooting Performance Issues

 

Disk Performance Baseline Table (iostat -xm)

📊 1. Latency (Most Important)

| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 | Notes |
|---|---|---|---|---|
| await (ms) | < 5 | 5 – 20 | > 50 | Total latency (queue + service) |
| r_await (ms) | < 5 | 5 – 20 | > 50 | Read latency |
| w_await (ms) | < 5 | 5 – 20 | > 50 | Write latency |
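The latency thresholds above can be applied mechanically. Here is a minimal sketch, assuming a hypothetical helper named classify_latency that takes a latency in milliseconds. The table does not label the 20–50 ms band, so the sketch reports it as ELEVATED; that label is an assumption, not part of the table.

```shell
#!/bin/bash
# classify_latency is a hypothetical helper, not part of iostat.
# Usage: classify_latency <latency_ms>
classify_latency() {
    awk -v ms="$1" 'BEGIN {
        if (ms < 5)        print "GOOD"
        else if (ms <= 20) print "WARNING"
        else if (ms <= 50) print "ELEVATED"   # assumption: band not defined in the table
        else               print "CRITICAL"
    }'
}
```

For example, classify_latency 1.5 prints GOOD and classify_latency 97.2 prints CRITICAL. awk is used for the comparison because bash arithmetic does not handle decimals.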

📊 2. Disk Utilization

| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 | Notes |
|---|---|---|---|---|
| %util | < 70% | 70–90% | > 90% | High alone is OK if latency is low |

📊 3. Queue Depth (Pressure Indicator)

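| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 | Notes |
|---|---|---|---|---|
| avgqu-sz | < 1 | 1 – 5 | > 10 | Average number of requests waiting to be served (shown as aqu-sz in newer sysstat releases) |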

📊 4. Service Time vs Wait Time

| Pattern | Interpretation |
|---|---|
| await ≈ svctm | ✅ Healthy (no queueing) |
| await >> svctm | 🚨 Queue bottleneck |
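Since await is queue time plus service time, the gap between the two columns is roughly the time a request spends waiting in the queue. A minimal sketch, assuming a hypothetical helper named queue_wait that takes await and svctm in milliseconds. Note that recent sysstat releases no longer report svctm, so this only applies where that column is still present.

```shell
#!/bin/bash
# queue_wait is a hypothetical helper, not part of iostat.
# Usage: queue_wait <await_ms> <svctm_ms>
# Prints the approximate time (ms) a request spends queued: await - svctm.
queue_wait() {
    awk -v a="$1" -v s="$2" 'BEGIN { printf "%.1f\n", a - s }'
}
```

A result close to zero matches the "await ≈ svctm" row; a result many times larger than svctm matches "await >> svctm".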

📊 5. Throughput (rMB/s, wMB/s)

For modern systems (SSD / SAN / NVMe)

| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 |
|---|---|---|---|
| Read throughput | < 70% of max capacity | 70–90% | > 90% sustained |
| Write throughput | < 70% of max capacity | 70–90% | > 90% sustained |

👉 Absolute value depends on storage type:

  • HDD: ~100–200 MB/s
  • SSD: ~500 MB/s – 2 GB/s
  • NVMe: 2–5+ GB/s
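Because the thresholds are percentages of the device's maximum, the rMB/s or wMB/s reading has to be compared against a ceiling you supply yourself. A minimal sketch, assuming a hypothetical helper named throughput_status; the maximum is an input you take from the vendor spec or a benchmark, not something iostat reports.

```shell
#!/bin/bash
# throughput_status is a hypothetical helper, not part of iostat.
# Usage: throughput_status <observed_MB_per_s> <device_max_MB_per_s>
throughput_status() {
    awk -v cur="$1" -v max="$2" 'BEGIN {
        if (max <= 0) { print "unknown max"; exit 1 }
        pct = 100 * cur / max
        if (pct > 90)       st = "CRITICAL"
        else if (pct >= 70) st = "WARNING"
        else                st = "GOOD"
        printf "%.0f%% %s\n", pct, st
    }'
}
```

For example, throughput_status 120 500 prints "24% GOOD". A single sample cannot show that a rate is sustained, so a CRITICAL result should be confirmed over several iostat intervals.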

📊 6. IOPS (r/s, w/s)

| Workload | Typical Healthy Range |
|---|---|
| OLTP (random IO) | 1K – 50K IOPS |
| DW / Analytics | Lower IOPS, higher throughput |

👉 Key rule:

  • High IOPS + low latency = ✅ good
  • High IOPS + high latency = 🚨 bottleneck

📊 7. IO Size (avgrq-sz)

| Value | Meaning | Health |
|---|---|---|
| < 32 KB | Random IO (OLTP) | |
| 64–256 KB | Mixed | |
| ~512 KB – 1 MB | Sequential scan | ⚠️ if causing latency |
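Older sysstat releases report avgrq-sz in 512-byte sectors rather than KB, so the value needs converting before it is compared with the table. A minimal sketch, assuming a hypothetical helper named io_pattern that takes the avgrq-sz value in sectors. Sizes that fall between the bands listed in the table are reported as "between bands" instead of being guessed.

```shell
#!/bin/bash
# io_pattern is a hypothetical helper, not part of iostat.
# Usage: io_pattern <avgrq-sz in 512-byte sectors>
io_pattern() {
    awk -v sectors="$1" 'BEGIN {
        kb = sectors * 512 / 1024            # sectors -> KB
        if (kb < 32)                    p = "random"
        else if (kb >= 64 && kb <= 256) p = "mixed"
        else if (kb >= 512)             p = "sequential"
        else                            p = "between bands"
        printf "%.0f KB %s\n", kb, p
    }'
}
```

For example, io_pattern 16 prints "8 KB random". On newer sysstat releases, which report rareq-sz and wareq-sz directly in KB, the conversion step is not needed.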

🎯 ✅ Quick Decision Matrix

| Condition | Verdict |
|---|---|
| High %util + low await (<5ms) | ✅ Healthy |
| High %util + high await (>50ms) | 🚨 Bottleneck |
| High queue (>10) | 🚨 Overloaded |
| Low util + high await | ⚠️ Storage issue |
| Large IO + high latency | ⚠️ Scan / DW workload |
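The matrix can be sketched as a small lookup. This is only an illustration under stated assumptions: quick_verdict is a hypothetical helper, "high" util is taken as > 90% and "low" util as < 70% (borrowed from the utilization table), the rows are checked in the order shown in the code, and the IO-size row is left out because it needs a fourth input.

```shell
#!/bin/bash
# quick_verdict is a hypothetical helper, not part of iostat.
# Usage: quick_verdict <util_percent> <await_ms> <queue_depth>
quick_verdict() {
    awk -v util="$1" -v await="$2" -v queue="$3" 'BEGIN {
        high_util = (util > 90)   # assumption: "high" means > 90%
        low_util  = (util < 70)   # assumption: "low" means < 70%
        if (high_util && await > 50)      print "BOTTLENECK"
        else if (high_util && await < 5)  print "HEALTHY"
        else if (queue > 10)              print "OVERLOADED"
        else if (low_util && await > 50)  print "STORAGE ISSUE"
        else                              print "NO MATCH"
    }'
}
```

For example, quick_verdict 100 97 5 prints BOTTLENECK and quick_verdict 99 1.5 0.5 prints HEALTHY. Checking the latency rows before the queue row means a busy but fast device is not flagged on queue depth alone; swap the order if you prefer the opposite behaviour.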

📌 ✅ DBA-Focused Interpretation

| Pattern | Root Cause |
|---|---|
| High rMB/s + large avgrq-sz | Full table scans |
| High r/s + small IO | Index access |
| High w_await | Log/write issue |
| High avgqu-sz | Storage saturation |
| High await everywhere | Storage slow |

🔥 ✅ Golden Rules (Use in Production)

✅ Healthy Disk

%util < 80
await < 10 ms
avgqu-sz < 3

⚠️ Warning Zone

%util 80–90
await 10–30 ms
avgqu-sz 3–10

🚨 Critical Disk Bottleneck

%util > 90
await > 50 ms
avgqu-sz > 10
await >> svctm

✅ Example Applied to Sample Data

| Disk | Verdict |
|---|---|
| dm-xx (await ~97 ms, util 100%) | 🚨 Critical |
| dm-xxx (queue 40, await 72 ms) | 🚨 Severe |
| dm-xxx (await 1.5 ms, util 99%) | ✅ Healthy |

Save the following script as disk_health_score.sh:

#!/bin/bash

echo "===== Disk Health Score ====="
date
echo ""

iostat -xm 2 3 | awk '

function score(util, await, queue) {
    s = 100

    # Util penalty
    if (util > 90) s -= 25
    else if (util > 70) s -= 10

    # Await penalty
    if (await > 50) s -= 50
    else if (await > 20) s -= 30
    else if (await > 5) s -= 15

    # Queue penalty
    if (queue > 10) s -= 40
    else if (queue > 5) s -= 20
    else if (queue > 1) s -= 10

    if (s < 0) s = 0
    return s
}

function status(s) {
    if (s >= 80) return "HEALTHY"
    else if (s >= 60) return "WARNING"
    else if (s >= 40) return "DEGRADED"
    else return "CRITICAL"
}

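/^Device/ {
    # Column order differs between sysstat versions, so map header names to positions
    split("", col)
    for (i = 1; i <= NF; i++) col[$i] = i
    # The queue column is avgqu-sz in older releases and aqu-sz in newer ones
    qcol = ("aqu-sz" in col) ? col["aqu-sz"] : col["avgqu-sz"]
    printf "%-10s %-6s %-8s %-8s %-6s\n","Device","Util%","Await","Queue","Status"
    next
}

$1 ~ /^(sd|dm)/ {
    util = $(col["%util"])
    queue = $qcol
    if ("await" in col) {
        await = $(col["await"])
    } else {
        # Newer sysstat reports only r_await and w_await; use the worse of the two
        r = $(col["r_await"]) + 0
        w = $(col["w_await"]) + 0
        await = (r > w) ? r : w
    }

    s = score(util, await, queue)
    st = status(s)

    printf "%-10s %-6.1f %-8.1f %-8.1f %-6s\n",$1,util,await,queue,st
}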
'

Make the script executable and run it:

chmod +x disk_health_score.sh
./disk_health_score.sh

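Sample Output

Device     Util%  Await    Queue    Status
dm-xx      100.0  97.2     24.3     CRITICAL
dm-xxx     99.9   72.4     40.5     CRITICAL
dm-xx      99.9   80.0     7.2      CRITICAL
dm-xxx     99.4   1.5      11.6     CRITICAL

The script penalizes utilization and queue depth as well as latency, so a device with low await can still score CRITICAL when %util is above 90 and the queue is above 10.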
