✅ Disk Performance Baseline Table (iostat -xm)
📊 1. Latency (Most Important)
| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 | Notes |
|---|---|---|---|---|
| await (ms) | < 5 | 5 – 20 | > 50 | Total latency (queue + service) |
| r_await | < 5 | 5 – 20 | > 50 | Read latency |
| w_await | < 5 | 5 – 20 | > 50 | Write latency |
📊 2. Disk Utilization
| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 | Notes |
|---|---|---|---|---|
| %util | < 70% | 70–90% | > 90% | High alone is OK if latency is low |
📊 3. Queue Depth (Pressure Indicator)
| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 | Notes |
|---|---|---|---|---|
| avgqu-sz | < 1 | 1 – 5 | > 10 | Average number of requests waiting to be served (shown as aqu-sz in newer sysstat versions) |
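Queue depth, IOPS and latency are linked by Little's law: average requests in flight ≈ completions per second × average time per request. A rough sketch of that cross-check follows; the helper name `expected_queue` and the sample numbers are hypothetical, and the result only approximates what iostat reports.

```shell
#!/bin/bash
# expected_queue IOPS AWAIT_MS
# Hypothetical helper: estimates the average queue depth as IOPS * await,
# converting await from milliseconds to seconds first.
expected_queue() {
    awk -v iops="$1" -v await="$2" 'BEGIN { printf "%.1f\n", iops * await / 1000 }'
}

expected_queue 2000 5     # 2000 IOPS at 5 ms
expected_queue 400 1.5    # 400 IOPS at 1.5 ms
```

At 2000 IOPS and 5 ms the estimate is about 10 requests in flight, right at the edge of the critical band in the table, even though 5 ms of latency looks harmless on its own.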
📊 4. Service Time vs Wait Time
| Pattern | Interpretation |
|---|---|
| await ≈ svctm | ✅ Healthy (no queueing) |
| await >> svctm | 🚨 Queue bottleneck |
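The gap between await and svctm is the time a request spends queued. A minimal sketch of that calculation, assuming an older sysstat that still prints svctm (the iostat man page warns that the field is unreliable, and recent releases no longer show it); `queue_share` is a hypothetical helper and the sample values are illustrative:

```shell
#!/bin/bash
# queue_share AWAIT_MS SVCTM_MS
# Hypothetical helper: prints the percentage of await spent waiting in the queue.
queue_share() {
    awk -v a="$1" -v s="$2" 'BEGIN {
        if (a <= 0) { print "0.0"; exit }
        q = a - s
        if (q < 0) q = 0          # guard against rounding noise in the inputs
        printf "%.1f\n", 100 * q / a
    }'
}

queue_share 97.2 4.0   # most of the latency is queue time
queue_share 10 5       # half queue time, half service time
```

A share close to 0 matches the healthy row above, while a share close to 100 matches the queue bottleneck row.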
📊 5. Throughput (rMB/s, wMB/s)
For modern systems (SSD / SAN / NVMe)
| Metric | Good ✅ | Warning ⚠️ | Critical 🚨 |
|---|---|---|---|
| Read throughput | < 70% of max capacity | 70–90% | > 90% sustained |
| Write throughput | < 70% of max capacity | 70–90% | > 90% sustained |
👉 Absolute value depends on storage type:
- HDD: ~100–200 MB/s
- SSD: ~500 MB/s – 2 GB/s
- NVMe: 2–5+ GB/s
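Because the thresholds are relative, the observed rMB/s or wMB/s has to be compared with a maximum you have measured or taken from the device specification. A small sketch under that assumption; `throughput_zone` is a hypothetical helper and the maximum values are placeholders, not benchmarks:

```shell
#!/bin/bash
# throughput_zone CURRENT_MBS MAX_MBS
# Hypothetical helper: prints the share of maximum throughput in use and
# the matching band from the table (below 70, 70 to 90, above 90 percent).
throughput_zone() {
    awk -v cur="$1" -v max="$2" 'BEGIN {
        if (max <= 0) { print "UNKNOWN"; exit }
        pct = 100 * cur / max
        if (pct > 90) zone = "CRITICAL"
        else if (pct >= 70) zone = "WARNING"
        else zone = "GOOD"
        printf "%.0f%% %s\n", pct, zone
    }'
}

throughput_zone 120 150     # HDD-class device with an assumed 150 MB/s ceiling
throughput_zone 400 2000    # SSD-class device with an assumed 2 GB/s ceiling
```

The same 120 MB/s that puts a single HDD in the warning band would be negligible on an NVMe device, which is why the table avoids absolute numbers.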
📊 6. IOPS (r/s, w/s)
| Workload | Typical Healthy Range |
|---|---|
| OLTP (random IO) | 1K – 50K IOPS |
| DW / Analytics | Lower IOPS, higher throughput |
👉 Key rule:
- High IOPS + low latency = ✅ good
- High IOPS + high latency = 🚨 bottleneck
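📊 7. IO Size (avgrq-sz)
👉 avgrq-sz is reported in 512-byte sectors, so divide it by 2 to get KB; newer sysstat versions report rareq-sz and wareq-sz directly in KB instead.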
| Value | Meaning | Health |
|---|---|---|
| < 32 KB | Random IO (OLTP) | ✅ |
| 64–256 KB | Mixed | ✅ |
| ~512 KB – 1 MB | Sequential scan | ⚠️ if causing latency |
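Converting the raw value and mapping it to the bands above can be sketched as follows. This assumes avgrq-sz comes from an older sysstat that reports 512-byte sectors; the helper names are hypothetical, and treating the gaps between the table's bands (32–64 KB and 256–512 KB) as "mixed" is a choice made here, not something the table specifies.

```shell
#!/bin/bash
# sectors_to_kb SECTORS
# Hypothetical helper: converts a size in 512-byte sectors to KB.
sectors_to_kb() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s * 512 / 1024 }'
}

# io_pattern KB
# Hypothetical helper: maps an average request size in KB to a workload pattern.
io_pattern() {
    awk -v kb="$1" 'BEGIN {
        if (kb < 32) print "random"
        else if (kb >= 512) print "sequential"
        else print "mixed"
    }'
}

kb=$(sectors_to_kb 16)            # avgrq-sz of 16 sectors
echo "$kb KB: $(io_pattern "$kb")"
kb=$(sectors_to_kb 1024)          # avgrq-sz of 1024 sectors
echo "$kb KB: $(io_pattern "$kb")"
```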
🎯 ✅ Quick Decision Matrix
| Condition | Verdict |
|---|---|
| High %util + low await (<5ms) | ✅ Healthy |
| High %util + high await (>50ms) | 🚨 Bottleneck |
| High queue (>10) | 🚨 Overloaded |
| Low util + high await | ⚠️ Storage issue |
| Large IO + high latency | ⚠️ Scan / DW workload |
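The matrix can be read as an ordered series of checks. A minimal sketch, assuming "high %util" means above 90, "low util" means below 70, and the check order shown (the table fixes none of these); `verdict` is a hypothetical helper, and the IO-size row is left out because it needs a fourth input:

```shell
#!/bin/bash
# verdict UTIL AWAIT_MS QUEUE
# Hypothetical helper: returns the first matching row of the decision matrix.
verdict() {
    awk -v u="$1" -v a="$2" -v q="$3" 'BEGIN {
        if (q > 10)                 print "OVERLOADED"
        else if (u > 90 && a > 50)  print "BOTTLENECK"
        else if (u > 90 && a < 5)   print "HEALTHY"
        else if (u < 70 && a > 50)  print "STORAGE ISSUE"
        else                        print "INCONCLUSIVE"
    }'
}

verdict 99 1.5 0.8    # busy but fast
verdict 100 97 4      # busy and slow
verdict 30 80 2       # idle but slow
verdict 99 72 40      # deep queue
```

Anything that falls between the rows is reported as INCONCLUSIVE rather than forced into a verdict, since the matrix only covers the clear-cut combinations.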
📌 ✅ DBA-Focused Interpretation
| Pattern | Root Cause |
|---|---|
| High rMB/s + large avgrq-sz | Full table scans |
| High r/s + small IO | Index access |
| High w_await | Log/write issue |
| High avgqu-sz | Storage saturation |
| High await everywhere | Storage slow |
🔥 ✅ Golden Rules (Use in Production)
✅ Healthy Disk
- %util < 80
- await < 10 ms
- avgqu-sz < 3
⚠️ Warning Zone
- %util 80–90
- await 10–30 ms
- avgqu-sz 3–10
🚨 Critical Disk Bottleneck
- %util > 90
- await > 50 ms
- avgqu-sz > 10
- await >> svctm
✅ ✅ Example Applied to Your Earlier Data
| Disk | Verdict |
|---|---|
| dm-xx (await ~97 ms, util 100%) | 🚨 Critical |
| dm-xxx (queue 40, await 72 ms) | 🚨 Severe |
| dm-xxx (await 1.5 ms, util 99%) | ✅ Healthy |
Save as disk_health_score.sh
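    #!/bin/bash
    # Score each disk from extended iostat statistics (utilization, latency, queue depth).
    echo "===== Disk Health Score ====="
    date
    echo ""
    # 3 reports, 2 seconds apart; the first report shows averages since boot.
    iostat -xm 2 3 | awk '
    # Start at 100 and subtract a penalty per metric; s is a local variable.
    function score(util, await, queue,    s) {
        s = 100
        # Util penalty
        if (util > 90) s -= 25
        else if (util > 70) s -= 10
        # Await penalty
        if (await > 50) s -= 50
        else if (await > 20) s -= 30
        else if (await > 5) s -= 15
        # Queue penalty
        if (queue > 10) s -= 40
        else if (queue > 5) s -= 20
        else if (queue > 1) s -= 10
        if (s < 0) s = 0
        return s
    }
    function status(s) {
        if (s >= 80) return "HEALTHY"
        else if (s >= 60) return "WARNING"
        else if (s >= 40) return "DEGRADED"
        else return "CRITICAL"
    }
    # Header line: map column names to positions, because the column
    # order differs between sysstat versions.
    /^Device/ {
        for (i = 1; i <= NF; i++) col[$i] = i
        printf "%-10s %-6s %-8s %-8s %-6s\n","Device","Util%","Await","Queue","Status"
        next
    }
    $1 ~ /^(sd|dm|nvme)/ {
        util = $(col["%util"]) + 0
        # Queue depth is avgqu-sz in older sysstat and aqu-sz in newer releases.
        queue = (("aqu-sz" in col) ? $(col["aqu-sz"]) : $(col["avgqu-sz"])) + 0
        # Newer releases have no combined await column; use the worse of r_await and w_await.
        if ("await" in col) await = $(col["await"]) + 0
        else {
            await = $(col["r_await"]) + 0
            if ($(col["w_await"]) + 0 > await) await = $(col["w_await"]) + 0
        }
        s = score(util, await, queue)
        st = status(s)
        printf "%-10s %-6.1f %-8.1f %-8.1f %-6s\n",$1,util,await,queue,st
    }
    '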
Make the script executable and run it:

    chmod +x disk_health_score.sh
    ./disk_health_score.sh
Sample Output

    Device     Util%  Await    Queue    Status
    dm-xx      100.0  97.2     24.3     CRITICAL
    dm-xxx     99.9   72.4     40.5     CRITICAL
    dm-xx      99.9   80.0     7.2      CRITICAL
    dm-xxx     99.4   1.5      11.6     CRITICAL
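To see how the penalties add up, the scoring rules can be replayed on fixed numbers without running iostat. A small sketch, with `score_row` as a hypothetical wrapper around the same thresholds the script uses:

```shell
#!/bin/bash
# score_row UTIL AWAIT_MS QUEUE
# Hypothetical wrapper: applies the util, await and queue penalties to one
# set of values and prints "<score> <status>".
score_row() {
    awk -v util="$1" -v await="$2" -v queue="$3" 'BEGIN {
        s = 100
        if (util > 90) s -= 25
        else if (util > 70) s -= 10
        if (await > 50) s -= 50
        else if (await > 20) s -= 30
        else if (await > 5) s -= 15
        if (queue > 10) s -= 40
        else if (queue > 5) s -= 20
        else if (queue > 1) s -= 10
        if (s < 0) s = 0
        if (s >= 80) st = "HEALTHY"
        else if (s >= 60) st = "WARNING"
        else if (s >= 40) st = "DEGRADED"
        else st = "CRITICAL"
        print s, st
    }'
}

score_row 50 3 0.5        # no penalties
score_row 85 12 2         # 100 - 10 - 15 - 10
score_row 95 25 0.5       # 100 - 25 - 30
score_row 99.9 80.0 7.2   # 100 - 25 - 50 - 20
score_row 99.4 1.5 11.6   # 100 - 25 - 40
```

The last call shows a side effect of the weighting: a disk with 1.5 ms latency still scores 35 because the utilization and queue penalties together cost 65 points, so the score is stricter than the decision matrix for busy but fast devices.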