1. How Availability Numbers Are “Massaged” in SLAs
A 99.9999% figure is usually not measured the way engineers assume.
Vendors almost never measure true end‑to‑end availability.
1.1 The Raw Formula (What Engineers Assume)
Availability = Uptime / (Uptime + Downtime)
For 99.9999%, the downtime budget is:
- 31.5 seconds / year (0.0001% of 31,536,000 seconds)
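The arithmetic behind that budget is easy to check. Here is a minimal sketch, assuming a 365-day year; the helper name `downtime_budget_seconds` is made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a non-leap year


def downtime_budget_seconds(availability_pct: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed in a period for a given availability target."""
    return period_seconds * (1 - availability_pct / 100)


# Print the yearly budget for three-nines through six-nines.
for target in (99.9, 99.99, 99.999, 99.9999):
    budget = downtime_budget_seconds(target)
    print(f"{target}% -> {budget:,.1f} seconds/year")
```

Each extra nine divides the budget by ten, which is why a single event that is harmless at 99.9% can consume most of the budget at 99.9999%.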
1.2 What SLAs Quietly EXCLUDE (Very Important)
Most SLAs exclude downtime caused by:
| Excluded Category | Examples |
|---|---|
| Planned maintenance | PSU patches, GI upgrades |
| Customer actions | Bad SQL, dropped tables |
| Dependency failures | Network, DNS, IAM |
| DR tests | Switchover drills |
| Partial outages | One node down but cluster “up” |
| Performance degradation | Slow ≠ down |
📌 Result:
The SLA uptime looks amazing, while users still experience outages.
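To see how much the exclusions matter, here is a rough sketch that computes availability twice from the same incident list: once counting everything users felt, and once counting only what the SLA counts. The incident durations and category names are hypothetical, chosen only to illustrate the effect.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Categories the SLA does not count (mirrors the exclusion table above).
EXCLUDED = {"planned_maintenance", "customer_action", "dependency",
            "dr_test", "partial_outage", "degradation"}


@dataclass
class Incident:
    category: str
    seconds: float  # time during which users were affected


def availability_pct(downtime_seconds: float) -> float:
    """Availability over one year, as a percentage."""
    return 100 * (1 - downtime_seconds / SECONDS_PER_YEAR)


# Hypothetical year of incidents; the durations are invented for illustration.
incidents = [
    Incident("planned_maintenance", 1800),
    Incident("dependency", 600),
    Incident("partial_outage", 240),
    Incident("unplanned_outage", 20),
]

user_downtime = sum(i.seconds for i in incidents)
sla_downtime = sum(i.seconds for i in incidents if i.category not in EXCLUDED)

print(f"SLA-reported availability:     {availability_pct(sla_downtime):.5f}%")
print(f"User-experienced availability: {availability_pct(user_downtime):.5f}%")
```

With these made-up numbers the SLA view stays above six-nines while the user view drops to roughly four-nines, even though both are computed from the same year.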
2. “Availability of What?” (Classic SLA Trick)
The SLA usually measures:
✅ Database process running
The business measures:
✅ Transaction success
These are not the same.
Example
| Situation | SLA View | User View |
|---|---|---|
| RAC node eviction | DB is UP | Users get errors |
| GC contention | DB UP | App timing out |
| ADG apply lag | Primary UP | Data inconsistent |
| App pool exhaustion | DB UP | System down |
📌 Availability ≠ Usability
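The gap between the two views can be shown with two health checks against the same connection. This is only a sketch: it uses Python's built-in `sqlite3` as a stand-in for a real database, and the `heartbeat` table and both function names are assumptions made for the example.

```python
import sqlite3


def process_is_up(conn) -> bool:
    """SLA-style check: a connection exists, nothing more."""
    return conn is not None


def transaction_succeeds(conn) -> bool:
    """Business-style check: can a write actually be committed?"""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO heartbeat (note) VALUES ('probe')")
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat (id INTEGER PRIMARY KEY, note TEXT)")
print(process_is_up(conn), transaction_succeeds(conn))

# Simulate a fault the process check cannot see: the table the app needs is gone.
conn.execute("DROP TABLE heartbeat")
print(process_is_up(conn), transaction_succeeds(conn))
```

After the simulated fault the process check still passes while the transaction probe fails, which is the same split the table above describes.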
3. Mapping Oracle Events to Downtime Consumption (Realistic)
Let’s assume a 99.9999% target (31.5 sec/year).
3.1 Oracle RAC Events
| Event | Typical Impact | Downtime Budget Burn |
|---|---|---|
| Instance crash | 5–30 sec | 16–95% of yearly budget |
| Node eviction | 20–60 sec | 63–190% of yearly budget |
| CRS restart | 1–3 min | 2–6× yearly budget |
| Cache reconfiguration | Milliseconds–seconds | Daily budget (~86 ms) gone |
✅ RAC improves availability
❌ RAC alone cannot hold six‑nines
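The budget burn for each event can be worked out as a share of the 31.5-second allowance. A small sketch, using the impact ranges from the table above; the function name `budget_burn_pct` is hypothetical:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
ANNUAL_BUDGET = SECONDS_PER_YEAR * (1 - 0.999999)  # about 31.5 seconds

# (low, high) impact in seconds, copied from the RAC events table.
events = {
    "Instance crash": (5, 30),
    "Node eviction": (20, 60),
    "CRS restart": (60, 180),
}


def budget_burn_pct(impact_seconds: float) -> float:
    """Share of the annual six-nines budget consumed by one event."""
    return 100 * impact_seconds / ANNUAL_BUDGET


for name, (low, high) in events.items():
    print(f"{name}: {budget_burn_pct(low):.0f}%-{budget_burn_pct(high):.0f}% "
          f"of the yearly budget")
```

Anything above 100% means a single occurrence of that event breaks the six-nines target for the whole year.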
3.2 Data Guard / FSFO Events
| Event | Time |
|---|---|
| FSFO detection | 5–10 sec |
| Failover execution | 10–30 sec |
| App reconnect | 5–20 sec |
🔴 Total: 20–60 seconds
🔴 A single failover uses roughly 63–190% of the 99.9999% annual allowance
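Turning the question around: if failovers were the only source of downtime, what availability could the system still report? The sketch below sums the phase timings from the table and assumes a 365-day year; `availability_with_failovers` is an illustrative helper, not part of any Oracle tooling.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Phase durations in seconds (low, high), copied from the table above.
phases = {
    "FSFO detection": (5, 10),
    "Failover execution": (10, 30),
    "App reconnect": (5, 20),
}

best_case = sum(low for low, _ in phases.values())
worst_case = sum(high for _, high in phases.values())


def availability_with_failovers(failovers_per_year: int, seconds_each: float) -> float:
    """Best achievable availability (%) if failovers were the only downtime."""
    return 100 * (1 - failovers_per_year * seconds_each / SECONDS_PER_YEAR)


print(f"One failover costs {best_case}-{worst_case} seconds")
for n in (1, 2, 4):
    print(f"{n}/year at worst case: {availability_with_failovers(n, worst_case):.5f}%")
```

One best-case failover still fits inside the six-nines budget; a worst-case one, or a second failover of any length, does not.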
3.3 Planned Events (Usually “Excluded”)
| Activity | Real Impact |
|---|---|
| Rolling patch | Latency spikes |
| Switchover | Session drops |
| Backup I/O | Performance dip |
Yet SLAs say: “No downtime occurred.”
4. Why Six‑Nines+ Stops Being a DB Metric
Once you cross five‑nines, availability is dominated by:
- Application retry logic
- Connection pool behavior
- Graceful error handling
- Client perception
📌 At this level, DB uptime is necessary but insufficient.
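Application retry logic is the piece that most directly hides a short database blip from the user. Here is a minimal sketch of retry with exponential backoff and jitter; `TransientError`, `call_with_retry`, and the delay values are all assumptions for illustration, not a real driver API.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as a dropped session."""


def call_with_retry(operation, attempts=4, base_delay=0.05, max_delay=1.0,
                    sleep=time.sleep):
    """Run operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries


# Usage: an operation that fails twice (as during a failover), then succeeds.
calls = {"count": 0}


def flaky_commit():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("connection lost")
    return "committed"


print(call_with_retry(flaky_commit))
```

Retrying is only safe when the operation is idempotent or the transaction outcome is known, so in practice this kind of wrapper belongs around units of work that can be replayed.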
5. Correct Way to Measure Availability (Mature Orgs)
Instead of raw uptime, elite teams measure:
| Metric | Why It Matters |
|---|---|
| Successful transactions % | Real availability |
| Mean error rate | User impact |
| RTO (seconds) | Recovery speed |
| RPO (zero/near-zero) | Data safety |
| Error‑free deployments | Ops maturity |
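The first three metrics in the table can be derived from a plain log of transaction outcomes. The sketch below assumes a hypothetical `Transaction` record with a timestamp and a success flag; real systems would pull this from application or APM logs.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    timestamp: float  # seconds since the start of the window
    ok: bool


def success_pct(transactions):
    """Successful transactions as a percentage of all attempts."""
    if not transactions:
        raise ValueError("no transactions observed")
    return 100 * sum(t.ok for t in transactions) / len(transactions)


def longest_outage_seconds(transactions):
    """Observed recovery time: longest span from a first failure to the next success.

    An outage still open at the end of the log is not counted.
    """
    longest, outage_start = 0.0, None
    for t in sorted(transactions, key=lambda t: t.timestamp):
        if not t.ok and outage_start is None:
            outage_start = t.timestamp
        elif t.ok and outage_start is not None:
            longest = max(longest, t.timestamp - outage_start)
            outage_start = None
    return longest


# Hypothetical sample: one transaction per second, failing from t=40 through t=64.
log = [Transaction(float(s), not (40 <= s < 65)) for s in range(100)]
print(f"Success: {success_pct(log):.1f}%  Error rate: {100 - success_pct(log):.1f}%")
print(f"Observed recovery time: {longest_outage_seconds(log):.0f} s")
```

Measured this way, availability reflects what callers actually experienced, regardless of whether the database process was reported as up during the failure window.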
6. Architect‑Grade Statement (Use This)
You can safely say in reviews or audits:
“Availability percentages above five‑nines are typically achieved by excluding planned maintenance and partial failures. For stateful databases like Oracle, true end‑to‑end availability should be measured using transaction success and recovery objectives rather than SLA uptime alone.”
7. Executive Translation (Very Powerful)
“The system may technically be ‘up’, but availability is defined by whether customers can complete transactions without errors.”
8. Final Mental Model (Remember This)
99.9% → Infrastructure resilience
99.99% → Platform resilience
99.999% → Automation maturity
99.9999%+ → Application experience