Wednesday, April 15, 2026

How Availability Numbers Are “Massaged” in SLAs?

 

1. How Availability Numbers Are “Massaged” in SLAs

99.9999% is usually not measured the way engineers think it is.

Vendors almost never measure true end‑to‑end availability.


1.1 The Raw Formula (What Engineers Assume)

$$\text{Availability} = \frac{\text{Total Time} - \text{Downtime}}{\text{Total Time}}$$

For 99.9999%, downtime budget:

  • 31.5 seconds / year
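The budget figure follows directly from the formula. A minimal sketch, assuming a 365-day year; the function name `downtime_budget_seconds` is illustrative and not from any library:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year, an assumption)


def downtime_budget_seconds(availability_pct: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in seconds for a given availability percentage."""
    return period_seconds * (1 - availability_pct / 100)


# Print the yearly downtime allowance for three to six nines.
for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {downtime_budget_seconds(pct):,.1f} sec/year")
```

Each extra nine cuts the allowance by a factor of ten, which is why six nines leaves only about half a minute per year.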

1.2 What SLAs Quietly EXCLUDE (Very Important)

Most SLAs exclude downtime caused by:

| Excluded Category | Examples |
| --- | --- |
| Planned maintenance | PSU patches, GI upgrades |
| Customer actions | Bad SQL, dropped tables |
| Dependency failures | Network, DNS, IAM |
| DR tests | Switchover drills |
| Partial outages | One node down but cluster “up” |
| Performance degradation | Slow ≠ down |

📌 Result:
The SLA uptime looks amazing, while users still experience outages.
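To see how far exclusions can move the number, here is a small sketch that computes availability twice from the same outage log: once counting only non-excluded causes (the SLA view) and once counting everything (the user view). The cause labels, the exclusion set, and the outage durations are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Hypothetical labels mirroring the exclusion categories listed above.
EXCLUDED_CAUSES = {
    "planned_maintenance", "customer_action", "dependency",
    "dr_test", "partial_outage", "degradation",
}


@dataclass
class Outage:
    cause: str
    seconds: float


def availability_pct(downtime_seconds: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    return 100 * (period_seconds - downtime_seconds) / period_seconds


def sla_and_user_availability(outages):
    """Return (SLA-reported %, user-experienced %) for one year of outages."""
    total = sum(o.seconds for o in outages)
    counted = sum(o.seconds for o in outages if o.cause not in EXCLUDED_CAUSES)
    return availability_pct(counted), availability_pct(total)


# Made-up outage log: only the instance crash counts against the SLA.
outages = [
    Outage("planned_maintenance", 1800),
    Outage("dependency", 600),
    Outage("instance_crash", 20),
]
sla, user = sla_and_user_availability(outages)
print(f"SLA view:  {sla:.5f}%")
print(f"User view: {user:.5f}%")
```

With this made-up log the SLA view still clears six nines, while the user view does not even reach 99.995%.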


2. “Availability of What?” (Classic SLA Trick)

SLA usually measures:

✅ Database process running

Business measures:

Transaction success

These are not the same.


Example

| Situation | SLA View | User View |
| --- | --- | --- |
| RAC node eviction | DB is UP | Users get errors |
| GC contention | DB UP | App timing out |
| ADG apply lag | Primary UP | Data inconsistent |
| App pool exhaustion | DB UP | System down |

📌 Availability ≠ Usability


3. Mapping Oracle Events to Downtime Consumption (Realistic)

Let’s assume a 99.9999% target (31.5 sec/year).


3.1 Oracle RAC Events

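| Event | Typical Impact | Downtime Budget Burn |
| --- | --- | --- |
| Instance crash | 5–30 sec | Up to ~95% of the yearly budget in one event |
| Node eviction | 20–60 sec | Yearly budget mostly or fully gone; SLA violated above 31.5 sec |
| CRS restart | 1–3 min | SLA blown (roughly 2–6× the yearly budget) |
| Cache reconfiguration | Milliseconds–seconds | Daily budget (≈86 ms) gone |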

✅ RAC improves availability
❌ RAC alone cannot hold six‑nines
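The budget burn per event is simple arithmetic. A rough sketch, using the typical impact ranges quoted in this section and assuming a 365-day year; `burn_pct` is an illustrative helper name:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SIX_NINES_BUDGET = SECONDS_PER_YEAR * 1e-6  # about 31.5 sec/year

# (low, high) impact in seconds, taken from the ranges in this section.
RAC_EVENTS = {
    "Instance crash": (5, 30),
    "Node eviction": (20, 60),
    "CRS restart": (60, 180),
}


def burn_pct(seconds: float, budget: float = SIX_NINES_BUDGET) -> float:
    """Share of the yearly downtime budget consumed by a single event."""
    return 100 * seconds / budget


for name, (low, high) in RAC_EVENTS.items():
    print(f"{name}: {burn_pct(low):.0f}% to {burn_pct(high):.0f}% of the yearly budget")
```

Anything above 100% means one occurrence alone breaks a six-nines year.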


3.2 Data Guard / FSFO Events

| Event | Time |
| --- | --- |
| FSFO detection | 5–10 sec |
| Failover execution | 10–30 sec |
| App reconnect | 5–20 sec |

🔴 Total: 20–60 seconds
🔴 A single failover consumes most of the 99.9999% annual allowance (31.5 sec), and often exceeds it
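The total is just the sum of the phase ranges. A short sketch that adds them up and checks how many failovers fit into a six-nines year; the phase durations are the ones listed above, and the helper names are illustrative:

```python
SIX_NINES_BUDGET_SECONDS = 31.536  # 365-day year * 1e-6

# (best case, worst case) in seconds for each failover phase.
FSFO_PHASES = {
    "FSFO detection": (5, 10),
    "Failover execution": (10, 30),
    "App reconnect": (5, 20),
}


def total_range(phases):
    """Sum the best-case and worst-case durations across all phases."""
    lows, highs = zip(*phases.values())
    return sum(lows), sum(highs)


def failovers_within_budget(duration_seconds: float,
                            budget: float = SIX_NINES_BUDGET_SECONDS) -> int:
    """How many failovers of this length fit into the yearly budget."""
    return int(budget // duration_seconds)


best, worst = total_range(FSFO_PHASES)
print(f"Failover takes {best} to {worst} sec end to end")
print(f"Best case:  {failovers_within_budget(best)} failover(s) per year fit the budget")
print(f"Worst case: {failovers_within_budget(worst)} failover(s) per year fit the budget")
```

Even in the best case there is room for only one failover per year, with nothing left over for any other incident.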


3.3 Planned Events (Usually “Excluded”)

| Activity | Real Impact |
| --- | --- |
| Rolling patch | Latency spikes |
| Switchover | Session drops |
| Backup I/O | Performance dip |

Yet SLAs say: “No downtime occurred.”


4. Why Six‑Nines+ Stops Being a DB Metric

Once you cross five‑nines, availability is dominated by:

  • Application retry logic
  • Connection pool behavior
  • Graceful error handling
  • Client perception

📌 At this level, DB uptime is necessary but insufficient.
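As an illustration of the first item, application retry logic, here is a generic retry wrapper with exponential backoff and jitter. It is a sketch under stated assumptions, not tied to any Oracle driver: `TransientError` is a hypothetical exception standing in for whatever error the client library raises when a session drops, and the operation being retried is assumed to be safe to repeat.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a dropped session)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1,
                    max_delay=2.0, sleep=time.sleep):
    """Run operation(), retrying on TransientError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not reconnect at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Demo: an operation that fails twice (as during a failover), then succeeds.
attempts = {"n": 0}


def flaky_commit():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("session dropped during failover")
    return "committed"


# The no-op sleep keeps the demo instant; real code would use the default.
result = call_with_retry(flaky_commit, sleep=lambda _seconds: None)
print(result, "after", attempts["n"], "attempts")
```

From the user's point of view the two failed attempts never happened, which is how a brief database outage can stay invisible at the application level.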


5. Correct Way to Measure Availability (Mature Orgs)

Instead of raw uptime, elite teams measure:

| Metric | Why It Matters |
| --- | --- |
| Successful transactions % | Real availability |
| Mean error rate | User impact |
| RTO (seconds) | Recovery speed |
| RPO (zero/near-zero) | Data safety |
| Error‑free deployments | Ops maturity |
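The first two metrics can be derived from ordinary request counters. A minimal sketch, assuming the application already records attempted and failed transactions per time window; the `Window` type and the sample numbers are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Transaction counters for one measurement window (e.g. one hour)."""
    attempted: int
    failed: int


def transaction_success_pct(windows) -> float:
    """Successful transactions as a percentage of all attempts."""
    attempted = sum(w.attempted for w in windows)
    failed = sum(w.failed for w in windows)
    if attempted == 0:
        raise ValueError("no transactions attempted")
    return 100 * (attempted - failed) / attempted


def mean_error_rate(windows) -> float:
    """Average per-window error rate, skipping windows with no traffic."""
    rates = [w.failed / w.attempted for w in windows if w.attempted]
    if not rates:
        raise ValueError("no windows with traffic")
    return sum(rates) / len(rates)


# Made-up sample: one window with errors between two clean ones.
windows = [Window(1_000_000, 0), Window(1_000_000, 150), Window(500_000, 0)]
print(f"Successful transactions: {transaction_success_pct(windows):.4f}%")
print(f"Mean error rate:         {mean_error_rate(windows):.6f}")
```

Unlike process uptime, these numbers drop whenever users see errors, regardless of whether the cause would have been excluded from the SLA.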

6. Architect‑Grade Statement (Use This)

You can safely say in reviews or audits:

“Availability percentages above five‑nines are typically achieved by excluding planned maintenance and partial failures. For stateful databases like Oracle, true end‑to‑end availability should be measured using transaction success and recovery objectives rather than SLA uptime alone.”


7. Executive Translation (Very Powerful)

“The system may technically be ‘up’, but availability is defined by whether customers can complete transactions without errors.”


8. Final Mental Model (Remember This)

99.9%     → Infrastructure resilience
99.99%    → Platform resilience
99.999%   → Automation maturity
99.9999%+ → Application experience
