Wednesday, April 15, 2026

How Availability Numbers Are “Massaged” in SLAs?

1. How Availability Numbers Are “Massaged” in SLAs

99.9999% is usually not measured the way engineers think it is.

Vendors almost never measure true end‑to‑end availability.

1.1 The Raw Formula (What Engineers Assume)

$$\text{Availability} = \frac{\text{Total Time} - \text{Downtime}}{\text{Total Time}}$$

For 99.9999%, the downtime budget is one millionth of the year:

31,536,000 seconds × 0.000001 ≈ 31.5 seconds / year
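The same arithmetic applies to any availability target. Here is a minimal sketch, assuming a 365-day year; the function name `downtime_budget_seconds` is illustrative and not part of any SLA tooling.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year, an assumption)


def downtime_budget_seconds(availability_pct: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed in the period for a given availability target."""
    return period_seconds * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 99.9999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% -> {budget:,.1f} seconds of downtime per year")
```

Passing `period_seconds=86_400` gives the daily budget instead, which for six-nines is well under a tenth of a second.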

1.2 What SLAs Quietly EXCLUDE (Very Important)

Most SLAs exclude downtime caused by:

Excluded Category	Examples
Planned maintenance	PSU patches, GI upgrades
Customer actions	Bad SQL, dropped tables
Dependency failures	Network, DNS, IAM
DR tests	Switchover drills
Partial outages	One node down but cluster “up”
Performance degradation	Slow ≠ down

📌 Result:
The SLA uptime looks amazing, while users still experience outages.
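To see how much exclusions can move the number, here is a rough sketch that computes availability twice over the same incidents: once counting only SLA-eligible downtime, once counting everything users felt. The incident list and durations are invented for illustration, and the names `Outage`, `sla_reported_pct`, and `user_experienced_pct` are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


@dataclass
class Outage:
    cause: str
    seconds: float
    excluded_by_sla: bool  # True if the SLA contract does not count it


def availability_pct(downtime_seconds: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    return 100 * (period_seconds - downtime_seconds) / period_seconds


def sla_reported_pct(outages: list[Outage]) -> float:
    """Availability counting only the downtime the SLA admits to."""
    counted = sum(o.seconds for o in outages if not o.excluded_by_sla)
    return availability_pct(counted)


def user_experienced_pct(outages: list[Outage]) -> float:
    """Availability counting every second users could not work."""
    return availability_pct(sum(o.seconds for o in outages))


# Hypothetical year of incidents (durations are made up).
outages = [
    Outage("planned maintenance", 1800, excluded_by_sla=True),
    Outage("DNS failure", 600, excluded_by_sla=True),
    Outage("instance crash", 20, excluded_by_sla=False),
]

if __name__ == "__main__":
    print(f"SLA view:  {sla_reported_pct(outages):.5f}%")
    print(f"User view: {user_experienced_pct(outages):.5f}%")
```

With these made-up numbers the SLA view stays above six-nines while the user view falls below five-nines, even though both describe the same year.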

2. “Availability of What?” (Classic SLA Trick)

The SLA usually measures:

✅ Database process running

The business measures:

✅ Transaction success

These are not the same.

Example

Situation	SLA View	User View
RAC node eviction	DB is UP	Users get errors
GC contention	DB UP	App timing out
ADG apply lag	Primary UP	Data inconsistent
App pool exhaustion	DB UP	System down

📌 Availability ≠ Usability
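The gap between "process running" and "transaction success" can be shown with two probes against the same connection. This is only a sketch: it uses Python's built-in `sqlite3` as a stand-in for a real database, and the `heartbeat` table and both probe functions are hypothetical names chosen for the example.

```python
import sqlite3


def process_probe(conn: sqlite3.Connection) -> bool:
    """SLA-style check: does the database answer at all?"""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.Error:
        return False


def transaction_probe(conn: sqlite3.Connection) -> bool:
    """Business-style check: can a write be committed and read back?"""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO heartbeat (note) VALUES ('probe')")
        row = conn.execute("SELECT COUNT(*) FROM heartbeat").fetchone()
        return row[0] > 0
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # The heartbeat table is missing on purpose, simulating a broken
    # application path while the database itself is reachable.
    print("process up:     ", process_probe(conn))
    print("transaction ok: ", transaction_probe(conn))

    conn.execute("CREATE TABLE heartbeat (note TEXT)")
    print("after repair:   ", transaction_probe(conn))
```

A monitor built only on the first probe would report the system as up for the whole run; the second probe is closer to what a user would have seen.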

3. Mapping Oracle Events to Downtime Consumption (Realistic)

Let’s assume a 99.9999% target (31.5 sec/year).

3.1 Oracle RAC Events

Event	Typical Impact	Downtime Budget Burn
Instance crash	5–30 sec	~16–95% of the yearly budget
Node eviction	20–60 sec	~63–190% of the yearly budget; SLA likely violated
CRS restart	1–3 min	~190–570% of the yearly budget; SLA blown
Cache reconfiguration	Milliseconds–seconds	Daily budget (~86 ms) gone

✅ RAC improves availability
❌ RAC alone cannot hold six‑nines
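The budget burn for each event can be estimated by dividing its duration by the 31.5-second yearly budget. The sketch below does that for the duration ranges in the table; the helper name `budget_burn_pct` is illustrative, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BUDGET_SECONDS = SECONDS_PER_YEAR * 1e-6  # about 31.5 s for 99.9999%

# (low, high) durations in seconds, taken from the table above.
events = {
    "Instance crash": (5, 30),
    "Node eviction": (20, 60),
    "CRS restart": (60, 180),
}


def budget_burn_pct(duration_seconds: float,
                    budget_seconds: float = BUDGET_SECONDS) -> float:
    """Share of the yearly downtime budget consumed by one event."""
    return 100 * duration_seconds / budget_seconds


if __name__ == "__main__":
    for name, (low, high) in events.items():
        print(f"{name}: {budget_burn_pct(low):.0f}% to "
              f"{budget_burn_pct(high):.0f}% of the yearly budget")
```

Anything above 100% means a single occurrence of that event is enough to miss six-nines for the year.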

3.2 Data Guard / FSFO Events

Event	Time
FSFO detection	5–10 sec
Failover execution	10–30 sec
App reconnect	5–20 sec

🔴 Total: 20–60 seconds
🔴 A single failover consumes most of, or exceeds, the 99.9999% annual allowance (31.5 sec)
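Working backwards from the phase timings gives the best availability a year can reach once a failover has happened. A small sketch, assuming the phases run one after another and nothing else fails that year; `total_range` and `achieved_availability_pct` are hypothetical helper names.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# (best case, worst case) in seconds, taken from the table above.
phases = {
    "FSFO detection": (5, 10),
    "Failover execution": (10, 30),
    "App reconnect": (5, 20),
}


def total_range(phase_table: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case durations across sequential phases."""
    best = sum(low for low, _ in phase_table.values())
    worst = sum(high for _, high in phase_table.values())
    return best, worst


def achieved_availability_pct(downtime_seconds: float,
                              failovers_per_year: int = 1) -> float:
    """Yearly availability if failovers are the only downtime."""
    total = downtime_seconds * failovers_per_year
    return 100 * (1 - total / SECONDS_PER_YEAR)


if __name__ == "__main__":
    best, worst = total_range(phases)
    print(f"One failover costs {best}-{worst} seconds")
    print(f"Best case:  {achieved_availability_pct(best):.5f}%")
    print(f"Worst case: {achieved_availability_pct(worst):.5f}%")
    print(f"Two best-case failovers: {achieved_availability_pct(best, 2):.5f}%")
```

Under these assumptions, one best-case failover still fits inside six-nines, but a worst-case failover or a second failover in the same year does not.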

3.3 Planned Events (Usually “Excluded”)

Activity	Real Impact
Rolling patch	Latency spikes
Switchover	Session drops
Backup I/O	Performance dip

Yet SLAs say: “No downtime occurred.”

4. Why Six‑Nines+ Stops Being a DB Metric

Once you cross five‑nines, availability is dominated by:

Application retry logic
Connection pool behavior
Graceful error handling
Client perception

📌 At this level, DB uptime is necessary but insufficient.
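As one illustration of application retry logic, the sketch below retries a failing operation with exponential backoff and jitter so that a short database blip is absorbed instead of surfacing as a user error. `TransientError` and `call_with_retry` are hypothetical names; a real application would map its driver's retryable errors onto something similar and retry only operations that are safe to repeat.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a dropped connection."""


def call_with_retry(operation, max_attempts=4, base_delay=0.05,
                    max_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying on TransientError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle it gracefully
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not reconnect at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_commit():
        # Fails twice (e.g. during a failover window), then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("connection reset")
        return "committed"

    print(call_with_retry(flaky_commit), "after", attempts["count"], "attempts")
```

From the user's side the transaction simply succeeded a little late, which is why client behavior starts to matter more than raw database uptime at this level.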

5. Correct Way to Measure Availability (Mature Orgs)

Instead of raw uptime, elite teams measure:

Metric	Why It Matters
Successful transactions %	Real availability
Mean error rate	User impact
RTO (seconds)	Recovery speed
RPO (zero/near-zero)	Data safety
Error‑free deployments	Ops maturity
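Two of these metrics can be derived directly from a log of transaction outcomes. The sketch below is one possible approach, not a standard definition: it computes the successful-transaction percentage and uses the longest run of failures as a rough, observed proxy for recovery time. The `Txn` record and the sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    timestamp: float  # seconds since the start of the observation window
    ok: bool


def success_rate_pct(txns: list[Txn]) -> float:
    """Percentage of transactions that completed without error."""
    if not txns:
        raise ValueError("no transactions observed")
    return 100 * sum(t.ok for t in txns) / len(txns)


def longest_failure_window(txns: list[Txn]) -> float:
    """Longest span from a first failure to the next success, in seconds."""
    longest = 0.0
    failure_start = None
    last_ts = None
    for txn in sorted(txns, key=lambda t: t.timestamp):
        if not txn.ok and failure_start is None:
            failure_start = txn.timestamp
        elif txn.ok and failure_start is not None:
            longest = max(longest, txn.timestamp - failure_start)
            failure_start = None
        last_ts = txn.timestamp
    if failure_start is not None:  # still failing at the end of the log
        longest = max(longest, last_ts - failure_start)
    return longest


# Hypothetical log: one 3-second failure window, then a 1-second blip.
sample = [
    Txn(0, True), Txn(1, True), Txn(2, False), Txn(3, False), Txn(4, False),
    Txn(5, True), Txn(6, True), Txn(7, False), Txn(8, True),
]

if __name__ == "__main__":
    print(f"Successful transactions: {success_rate_pct(sample):.2f}%")
    print(f"Longest failure window:  {longest_failure_window(sample):.0f} s")
```

Because both numbers come from what clients actually experienced, they are unaffected by SLA exclusions for planned maintenance or partial outages.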

6. Architect‑Grade Statement (Use This)

You can safely say in reviews or audits:

“Availability percentages above five‑nines are typically achieved by excluding planned maintenance and partial failures. For stateful databases like Oracle, true end‑to‑end availability should be measured using transaction success and recovery objectives rather than SLA uptime alone.”

7. Executive Translation (Very Powerful)

“The system may technically be ‘up’, but availability is defined by whether customers can complete transactions without errors.”

8. Final Mental Model (Remember This)

99.9%     → Infrastructure resilience
99.99%    → Platform resilience
99.999%   → Automation maturity
99.9999%+ → Application experience

ORACLE DATABASE PROBLEM AND SOLUTIONS