Wednesday, May 27, 2026

Oracle/OS huge page help to increase IO - Yes or No ?

 

Short Answer

No — HugePages do not directly increase disk I/O throughput or IOPS.

What they do improve is memory-management efficiency for Oracle’s SGA, which can indirectly improve overall database performance in workloads where CPU overhead, memory translation, and buffer cache efficiency matter.

So the precise answer is:

  • Direct disk I/O improvement: No
  • Indirect database performance improvement: Yes, sometimes
  • Can query response time improve? Yes
  • Can storage subsystem become faster? No

Why the Answer Is “No”

When people say “I/O” in Oracle, they often mix up two different things:

1. Disk I/O

This means:

  • reading from storage
  • writing to storage
  • IOPS
  • throughput (MB/s)
  • latency from disk/ASM/SAN/NVMe

HugePages do not make:

  • disks spin faster
  • SSDs respond faster
  • SAN latency lower
  • ASM perform more physical reads/sec by itself

So if your system is bottlenecked on:

  • slow storage,
  • high read latency,
  • redo write latency,
  • log file sync due to storage,
  • db file sequential/scattered read delays,

then HugePages will not fix that bottleneck.


Then Why Do People Say HugePages Help Performance?

Because Oracle uses HugePages mainly for the SGA (System Global Area), and that changes how efficiently memory is managed, not how storage hardware performs.

HugePages help by reducing:

  • number of memory pages
  • page table overhead
  • TLB misses
  • kernel memory management overhead
  • swapping risk for SGA

This can reduce CPU consumption and improve scalability, especially for large SGAs.

That means:

  • buffer cache access becomes cheaper from a CPU/memory-translation perspective
  • shared pool access can be more efficient
  • the system spends less overhead managing memory mappings

So the database may perform better overall — but that is not the same as disk I/O becoming faster.


The Correct Mental Model

Think of it like this:

Without HugePages

Oracle SGA is built using many small OS pages (typically 4 KB).

With HugePages

Oracle SGA is backed by large pages (typically 2 MB).

This means the OS and CPU can manage Oracle’s large memory regions more efficiently.

But the database blocks are still:

  • 8 KB / 16 KB Oracle blocks,
  • read from the same disks,
  • through the same storage path,
  • at the same storage latency.

So:

HugePages optimize memory handling, not disk mechanics.


Where HugePages Can Indirectly Reduce I/O

Here’s the subtle but important part.

HugePages do not speed up physical I/O directly, but they can help Oracle make better use of memory, and that can reduce the need for physical I/O indirectly in some cases.

Example 1: Large Buffer Cache Works Better

If you have a large SGA and buffer cache, HugePages help Oracle manage that memory more efficiently.

Result:

  • more stable SGA usage
  • lower CPU overhead
  • better scalability under load

If the buffer cache is effective, more reads are served from memory instead of disk.

So you may observe:

  • fewer physical reads
  • better response time
  • lower pressure on storage

But the key point is:

HugePages did not make a single disk read faster.
They helped Oracle use memory better, which may reduce the number of disk reads required.


Example 2: Less Swapping = Less Disaster

If HugePages are not used properly and memory pressure happens, parts of memory management may become inefficient, and in bad configurations there can be swapping-related problems.

HugePages pin SGA memory and prevent it from being swapped.

That means:

  • Oracle buffer cache stays resident
  • shared memory remains stable
  • performance doesn’t collapse due to memory pressure

Again, that is not “faster disk I/O”; it is avoiding performance degradation.


When HugePages Help the Most

HugePages are most useful when:

  • SGA is large (tens or hundreds of GB)
  • Oracle host has many concurrent sessions
  • CPU overhead matters
  • TLB misses are significant
  • page table size is large
  • multiple Oracle instances are running
  • you need stable, predictable performance

In such systems, HugePages can improve:

  • CPU efficiency
  • memory translation overhead
  • latency consistency
  • overall throughput of the DB server

That may look like I/O improvement from the application side because transactions become faster — but the actual storage path has not changed.


When HugePages Will Not Help Much

HugePages usually give little or no visible benefit if:

  • SGA is small
  • workload is purely storage-bound
  • storage latency is the dominant bottleneck
  • system is already under-tuned at SQL/query level
  • bad execution plans are causing excess reads
  • indexes/statistics/partitioning are poor

In those cases, if someone asks:

“Can HugePages increase I/O?”

The right answer is:

No, not in the way you are probably hoping.
You need to fix storage, SQL, or database design.


Direct I/O vs Buffered I/O: Important Distinction

If Oracle uses Direct I/O / ASM

Data may bypass the OS page cache and move between storage and Oracle-managed memory.

HugePages help the Oracle memory side (SGA), but:

  • they still do not change storage latency
  • they still do not increase IOPS capacity

If Oracle uses filesystem buffered I/O

OS page cache uses normal pages, not HugePages in the Oracle HugePages sense.

Again:

  • HugePages help Oracle SGA
  • not the OS page cache I/O path

So in both cases: storage I/O itself does not become faster because HugePages were enabled.


Why Some DBAs Believe HugePages Improved I/O

Because after enabling HugePages they may observe:

  • lower DB CPU usage
  • faster SQL execution
  • fewer stalls
  • lower latency at application level
  • more stable throughput

This leads to the impression:

“HugePages improved I/O”

But the real reason is often:

  1. Oracle memory management became more efficient
  2. SGA became more stable
  3. Buffer cache behavior improved
  4. CPU spent less time on page translation / kernel overhead
  5. More work completed per second

That’s an overall performance gain, not a storage subsystem gain.


Simple Yes/No by Scenario

Scenario A: Slow SAN / slow disk / high read latency

Will HugePages increase disk I/O speed?
No

Scenario B: Very large SGA, CPU overhead high, many TLB misses

Will HugePages improve DB performance?
Yes

Scenario C: Buffer cache becomes more effective and physical reads drop

Did HugePages increase I/O?
No directly
They helped reduce the need for some physical I/O.

Scenario D: Oracle server was unstable due to memory fragmentation/swapping risk

Can HugePages make performance much better?
Yes But again, not because disk became faster.


Best Technical Statement You Can Use

If you want a precise architect-level answer, use this:

HugePages do not directly increase Oracle physical I/O throughput, IOPS, or reduce storage latency. They improve SGA memory efficiency by reducing page table overhead, TLB misses, and swapping risk. This can indirectly improve overall database performance and, in some workloads, reduce physical I/O demand by improving memory utilization and buffer cache effectiveness.


Final Conclusion

Answer:

No, HugePages do not directly increase I/O.

Detailed meaning:

  • Do they increase disk speed?No
  • Do they reduce storage latency?No
  • Do they improve Oracle memory efficiency?Yes
  • Can that indirectly improve DB performance and reduce some physical reads?Yes
  • Should you configure HugePages for large Oracle databases?Absolutely yes, but for memory efficiency and stability, not because they “make I/O faster.”

Practical DBA Rule

If the bottleneck is:

  • storage → fix storage / SQL / access path
  • memory management / large SGA / CPU overhead → HugePages help
  • both → HugePages help one side only (memory side)

Relationship between huge page and OS Page

 

1. Normal OS Page vs HugePage (Fundamentals)

Default Linux memory pages

  • Standard page size: 4 KB
  • Every memory access uses page tables to translate virtual → physical
  • Large SGA → millions of page table entries (PTEs)

HugePages (HugeTLB)

  • Larger page size: typically 2 MB (or 1 GB on some systems)
  • Reduces:
    • Page table entries
    • TLB (Translation Lookaside Buffer) misses
    • CPU overhead

2. How Oracle Uses HugePages Internally

SGA allocation path

When HugePages are enabled (USE_LARGE_PAGES=TRUE|ONLY):

  1. Oracle requests SGA memory during startup
  2. Kernel checks:
    • Are HugePages available (/proc/meminfo)?
  3. If yes:
    • Allocates SGA entirely from HugePages pool
    • Pins memory → not swappable
  4. If not:
    • Falls back to normal 4KB pages (unless ONLY)

Key behavior

  • Only SGA uses HugePages
  • PGA, stack, processes → still use normal pages

3. Interaction with OS Page Cache & IO

This is where most confusion happens.

A. Buffered I/O (filesystem)

  • Uses OS page cache
  • Pages are normal 4KB pages
  • Reads:
    Disk → OS Page Cache → Oracle buffer cache
    

✅ HugePages are NOT used in OS page cache


B. Direct I/O (O_DIRECT / ASM / Filesystem with DirectIO)

  • Bypasses OS page cache
  • Data flows:
Disk → Oracle SGA buffer cache (HugePages)

Important insight:

  • With HugePages:
    • Buffer cache resides in HugePages
    • Data blocks (8 KB, 16 KB) are packed inside HugePages

👉 HugePages improve:

  • Memory lookup efficiency
  • Buffer cache access speed

👉 But they do NOT change I/O block size


4. Relationship: HugePages vs DB Block vs OS Page

LayerTypical Size
Oracle DB block8 KB (default)
OS normal page4 KB
HugePage2 MB

Internal packing

A single HugePage (2 MB) holds:

2 MB / 8 KB = 256 Oracle blocks

So:

  • Oracle still uses 8 KB blocks
  • HugePages just back the memory region

5. Does HugePages Increase IO Performance?

Direct Impact:

❌ Does NOT increase disk IO throughput directly
✅ Reduces CPU overhead during memory access

Indirect Impact:

  • Faster buffer cache lookups
  • Reduced TLB misses
  • Better scalability for large SGA
  • Less kernel overhead → more CPU for DB work

👉 So overall DB performance improves, but IO bandwidth stays same


6. How HugePages Reduce Kernel Overhead

Without HugePages (example 500 GB SGA):

500 GB / 4 KB ≈ 131 million pages

With HugePages (2 MB):

500 GB / 2 MB ≈ 256,000 pages

✅ Reduction:

  • Page table size drastically reduced
  • Fewer TLB entries needed
  • Less CPU spent on memory translation

7. Key Kernel Components Involved

1. TLB (Translation Lookaside Buffer)

  • Cache of virtual→physical mappings
  • HugePages → fewer entries needed → less misses

2. Page Tables

  • Shrink significantly with HugePages

3. Buddy Allocator / HugeTLB pool

  • Pre-reserved memory (vm.nr_hugepages)
  • Prevents fragmentation issues

8. Why HugePages Do NOT Affect OS Page Cache

Because:

  • Page cache is managed by kernel VM subsystem
  • HugePages are:
    • Preallocated
    • Not swappable
    • Not part of regular memory pool

👉 They are isolated memory region


9. How HugePages Size Is Increased

Step 1: Calculate requirement

HugePages = (SGA size) / HugePage size

Example:

SGA = 100 GB
HugePage = 2 MB

100 GB / 2 MB = 51200 pages

Step 2: Configure kernel

vm.nr_hugepages = 51200

Step 3: Reload

sysctl -p

Step 4: Verify

cat /proc/meminfo | grep Huge

Important fields:

HugePages_Total
HugePages_Free
HugePages_Rsvd
HugePages_Surp

10. Advanced Behavior (Very Important)

Partial allocation

If HugePages are insufficient:

  • Oracle may split:
    • Part SGA → HugePages
    • Rest → normal pages ❌ (bad)

👉 Causes:

  • Performance inconsistency
  • Memory fragmentation

✅ Best practice:

USE_LARGE_PAGES=ONLY

11. Summary (Core Concepts)

What HugePages DO

✅ Optimize SGA memory management
✅ Reduce CPU overhead
✅ Improve scalability
✅ Prevent swapping

What HugePages DO NOT DO

❌ Do not change DB block size
❌ Do not increase disk IO speed
❌ Do not affect OS page cache


12. Simple Mental Model

Think of it like this:

  • Regular pages → many small boxes (slow to manage)
  • HugePages → fewer large containers (efficient)

Oracle:

  • Stores data blocks inside containers
  • Still processes blocks same way

13. DBA Insight (Important for you as Architect)

Given your role:

HugePages matter most when:

  • SGA > 16–32 GB
  • High OLTP concurrency
  • CPU contention exists

They matter less when:

  • IO bound system (slow storage)
  • Small SGA

How to Identify and fix Index is Causing Performance Issue in oracle database ?

 

🔎 PART 1 — How to Identify Index is Causing Performance Issue


✅ Step 1: Find Slow SQL (Entry Point)

Start from Top SQL, not from the index.

SELECT sql_id, executions, elapsed_time, buffer_gets
FROM v$sql
ORDER BY elapsed_time DESC FETCH FIRST 10 ROWS ONLY;

👉 Pick the problematic SQL


✅ Step 2: Check Execution Plan

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('sql_id', NULL, 'ALLSTATS LAST'));

🔍 Look for:

SymptomMeaning
FULL TABLE SCANIndex missing or ignored ❌
INDEX RANGE SCAN with high costIndex inefficient ❌
INDEX FULL SCANPoor selectivity ❌

✅ Step 3: Check Index Access Cost

In plan, observe:

  • Cost
  • Rows
  • Bytes
  • Buffer Gets

👉 If index scan:

  • Reads too many blocks → Index is not selective

✅ Step 4: Check Clustering Factor (Critical)

SELECT index_name, clustering_factor, num_rows
FROM dba_indexes
WHERE index_name = 'INDEX_NAME';

Interpretation:

  • CF ≈ number of rows → ❌ Random reads (BAD index)
  • CF ≈ number of blocks → ✅ Efficient index

👉 High CF → more I/O → slower query


✅ Step 5: Check Selectivity (Very Important)

SELECT num_distinct, num_rows
FROM dba_tab_col_statistics
WHERE table_name = 'TABLE_NAME'
AND column_name = 'COLUMN_NAME';

👉 Formula:

Selectivity = num_distinct / num_rows

Result:

ValueMeaning
High (close to 1)✅ Good index
Low (<0.1)❌ Bad index

👉 Example:

  • status = 'ACTIVE' → bad index
  • customer_id → good index

✅ Step 6: Check If Index is Used Properly

ALTER INDEX schema.index_name MONITORING USAGE;

Then:

SELECT * FROM v$object_usage WHERE index_name = 'INDEX_NAME';

👉 Result:

  • USED = NO → Index useless
  • USED = YES but slow → Index is problem

✅ Step 7: Check Logical Reads (Performance Impact)

SELECT *
FROM v$segment_statistics
WHERE object_name = 'INDEX_NAME'
AND statistic_name = 'logical reads';

👉 High logical reads = heavy I/O → possible issue


✅ Step 8: Check Index BLEVEL (Depth)

SELECT index_name, blevel
FROM dba_indexes
WHERE index_name = 'INDEX_NAME';

BLEVELMeaning
0–2✅ Good
3–4⚠️ Ok
>4❌ Too deep

✅ Step 9: Validate Stats

EXEC DBMS_STATS.GATHER_INDEX_STATS('SCHEMA','INDEX_NAME');

👉 Bad stats = wrong execution plan


✅ Step 10: Compare With Full Table Scan

Force test:

SELECT /*+ FULL(table_name) */ ..

👉 If FULL scan is faster → Index is hurting performance


🚨 Conclusion (Index Causing Issue If):

  • High clustering factor
  • Low selectivity
  • High logical reads
  • Not used or incorrectly used
  • Execution plan shows inefficiency

✅ PART 2 — How to Check Index is GOOD or BAD


🎯 Step-by-Step Evaluation Framework


✅ Step 1: Usage Check

SELECT * FROM v$object_usage
WHERE index_name = 'INDEX_NAME';

ResultDecision
USED frequently✅ Good
NOT USED❌ Drop candidate

✅ Step 2: Query Alignment

👉 Check if index matches WHERE clause:

WHERE col1 = :1 AND col2 = :2

👉 Good index:

(col1, col2)

👉 Bad index:

(col3, col4)


✅ Step 3: Selectivity Check

  • High distinct values → ✅ Good
  • Low distinct values → ❌ Bad

✅ Step 4: Clustering Factor

  • Low CF → ✅ Good
  • High CF → ❌ Bad

✅ Step 5: Size vs Benefit

SELECT bytes/1024/1024 MB
FROM dba_segments
WHERE segment_name='INDEX_NAME';

👉 Large + rarely used → ❌ Bad


✅ Step 6: DML Overhead Check

👉 Too many indexes on high DML table:

  • Slows INSERT/UPDATE/DELETE
SELECT index_name
FROM dba_indexes
WHERE table_name = 'TABLE_NAME';

👉 More indexes ≠ better


✅ Step 7: Redundancy Check

👉 Duplicate indexes = waste


✅ Step 8: BLEVEL + Leaf Blocks

👉 Small, compact index = good
👉 Huge fragmented index = bad


✅ FINAL DECISION MATRIX

ConditionGood Index ✅Bad Index ❌
UsageFrequentNot used
Execution PlanIndex scanFull scan
SelectivityHighLow
Clustering FactorLowHigh
BLEVEL≤2>4
SizeOptimizedHuge unused
DML ImpactLowHigh

🔥 REAL-WORLD QUICK CHECK SCRIPT


SELECT
i.index_name,
i.blevel,
i.clustering_factor,
i.num_rows,
i.distinct_keys,
s.bytes/1024/1024 MB
FROM dba_indexes i
JOIN dba_segments s
ON i.index_name = s.segment_name
WHERE i.owner = 'SCHEMA_NAME'
ORDER BY s.bytes DESC;


💡 Architect-Level Insight (Important)

👉 Don’t judge index in isolation

Always evaluate:

SQL → Execution Plan → Index → Data Pattern

✅ Recommended Approach (Best Practice)

  1. Identify top SQL (AWR/V$SQL)
  2. Analyze execution plan
  3. Validate index usage
  4. Check clustering factor + selectivity
  5. Decide: KEEP / REBUILD / DROP

INDEX HEALTH CLASSIFICATION SCRIPT


set lines 300 pages 500
col  OWNER  for a10
col  INDEX_NAME  for a50
col TABLE_NAME for a50

WITH idx_stats AS (
    SELECT 
        i.owner,
        i.index_name,
        i.table_name,
        i.blevel,
        i.clustering_factor,
        i.num_rows,
        i.distinct_keys,
        s.bytes/1024/1024 AS size_mb,
        CASE 
            WHEN i.num_rows > 0 
            THEN ROUND(i.distinct_keys / i.num_rows, 4)
            ELSE 0
        END AS selectivity
    FROM dba_indexes i
    JOIN dba_segments s
        ON i.owner = s.owner
        AND i.index_name = s.segment_name
    WHERE i.owner  IN ('RITRS')
AND CLUSTERING_FACTOR >700
),
usage_stats AS (
    SELECT 
        index_name,
        CASE 
            WHEN used = 'YES' THEN 'USED'
            ELSE 'NOT_USED'
        END AS usage_status
    FROM v$object_usage
)
SELECT 
    i.owner,
    i.index_name,
    i.table_name,
    i.size_mb,
    i.blevel,
    i.clustering_factor,
    i.num_rows,
    i.distinct_keys,
    i.selectivity,
    NVL(u.usage_status, 'UNKNOWN') AS usage_status,
    CASE 
        WHEN NVL(u.usage_status, 'NOT_USED') = 'NOT_USED'
            THEN 'DROP_CANDIDATE'
        WHEN i.selectivity < 0.05
            THEN 'BAD_INDEX_LOW_SELECTIVITY'
        WHEN i.clustering_factor > i.num_rows * 0.9
            THEN 'BAD_INDEX_HIGH_CF'
        WHEN i.blevel > 4
            THEN 'REBUILD_REQUIRED'
        WHEN i.size_mb > 500 
             AND NVL(u.usage_status, 'NOT_USED') <> 'USED'
            THEN 'REVIEW_BIG_UNUSED_INDEX'
        ELSE 'GOOD_INDEX'
    END AS index_health_status
FROM idx_stats i
LEFT JOIN usage_stats u
    ON i.index_name = u.index_name
ORDER BY i.size_mb DESC

Alert Fix : 1 index(es) with too many extents in oracle database - oracle performance issue

🔍 What it Means

  • Oracle stores segments (tables, indexes) in extents (contiguous blocks).
  • Over time, due to growth, rebuilds, and DML, an index may end up with:
    • Hundreds or thousands of small extents
  • This is usually flagged when:
extents > threshold (e.g., 1000 or configurable limit)


⚠️ Why It Matters

  • Generally not critical in modern Oracle (ASSM + LMT) environments.
  • But can still cause:
    • Slight performance overhead in segment management
    • Longer backup/restore times
    • Inefficient disk usage in older systems
    • Red flag in audits / compliance reports

👉 In most modern systems, it's more of a hygiene issue than a real performance problem.


✅ Identify the Problem Index

Run:

SELECT owner, segment_name, segment_type, extents
FROM dba_segments
WHERE segment_type = 'INDEX'
AND extents > 500 -- adjust threshold
ORDER BY extents DESC;

Or specifically:

SELECT owner, index_name
FROM dba_indexes
WHERE index_name IN (
SELECT segment_name
FROM dba_segments
WHERE extents > 500
);

🛠️ Fix Options

✅ Option 1: Rebuild the Index (Recommended)

ALTER INDEX schema_name.index_name REBUILD;

Benefits:

  • Reduces number of extents
  • Compacts index
  • Eliminates fragmentation

👉 If large index:

ALTER INDEX schema_name.index_name REBUILD ONLINE;


✅ Option 2: Rebuild with Storage Clause

To control extent allocation:

ALTER INDEX schema_name.index_name REBUILD
STORAGE (INITIAL 100M NEXT 100M);

👉 Helps avoid many small extents in future.


✅ Option 3: Move to Different Tablespace

If fragmentation is due to tablespace issues:


ALTER INDEX schema_name.index_name
REBUILD TABLESPACE new_tablespace;


✅ Option 4: Coalesce (Less Effective)

ALTER INDEX schema_name.index_name COALESCE;
  • Merges adjacent leaf blocks
  • Does NOT reduce extents
  • Safer but less impactful

🧠 Root Cause Analysis

Common reasons:

  • Autoallocate tablespace creating many uneven extents
  • Frequent index growth/shrink cycles
  • Old dictionary-managed tablespaces (DMT)
  • Poor storage parameters
  • Partitioned index mismanagement

🎯 Best Practices (Going Forward)

✅ Use:

  • Locally Managed Tablespaces (LMT)
  • AUTOALLOCATE or UNIFORM extent size

✅ For large indexes:

  • Predefine proper extent sizes:
CREATE INDEX idx_name
ON table_name(column)
TABLESPACE ts_name
STORAGE (INITIAL 128M NEXT 128M);

✅ Monitor periodically:

SELECT segment_name, extents
FROM dba_segments
WHERE extents > 500;

🚨 When You Can Ignore It

You can safely ignore if:

  • Database uses ASSM + LMT
  • No performance issues observed
  • Extents are auto-managed efficiently

✅ Recommendation for You (Database Architect Perspective)

Given your role, I’d suggest:

  1. ✅ Validate index size vs extents ratio
  2. ✅ Rebuild only if:
    • Extents are very high (>1000)
    • Index is frequently accessed
  3. ✅ Automate check in health scripts
  4. ✅ Standardize storage clauses for large indexes

💬 Summary

IssueAction
Too many extentsRebuild index
FragmentationRebuild or move
Minor issueIgnore if no perf impact

Tuesday, May 26, 2026

Is CPU issue ? troubleshooting workflow for oracle database performance issue with automation

✅ ✅ 1. CPU Troubleshooting Framework (Like iostat for CPU)

Use:

vmstat 2 5

or

top


📊 ✅ 2. CPU Metrics Explained (vmstat / top)

us sy id wa st
MetricMeaning
usUser CPU (app/DB queries)
sySystem CPU (kernel/syscalls)
idIdle CPU
waIO wait
stSteal (VM contention)

✅ ✅ 3. CPU Baseline Table (Production Standard)

📊 CPU Utilization

MetricGood ✅Warning ⚠️Critical 🚨
idle (id)> 40%20–40%< 20%
user (us)< 50%50–70%> 70%
system (sy)< 20%20–30%> 30%
iowait (wa)< 5%5–15%> 20%
steal (st)01–5> 5

📊 Run Queue (Very Important)

r (run queue)
MetricGood ✅Warning ⚠️Critical 🚨
r vs CPU cores≤ cores1.5x cores2x+ cores

✅ ✅ 4. CPU Health Decision Matrix

ConditionMeaning
High usCPU-heavy queries
High sykernel/system overhead
High wanot CPU → disk issue
High rCPU contention
High stVM issue

✅ ✅ 5. CPU Scoring Model (Like Disk)

🎯 Formula

Score = 100 
- user penalty 
- system penalty 
- run queue penalty 
- iowait penalty

📊 Penalties

✅ User CPU

  • <50 → 0
  • 50–70 → -15
  • 70 → -30


✅ System CPU

  • <20 → 0
  • 20–30 → -15
  • 30 → -30


✅ Run Queue

  • ≤ cores → 0
  • 1–2x cores → -20
  • 2x cores → -40


✅ IO Wait

  • <5 → 0
  • 5–15 → -20
  • 15 → -40


✅ Final Status

ScoreStatus
80–100✅ Healthy
60–80⚠️ Warning
40–60🔶 Degraded
<40🚨 Critical

✅ ✅ 6. CPU Monitoring Script (PRODUCTION READY)

📜 cpu_health.sh


#!/bin/bash

echo "===== CPU Health Check ====="
date

vmstat 2 3 | tail -1 | awk '
{
r=$1
us=$13
sy=$14
id=$15
wa=$16
st=$17

score=100

# user penalty
if (us > 70) score -= 30
else if (us > 50) score -= 15

# system penalty
if (sy > 30) score -= 30
else if (sy > 20) score -= 15

# iowait penalty
if (wa > 15) score -= 40
else if (wa > 5) score -= 20

# status
status="HEALTHY"
if (score < 40) status="CRITICAL"
else if (score < 60) status="DEGRADED"
else if (score < 80) status="WARNING"

printf "RunQ=%d User=%d%% Sys=%d%% Idle=%d%% IOwait=%d%% Status=%s Score=%d\n",r,us,sy,id,wa,status,score
}
'

Identify CPU Bottleneck (Root Cause)


✅ Step 1: Check top processes

top -o %CPU

✅ Step 2: Per process CPU

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head

✅ Step 3: Thread-level (VERY IMPORTANT for DB)

top -H

✅ Step 4: Per process breakdown

pidstat -u 2

✅ ✅ 8. CPU Issue Patterns (DBA Mapping)


🔴 Pattern 1: High USER CPU

us > 70%

👉 Cause:

  • Complex SQL
  • Full scans
  • bad execution plans

✅ DB Mapping:

SELECT sql_id, cpu_time
FROM v$sql
ORDER BY cpu_time DESC FETCH FIRST 10 ROWS;


🔴 Pattern 2: High SYSTEM CPU

sy > 30%

👉 Cause:

  • excessive context switching
  • kernel overhead
  • I/O interrupts

✅ Check:

vmstat 1

Look at:

cs (context switches)
in (interrupts)

🔴 Pattern 3: High RUN QUEUE

r >> CPU cores

👉 Cause:

  • CPU contention
  • too many concurrent queries

🔴 Pattern 4: High IO WAIT

wa > 20%

👉 Cause:

  • NOT CPU problem
    👉 disk bottleneck (go back to iostat)

🔴 Pattern 5: High STEAL (VM)

st > 5%

👉 Cause:

  • host contention (cloud/VM issue)

✅ ✅ 9. CPU Alert Script (Automation)

📜 cpu_alert.sh

#!/bin/bash

HOST=$(hostname)

ALERT=$(vmstat 2 2 | tail -1 | awk '
{
us=$13; sy=$14; wa=$16; st=$17

if (us>70 || sy>30 || wa>15 || st>5)
print "CPU Issue: user="us"% sys="sy"% wa="wa"% st="st"%"
}
')

if [ ! -z "$ALERT" ]; then
echo "$ALERT" | mail -s "CPU Alert on $HOST" your_email@company.com
fi
``

✅ ✅ 10. DBA Troubleshooting Flow


🔥 When CPU alert triggers:

1. Check:

vmstat 1 5

2. Identify pattern:

  • High us → SQL issue
  • High sy → OS/kernel
  • High wa → disk issue

3. Identify process:

top

4. Map to DB:

SELECT sql_id, cpu_time FROM v$sql ORDER BY cpu_time DESC;

5. Fix:

  • Tune query ✅
  • Index ✅
  • Reduce parallelism ✅
  • Limit sessions ✅

✅ ✅ 11. Golden Rules (CPU)

✅ Healthy

idle > 40%
wa < 5%
r <= cores

🚨 Critical CPU

idle < 10%
us > 70%
r > 2x cores

⚠️ Trick Case (VERY IMPORTANT)

CPU looks high BUT wa is high
👉 NOT CPU issue → DISK issue

🎯 Final Outcome

Now you have:

✅ CPU baseline table
✅ Health scoring model
✅ Monitoring script
✅ Alerting system
✅ DB correlation
✅ Troubleshooting flow

Setup and design database architecture to meet SLA - 99.9 , 99.99 , 99.999

Achieving distinct Service Level Agreements (SLAs) for a database requires scaling redundancy and infrastructure complexity.  The three-nine...