Wednesday, May 27, 2026

Oracle/OS huge page help to increase IO - Yes or No ?

Short Answer

No — HugePages do not directly increase disk I/O throughput or IOPS.

What they do improve is memory-management efficiency for Oracle’s SGA, which can indirectly improve overall database performance in workloads where CPU overhead, memory translation, and buffer cache efficiency matter.

So the precise answer is:

Direct disk I/O improvement: No
Indirect database performance improvement: Yes, sometimes
Can query response time improve? Yes
Can storage subsystem become faster? No

Why the Answer Is “No”

When people say “I/O” in Oracle, they often mix up two different things:

1. Disk I/O

This means:

reading from storage
writing to storage
IOPS
throughput (MB/s)
latency from disk/ASM/SAN/NVMe

HugePages do not make:

disks spin faster
SSDs respond faster
SAN latency lower
ASM perform more physical reads/sec by itself

So if your system is bottlenecked on:

slow storage,
high read latency,
redo write latency,
log file sync due to storage,
db file sequential/scattered read delays,

then HugePages will not fix that bottleneck.

Then Why Do People Say HugePages Help Performance?

Because Oracle uses HugePages mainly for the SGA (System Global Area), and that changes how efficiently memory is managed, not how storage hardware performs.

HugePages help by reducing:

number of memory pages
page table overhead
TLB misses
kernel memory management overhead
swapping risk for SGA

This can reduce CPU consumption and improve scalability, especially for large SGAs.

That means:

buffer cache access becomes cheaper from a CPU/memory-translation perspective
shared pool access can be more efficient
the system spends less overhead managing memory mappings

So the database may perform better overall — but that is not the same as disk I/O becoming faster.

The Correct Mental Model

Think of it like this:

Without HugePages

Oracle SGA is built using many small OS pages (typically 4 KB).

With HugePages

Oracle SGA is backed by large pages (typically 2 MB).

This means the OS and CPU can manage Oracle’s large memory regions more efficiently.

But the database blocks are still:

8 KB / 16 KB Oracle blocks,
read from the same disks,
through the same storage path,
at the same storage latency.

So:

HugePages optimize memory handling, not disk mechanics.

Where HugePages Can Indirectly Reduce I/O

Here’s the subtle but important part.

HugePages do not speed up physical I/O directly, but they can help Oracle make better use of memory, and that can reduce the need for physical I/O indirectly in some cases.

Example 1: Large Buffer Cache Works Better

If you have a large SGA and buffer cache, HugePages help Oracle manage that memory more efficiently.

Result:

more stable SGA usage
lower CPU overhead
better scalability under load

If the buffer cache is effective, more reads are served from memory instead of disk.

So you may observe:

fewer physical reads
better response time
lower pressure on storage

But the key point is:

HugePages did not make a single disk read faster.
They helped Oracle use memory better, which may reduce the number of disk reads required.

Example 2: Less Swapping = Less Disaster

If HugePages are not used properly and memory pressure happens, parts of memory management may become inefficient, and in bad configurations there can be swapping-related problems.

HugePages pin SGA memory and prevent it from being swapped.

That means:

Oracle buffer cache stays resident
shared memory remains stable
performance doesn’t collapse due to memory pressure

Again, that is not “faster disk I/O”; it is avoiding performance degradation.

When HugePages Help the Most

HugePages are most useful when:

SGA is large (tens or hundreds of GB)
Oracle host has many concurrent sessions
CPU overhead matters
TLB misses are significant
page table size is large
multiple Oracle instances are running
you need stable, predictable performance

In such systems, HugePages can improve:

CPU efficiency
memory translation overhead
latency consistency
overall throughput of the DB server

That may look like I/O improvement from the application side because transactions become faster — but the actual storage path has not changed.

When HugePages Will Not Help Much

HugePages usually give little or no visible benefit if:

SGA is small
workload is purely storage-bound
storage latency is the dominant bottleneck
system is already under-tuned at SQL/query level
bad execution plans are causing excess reads
indexes/statistics/partitioning are poor

In those cases, if someone asks:

“Can HugePages increase I/O?”

The right answer is:

No, not in the way you are probably hoping.
You need to fix storage, SQL, or database design.

Direct I/O vs Buffered I/O: Important Distinction

If Oracle uses Direct I/O / ASM

Data may bypass the OS page cache and move between storage and Oracle-managed memory.

HugePages help the Oracle memory side (SGA), but:

they still do not change storage latency
they still do not increase IOPS capacity

If Oracle uses filesystem buffered I/O

OS page cache uses normal pages, not HugePages in the Oracle HugePages sense.

Again:

HugePages help Oracle SGA
not the OS page cache I/O path

So in both cases: storage I/O itself does not become faster because HugePages were enabled.

Why Some DBAs Believe HugePages Improved I/O

Because after enabling HugePages they may observe:

lower DB CPU usage
faster SQL execution
fewer stalls
lower latency at application level
more stable throughput

This leads to the impression:

“HugePages improved I/O”

But the real reason is often:

Oracle memory management became more efficient
SGA became more stable
Buffer cache behavior improved
CPU spent less time on page translation / kernel overhead
More work completed per second

That’s an overall performance gain, not a storage subsystem gain.

Simple Yes/No by Scenario

Scenario A: Slow SAN / slow disk / high read latency

Will HugePages increase disk I/O speed?
No

Scenario B: Very large SGA, CPU overhead high, many TLB misses

Will HugePages improve DB performance?
Yes

Scenario C: Buffer cache becomes more effective and physical reads drop

Did HugePages increase I/O?
No directly
They helped reduce the need for some physical I/O.

Scenario D: Oracle server was unstable due to memory fragmentation/swapping risk

Can HugePages make performance much better?
Yes But again, not because disk became faster.

Best Technical Statement You Can Use

If you want a precise architect-level answer, use this:

HugePages do not directly increase Oracle physical I/O throughput, IOPS, or reduce storage latency. They improve SGA memory efficiency by reducing page table overhead, TLB misses, and swapping risk. This can indirectly improve overall database performance and, in some workloads, reduce physical I/O demand by improving memory utilization and buffer cache effectiveness.

Final Conclusion

Answer:

No, HugePages do not directly increase I/O.

Detailed meaning:

Do they increase disk speed? → No
Do they reduce storage latency? → No
Do they improve Oracle memory efficiency? → Yes
Can that indirectly improve DB performance and reduce some physical reads? → Yes
Should you configure HugePages for large Oracle databases? → Absolutely yes, but for memory efficiency and stability, not because they “make I/O faster.”

Practical DBA Rule

If the bottleneck is:

storage → fix storage / SQL / access path
memory management / large SGA / CPU overhead → HugePages help
both → HugePages help one side only (memory side)

Relationship between huge page and OS Page

1. Normal OS Page vs HugePage (Fundamentals)

Default Linux memory pages

Standard page size: 4 KB
Every memory access uses page tables to translate virtual → physical
Large SGA → millions of page table entries (PTEs)

HugePages (HugeTLB)

Larger page size: typically 2 MB (or 1 GB on some systems)
Reduces:
- Page table entries
- TLB (Translation Lookaside Buffer) misses
- CPU overhead

2. How Oracle Uses HugePages Internally

SGA allocation path

When HugePages are enabled (USE_LARGE_PAGES=TRUE|ONLY):

Oracle requests SGA memory during startup
Kernel checks:
- Are HugePages available (/proc/meminfo)?
If yes:
- Allocates SGA entirely from HugePages pool
- Pins memory → not swappable
If not:
- Falls back to normal 4KB pages (unless ONLY)

Key behavior

Only SGA uses HugePages
PGA, stack, processes → still use normal pages

3. Interaction with OS Page Cache & IO

This is where most confusion happens.

A. Buffered I/O (filesystem)

Uses OS page cache
Pages are normal 4KB pages

Reads:

Disk → OS Page Cache → Oracle buffer cache

✅ HugePages are NOT used in OS page cache

B. Direct I/O (O_DIRECT / ASM / Filesystem with DirectIO)

Bypasses OS page cache
Data flows:

Disk → Oracle SGA buffer cache (HugePages)

Important insight:

With HugePages:
- Buffer cache resides in HugePages
- Data blocks (8 KB, 16 KB) are packed inside HugePages

👉 HugePages improve:

Memory lookup efficiency
Buffer cache access speed

👉 But they do NOT change I/O block size

4. Relationship: HugePages vs DB Block vs OS Page

Layer	Typical Size
Oracle DB block	8 KB (default)
OS normal page	4 KB
HugePage	2 MB

Internal packing

A single HugePage (2 MB) holds:

2 MB / 8 KB = 256 Oracle blocks

So:

Oracle still uses 8 KB blocks
HugePages just back the memory region

5. Does HugePages Increase IO Performance?

Direct Impact:

❌ Does NOT increase disk IO throughput directly
✅ Reduces CPU overhead during memory access

Indirect Impact:

Faster buffer cache lookups
Reduced TLB misses
Better scalability for large SGA
Less kernel overhead → more CPU for DB work

👉 So overall DB performance improves, but IO bandwidth stays same

6. How HugePages Reduce Kernel Overhead

Without HugePages (example 500 GB SGA):

500 GB / 4 KB ≈ 131 million pages

With HugePages (2 MB):

500 GB / 2 MB ≈ 256,000 pages

✅ Reduction:

Page table size drastically reduced
Fewer TLB entries needed
Less CPU spent on memory translation

7. Key Kernel Components Involved

1. TLB (Translation Lookaside Buffer)

Cache of virtual→physical mappings
HugePages → fewer entries needed → less misses

2. Page Tables

Shrink significantly with HugePages

3. Buddy Allocator / HugeTLB pool

Pre-reserved memory (vm.nr_hugepages)
Prevents fragmentation issues

8. Why HugePages Do NOT Affect OS Page Cache

Because:

Page cache is managed by kernel VM subsystem
HugePages are:
- Preallocated
- Not swappable
- Not part of regular memory pool

👉 They are isolated memory region

9. How HugePages Size Is Increased

Step 1: Calculate requirement

HugePages = (SGA size) / HugePage size

Example:

SGA = 100 GB
HugePage = 2 MB

100 GB / 2 MB = 51200 pages

Step 2: Configure kernel

vm.nr_hugepages = 51200

Step 3: Reload

sysctl -p

Step 4: Verify

cat /proc/meminfo | grep Huge

Important fields:

HugePages_Total
HugePages_Free
HugePages_Rsvd
HugePages_Surp

10. Advanced Behavior (Very Important)

Partial allocation

If HugePages are insufficient:

Oracle may split:
- Part SGA → HugePages
- Rest → normal pages ❌ (bad)

👉 Causes:

Performance inconsistency
Memory fragmentation

✅ Best practice:

USE_LARGE_PAGES=ONLY

11. Summary (Core Concepts)

What HugePages DO

✅ Optimize SGA memory management
✅ Reduce CPU overhead
✅ Improve scalability
✅ Prevent swapping

What HugePages DO NOT DO

❌ Do not change DB block size
❌ Do not increase disk IO speed
❌ Do not affect OS page cache

12. Simple Mental Model

Think of it like this:

Regular pages → many small boxes (slow to manage)
HugePages → fewer large containers (efficient)

Oracle:

Stores data blocks inside containers
Still processes blocks same way

13. DBA Insight (Important for you as Architect)

Given your role:

HugePages matter most when:

SGA > 16–32 GB
High OLTP concurrency
CPU contention exists

They matter less when:

IO bound system (slow storage)
Small SGA

How to Identify and fix Index is Causing Performance Issue in oracle database ?

🔎 PART 1 — How to Identify Index is Causing Performance Issue

✅ Step 1: Find Slow SQL (Entry Point)

Start from Top SQL, not from the index.

SELECT sql_id, executions, elapsed_time, buffer_gets

FROM v$sql

ORDER BY elapsed_time DESC FETCH FIRST 10 ROWS ONLY;

👉 Pick the problematic SQL

✅ Step 2: Check Execution Plan

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('sql_id', NULL, 'ALLSTATS LAST'));

🔍 Look for:

Symptom	Meaning
FULL TABLE SCAN	Index missing or ignored ❌
INDEX RANGE SCAN with high cost	Index inefficient ❌
INDEX FULL SCAN	Poor selectivity ❌

✅ Step 3: Check Index Access Cost

In plan, observe:

Cost
Rows
Bytes
Buffer Gets

👉 If index scan:

Reads too many blocks → Index is not selective

✅ Step 4: Check Clustering Factor (Critical)

SELECT index_name, clustering_factor, num_rows
FROM dba_indexes
WHERE index_name = 'INDEX_NAME';

Interpretation:

CF ≈ number of rows → ❌ Random reads (BAD index)
CF ≈ number of blocks → ✅ Efficient index

👉 High CF → more I/O → slower query

✅ Step 5: Check Selectivity (Very Important)

SELECT num_distinct, num_rows
FROM dba_tab_col_statistics
WHERE table_name = 'TABLE_NAME'
AND column_name = 'COLUMN_NAME';

👉 Formula:

Selectivity = num_distinct / num_rows

Result:

Value	Meaning
High (close to 1)	✅ Good index
Low (<0.1)	❌ Bad index

👉 Example:

status = 'ACTIVE' → bad index
customer_id → good index

✅ Step 6: Check If Index is Used Properly

ALTER INDEX schema.index_name MONITORING USAGE;

Then:

SELECT * FROM v$object_usage WHERE index_name = 'INDEX_NAME';

👉 Result:

USED = NO → Index useless
USED = YES but slow → Index is problem

✅ Step 7: Check Logical Reads (Performance Impact)

FROM v$segment_statistics
WHERE object_name = 'INDEX_NAME'
AND statistic_name = 'logical reads';

👉 High logical reads = heavy I/O → possible issue

✅ Step 8: Check Index BLEVEL (Depth)

SELECT index_name, blevel
FROM dba_indexes
WHERE index_name = 'INDEX_NAME';

BLEVEL	Meaning
0–2	✅ Good
3–4	⚠️ Ok
>4	❌ Too deep

✅ Step 9: Validate Stats

EXEC DBMS_STATS.GATHER_INDEX_STATS('SCHEMA','INDEX_NAME');

👉 Bad stats = wrong execution plan

✅ Step 10: Compare With Full Table Scan

Force test:

SELECT /*+ FULL(table_name) */ ..

👉 If FULL scan is faster → Index is hurting performance

🚨 Conclusion (Index Causing Issue If):

High clustering factor
Low selectivity
High logical reads
Not used or incorrectly used
Execution plan shows inefficiency

✅ PART 2 — How to Check Index is GOOD or BAD

🎯 Step-by-Step Evaluation Framework

✅ Step 1: Usage Check

SELECT * FROM v$object_usage

WHERE index_name = 'INDEX_NAME';

Result	Decision
USED frequently	✅ Good
NOT USED	❌ Drop candidate

✅ Step 2: Query Alignment

👉 Check if index matches WHERE clause:

WHERE col1 = :1 AND col2 = :2

👉 Good index:

(col1, col2)

👉 Bad index:

(col3, col4)

✅ Step 3: Selectivity Check

High distinct values → ✅ Good
Low distinct values → ❌ Bad

✅ Step 4: Clustering Factor

Low CF → ✅ Good
High CF → ❌ Bad

✅ Step 5: Size vs Benefit

SELECT bytes/1024/1024 MB
FROM dba_segments
WHERE segment_name='INDEX_NAME';

👉 Large + rarely used → ❌ Bad

✅ Step 6: DML Overhead Check

👉 Too many indexes on high DML table:

Slows INSERT/UPDATE/DELETE

SELECT index_name
FROM dba_indexes
WHERE table_name = 'TABLE_NAME';

👉 More indexes ≠ better

✅ Step 7: Redundancy Check

👉 Duplicate indexes = waste

✅ Step 8: BLEVEL + Leaf Blocks

👉 Small, compact index = good
👉 Huge fragmented index = bad

✅ FINAL DECISION MATRIX

Condition	Good Index ✅	Bad Index ❌
Usage	Frequent	Not used
Execution Plan	Index scan	Full scan
Selectivity	High	Low
Clustering Factor	Low	High
BLEVEL	≤2	>4
Size	Optimized	Huge unused
DML Impact	Low	High

🔥 REAL-WORLD QUICK CHECK SCRIPT

i.index_name,
i.clustering_factor,
i.num_rows,
i.distinct_keys,
s.bytes/1024/1024 MB
FROM dba_indexes i
JOIN dba_segments s
ON i.index_name = s.segment_name
WHERE i.owner = 'SCHEMA_NAME'
ORDER BY s.bytes DESC;

💡 Architect-Level Insight (Important)

👉 Don’t judge index in isolation

Always evaluate:

SQL → Execution Plan → Index → Data Pattern

✅ Recommended Approach (Best Practice)

Identify top SQL (AWR/V$SQL)
Analyze execution plan
Validate index usage
Check clustering factor + selectivity
Decide: KEEP / REBUILD / DROP

INDEX HEALTH CLASSIFICATION SCRIPT

set lines 300 pages 500

col OWNER for a10

col INDEX_NAME for a50

col TABLE_NAME for a50

WITH idx_stats AS (

SELECT

i.owner,

i.index_name,

i.table_name,

i.blevel,

i.clustering_factor,

i.num_rows,

i.distinct_keys,

s.bytes/1024/1024 AS size_mb,

CASE

WHEN i.num_rows > 0

THEN ROUND(i.distinct_keys / i.num_rows, 4)

ELSE 0

END AS selectivity

FROM dba_indexes i

JOIN dba_segments s

ON i.owner = s.owner

AND i.index_name = s.segment_name

WHERE i.owner IN ('RITRS')

AND CLUSTERING_FACTOR >700

usage_stats AS (

SELECT

index_name,

CASE

WHEN used = 'YES' THEN 'USED'

ELSE 'NOT_USED'

END AS usage_status

FROM v$object_usage

)

SELECT

i.owner,

i.index_name,

i.table_name,

i.size_mb,

i.blevel,

i.clustering_factor,

i.num_rows,

i.distinct_keys,

i.selectivity,

NVL(u.usage_status, 'UNKNOWN') AS usage_status,

CASE

WHEN NVL(u.usage_status, 'NOT_USED') = 'NOT_USED'

THEN 'DROP_CANDIDATE'

WHEN i.selectivity < 0.05

THEN 'BAD_INDEX_LOW_SELECTIVITY'

WHEN i.clustering_factor > i.num_rows * 0.9

THEN 'BAD_INDEX_HIGH_CF'

WHEN i.blevel > 4

THEN 'REBUILD_REQUIRED'

WHEN i.size_mb > 500

AND NVL(u.usage_status, 'NOT_USED') <> 'USED'

THEN 'REVIEW_BIG_UNUSED_INDEX'

ELSE 'GOOD_INDEX'

END AS index_health_status

FROM idx_stats i

LEFT JOIN usage_stats u

ON i.index_name = u.index_name

ORDER BY i.size_mb DESC

Alert Fix : 1 index(es) with too many extents in oracle database - oracle performance issue

🔍 What it Means

Oracle stores segments (tables, indexes) in extents (contiguous blocks).
Over time, due to growth, rebuilds, and DML, an index may end up with:
- Hundreds or thousands of small extents
This is usually flagged when:

extents > threshold (e.g., 1000 or configurable limit)

⚠️ Why It Matters

Generally not critical in modern Oracle (ASSM + LMT) environments.
But can still cause:
- Slight performance overhead in segment management
- Longer backup/restore times
- Inefficient disk usage in older systems
- Red flag in audits / compliance reports

👉 In most modern systems, it's more of a hygiene issue than a real performance problem.

✅ Identify the Problem Index

Run:

SELECT owner, segment_name, segment_type, extents

FROM dba_segments

WHERE segment_type = 'INDEX'

AND extents > 500 -- adjust threshold

ORDER BY extents DESC;

Or specifically:

SELECT owner, index_name

FROM dba_indexes

WHERE index_name IN (

SELECT segment_name

FROM dba_segments

WHERE extents > 500

);

🛠️ Fix Options

✅ Option 1: Rebuild the Index (Recommended)

ALTER INDEX schema_name.index_name REBUILD;

Benefits:

Reduces number of extents
Compacts index
Eliminates fragmentation

👉 If large index:

ALTER INDEX schema_name.index_name REBUILD ONLINE;

✅ Option 2: Rebuild with Storage Clause

To control extent allocation:

ALTER INDEX schema_name.index_name REBUILD

STORAGE (INITIAL 100M NEXT 100M);

👉 Helps avoid many small extents in future.

✅ Option 3: Move to Different Tablespace

If fragmentation is due to tablespace issues:

ALTER INDEX schema_name.index_name

REBUILD TABLESPACE new_tablespace;

✅ Option 4: Coalesce (Less Effective)

ALTER INDEX schema_name.index_name COALESCE;

Merges adjacent leaf blocks
Does NOT reduce extents
Safer but less impactful

🧠 Root Cause Analysis

Common reasons:

Autoallocate tablespace creating many uneven extents
Frequent index growth/shrink cycles
Old dictionary-managed tablespaces (DMT)
Poor storage parameters
Partitioned index mismanagement

🎯 Best Practices (Going Forward)

✅ Use:

Locally Managed Tablespaces (LMT)
AUTOALLOCATE or UNIFORM extent size

✅ For large indexes:

Predefine proper extent sizes:

CREATE INDEX idx_name
ON table_name(column)
TABLESPACE ts_name
STORAGE (INITIAL 128M NEXT 128M);

✅ Monitor periodically:

SELECT segment_name, extents

FROM dba_segments

WHERE extents > 500;

🚨 When You Can Ignore It

You can safely ignore if:

Database uses ASSM + LMT
No performance issues observed
Extents are auto-managed efficiently

✅ Recommendation for You (Database Architect Perspective)

Given your role, I’d suggest:

✅ Validate index size vs extents ratio
✅ Rebuild only if:
- Extents are very high (>1000)
- Index is frequently accessed
✅ Automate check in health scripts
✅ Standardize storage clauses for large indexes

💬 Summary

Issue	Action
Too many extents	Rebuild index
Fragmentation	Rebuild or move
Minor issue	Ignore if no perf impact

Tuesday, May 26, 2026

Is CPU issue ? troubleshooting workflow for oracle database performance issue with automation

✅ ✅ 1. CPU Troubleshooting Framework (Like iostat for CPU)

Use:

vmstat 2 5

top

📊 ✅ 2. CPU Metrics Explained (vmstat / top)

us sy id wa st

Metric	Meaning
us	User CPU (app/DB queries)
sy	System CPU (kernel/syscalls)
id	Idle CPU
wa	IO wait
st	Steal (VM contention)

✅ ✅ 3. CPU Baseline Table (Production Standard)

📊 CPU Utilization

Metric	Good ✅	Warning ⚠️	Critical 🚨
idle (id)	> 40%	20–40%	< 20%
user (us)	< 50%	50–70%	> 70%
system (sy)	< 20%	20–30%	> 30%
iowait (wa)	< 5%	5–15%	> 20%
steal (st)	0	1–5	> 5

📊 Run Queue (Very Important)

r (run queue)

Metric	Good ✅	Warning ⚠️	Critical 🚨
r vs CPU cores	≤ cores	1.5x cores	2x+ cores

✅ ✅ 4. CPU Health Decision Matrix

Condition	Meaning
High us	CPU-heavy queries
High sy	kernel/system overhead
High wa	not CPU → disk issue
High r	CPU contention
High st	VM issue

✅ ✅ 5. CPU Scoring Model (Like Disk)

🎯 Formula

Score = 100 
- user penalty 
- system penalty 
- run queue penalty 
- iowait penalty

📊 Penalties

✅ User CPU

<50 → 0
50–70 → -15
70 → -30

✅ System CPU

<20 → 0
20–30 → -15
30 → -30

✅ Run Queue

≤ cores → 0
1–2x cores → -20
2x cores → -40

✅ IO Wait

<5 → 0
5–15 → -20
15 → -40

✅ Final Status

Score	Status
80–100	✅ Healthy
60–80	⚠️ Warning
40–60	🔶 Degraded
<40	🚨 Critical

✅ ✅ 6. CPU Monitoring Script (PRODUCTION READY)

📜 `cpu_health.sh`

#!/bin/bash

echo "===== CPU Health Check ====="
date

vmstat 2 3 | tail -1 | awk '
{
r=$1
us=$13
sy=$14
id=$15
wa=$16
st=$17

score=100

# user penalty
if (us > 70) score -= 30
else if (us > 50) score -= 15

# system penalty
if (sy > 30) score -= 30
else if (sy > 20) score -= 15

# iowait penalty
if (wa > 15) score -= 40
else if (wa > 5) score -= 20

# status
status="HEALTHY"
if (score < 40) status="CRITICAL"
else if (score < 60) status="DEGRADED"
else if (score < 80) status="WARNING"

printf "RunQ=%d User=%d%% Sys=%d%% Idle=%d%% IOwait=%d%% Status=%s Score=%d\n",r,us,sy,id,wa,status,score
}
'

Identify CPU Bottleneck (Root Cause)

✅ Step 1: Check top processes

top -o %CPU

✅ Step 2: Per process CPU

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head

✅ Step 3: Thread-level (VERY IMPORTANT for DB)

top -H

✅ Step 4: Per process breakdown

pidstat -u 2

✅ ✅ 8. CPU Issue Patterns (DBA Mapping)

🔴 Pattern 1: High USER CPU

us > 70%

👉 Cause:

Complex SQL
Full scans
bad execution plans

✅ DB Mapping:

SELECT sql_id, cpu_time

FROM v$sql

ORDER BY cpu_time DESC FETCH FIRST 10 ROWS;

🔴 Pattern 2: High SYSTEM CPU

sy > 30%

👉 Cause:

excessive context switching
kernel overhead
I/O interrupts

✅ Check:

vmstat 1

Look at:

cs (context switches)
in (interrupts)

🔴 Pattern 3: High RUN QUEUE

r >> CPU cores

👉 Cause:

CPU contention
too many concurrent queries

🔴 Pattern 4: High IO WAIT

wa > 20%

👉 Cause:

NOT CPU problem
👉 disk bottleneck (go back to iostat)

🔴 Pattern 5: High STEAL (VM)

st > 5%

👉 Cause:

host contention (cloud/VM issue)

✅ ✅ 9. CPU Alert Script (Automation)

📜 `cpu_alert.sh`

#!/bin/bash

HOST=$(hostname)

ALERT=$(vmstat 2 2 | tail -1 | awk '
{
us=$13; sy=$14; wa=$16; st=$17

if (us>70 || sy>30 || wa>15 || st>5)
print "CPU Issue: user="us"% sys="sy"% wa="wa"% st="st"%"
}
')

if [ ! -z "$ALERT" ]; then
echo "$ALERT" | mail -s "CPU Alert on $HOST" your_email@company.com
fi
``

✅ ✅ 10. DBA Troubleshooting Flow

🔥 When CPU alert triggers:

1. Check:

vmstat 1 5

2. Identify pattern:

High us → SQL issue
High sy → OS/kernel
High wa → disk issue

3. Identify process:

top

4. Map to DB:

SELECT sql_id, cpu_time FROM v$sql ORDER BY cpu_time DESC;

5. Fix:

Tune query ✅
Index ✅
Reduce parallelism ✅
Limit sessions ✅

✅ ✅ 11. Golden Rules (CPU)

✅ Healthy

idle > 40%
wa < 5%
r <= cores

🚨 Critical CPU

idle < 10%
us > 70%
r > 2x cores

⚠️ Trick Case (VERY IMPORTANT)

CPU looks high BUT wa is high
👉 NOT CPU issue → DISK issue

🎯 Final Outcome

Now you have:

✅ CPU baseline table
✅ Health scoring model
✅ Monitoring script
✅ Alerting system
✅ DB correlation
✅ Troubleshooting flow