Tuesday, May 26, 2026

iostat linux command deep drive to troubleshooting the performance issue

iostat -xm 2 5 | awk '$1 ~ /^(sd|dm)/ && $NF > 40 {printf "%-10s %s\n",$1,$NF"%"}'

iostat -xm 2 5 | awk '$NF > 40 {print}'

iostat -xm 2 5 | awk '/Device/ {print; next}$1 ~ /^(sd|dm)/ && $NF > 90 {print}'

📌 Header Breakdown (Deep Explanation)

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

✅ 1. Device

Logical or physical disk name
- sdX → physical disks
- dm-X → device mapper (LVM, ASM, multipath)

👉 In your case:

dm-* = logical volumes / DB storage layers

✅ 2. rrqm/s (Read Requests Merged per second)

Number of read requests merged by OS scheduler

Why merging matters:

OS combines adjacent reads to reduce I/O calls

👉 Example:

10 small reads → merged → 1 large read

✅ Interpretation:

High value → efficient sequential I/O
Zero → either random I/O or already optimized

✅ 3. wrqm/s (Write Requests Merged per second)

Same as above but for writes

✅ High value:

Good for sequential writes (e.g., redo logs, batch loads)

✅ 4. r/s (Reads per second)

Number of read I/O operations per second

Interpretation:

High r/s = high IOPS (random access likely)

✅ 5. w/s (Writes per second)

Number of write operations per second

👉 Together with r/s:

Indicates workload type:
- OLTP → high r/s + w/s, small IO
- Analytics → lower r/s but large I/O size

✅ 6. rMB/s (Read throughput in MB/sec)

Total data read per second

✅ 7. wMB/s (Write throughput in MB/sec)

Total data written per second

🔎 Important:

Pattern	Meaning
High r/s + low rMB/s	small random IO
Low r/s + high rMB/s	large sequential IO

✅ 8. avgrq-sz (Average Request Size)

Average size of each I/O request (in KB)

Formula:

avgrq-sz = (total sectors read+written) / total I/O ops

Interpretation:

Value	Meaning
< 32 KB	random IO (OLTP)
64–256 KB	mixed
~1024 KB (1MB)	sequential scan

✅ 9. avgqu-sz (Average Queue Length)

Number of I/O requests waiting in queue

🚨 Critical metric:

Value	Impact
< 1	healthy
1–5	moderate
10+	pressure
20+	severe bottleneck

👉 High value means:

Disk is overloaded
Requests are waiting → latency increase

✅ 10. await (Average Wait Time in ms)

Total time for I/O request:
```
wait time = queue time + service time
```

🚨 Thresholds:

Value	Meaning
< 5 ms	excellent
5–20 ms	acceptable
20–50 ms	warning
> 50 ms	serious issue

👉 This is the most important latency metric

✅ 11. r_await (Read latency)

Avg time for read requests

✅ 12. w_await (Write latency)

Avg time for write requests

Why split matters:

Helps identify:
- read-heavy issues (full scan)
- write bottlenecks (redo/log/file sync)

✅ 13. svctm (Service Time)

Time taken by disk to service request
Does NOT include queue time

Important:

await ≈ svctm + queue delay

Interpretation:

Case	Meaning
await ≈ svctm	no queue bottleneck
await >> svctm	queue contention

👉 This is key for bottleneck detection

✅ 14. %util (Utilization)

Percentage of time disk was busy

🚨 Interpretation:

Value	Meaning
< 60%	safe
60–80%	moderate
80–90%	high
> 90%	saturated

👉 BUT:

Must combine with await + queue

🔥 Important Combined Interpretation

✅ Case 1 (Healthy high usage)

%util = 95%
await = 1 ms
avgqu-sz = 1

✔ Efficient disk

🚨 Case 2 (Bottleneck)

%util = 99%
await = 80 ms
avgqu-sz = 20

❌ Disk saturation + queue buildup

🧠 How You Should Read Header (DBA Cheat Sheet)

Step-by-step analysis:

Check %util
- 90 → possible saturation
Check avgqu-sz
- High → queue backlog
Check await
- Confirms latency impact
Compare await vs svctm
- Big gap → queue delay
Check avgrq-sz
- Understand workload type

🎯 Why This Matters for You (Database Architect)

This header directly helps identify:

✅ DB Issues Mapping

Metric	DB Problem
High rMB/s + large avgrq-sz	full table scan
High r/s, low size	index lookup
High w_await	commit / redo issues
High avgqu-sz	storage contention
High await	slow queries

✅ Final Summary

r/s, w/s → IOPS
rMB/s, wMB/s → throughput
avgrq-sz → IO size (random vs sequential)
avgqu-sz → pressure indicator 🚨
await → real latency 🚨
%util → saturation signal

Monday, May 25, 2026

Step-by-Step HugePages Configuration (Oracle 19c on Linux)

🔹 Step 1: Check Current HugePages Status

grep Huge /proc/meminfo

Key parameters:

HugePages_Total
HugePages_Free
Hugepagesize (usually 2 MB)

🔹 Step 2: Disable Transparent HugePages (THP)

Oracle recommends disabling THP.

Check status:

cat /sys/kernel/mm/transparent_hugepage/enabled

Disable temporarily:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

echo never > /sys/kernel/mm/transparent_hugepage/defrag

Disable permanently:

Edit:

vi /etc/default/grub

Add:

transparent_hugepage=never

Then apply:

grub2-mkconfig -o /boot/grub2/grub.cfg

reboot

🔹 Step 3: Calculate Required HugePages

3.1 Check Oracle SGA size

From SQL:

show parameter sga_target;

or:

show parameter memory_target;

3.2 Use Oracle Script (Recommended)

$ORACLE_HOME/bin/hugepages_settings.sh

If not available, download from Oracle MOS.

3.3 Manual Calculation

Formula:

HugePages = Total_SGA_Size / HugePage_Size

Example:

SGA = 64 GB
Page size = 2 MB

64 * 1024 MB / 2 MB = 32768 HugePages

🔹 Step 4: Configure HugePages in Kernel

Edit:

vi /etc/sysctl.conf

Add/update:

vm.nr_hugepages=32768

Apply:

sysctl -p

🔹 Step 5: Configure memlock for Oracle User

Edit:

vi /etc/security/limits.conf

Add:

oracle soft memlock 67108864

oracle hard memlock 67108864

👉 Value should match SGA size (in KB)

Example:

64 GB = 67108864 KB

🔹 Step 6: Disable AMM (Very Important)

HugePages does NOT work with AMM.

Check:

show parameter memory_target;

If > 0, disable:

ALTER SYSTEM SET memory_target=0 SCOPE=SPFILE;

ALTER SYSTEM SET memory_max_target=0 SCOPE=SPFILE;

Instead, use ASMM:

ALTER SYSTEM SET sga_target=64G SCOPE=SPFILE;

ALTER SYSTEM SET pga_aggregate_target=8G SCOPE=SPFILE;

Restart DB:

shutdown immediate;

startup;

🔹 Step 7: Restart Server

reboot

🔹 Step 8: Validate HugePages Usage

grep Huge /proc/meminfo

Check:

HugePages_Total → should match configured value
HugePages_Free → should decrease after DB start
HugePages_Rsvd → reserved pages

🔹 Step 9: Verify Oracle is Using HugePages

grep -i huge /proc/meminfo

Also check alert log:

grep HugePages $ORACLE_BASE/diag/rdbms/*/*/trace/alert*.log

Expected message:

Huge Pages allocation successful

✅ Best Practices

✔ Always leave some RAM for OS (not 100% HugePages)
✔ Use static SGA (ASMM)
✔ Set HugePages slightly higher than requirement
✔ Monitor with:

vmstat

free -g

⚠️ Common Mistakes

❌ Using AMM (memory_target)
❌ Not setting memlock
❌ THP not disabled
❌ Underestimating number of HugePages

🚀 Quick Example Summary

For a 64 GB SGA:

Setting	Value
HugePage size	2 MB
Required pages	32768
memlock	67108864 KB
vm.nr_hugepages	32768

Oracle Script for huge_page setting

vi hugepages_settings.sh

#!/bin/bash

# hugepages_settings.sh

# Linux bash script to compute values for the

# recommended HugePages/HugeTLB configuration

# on Oracle Linux

# Note: This script does calculation for all shared memory

# segments available when the script is run, no matter it

# is an Oracle RDBMS shared memory segment or not.

# This script is provided by KB151310 from My Oracle Support

# http://support.oracle.com

# Welcome text

echo "

This script is provided by KB151310 from My Oracle Support

(http://support.oracle.com) where it is intended to compute values for

the recommended HugePages/HugeTLB configuration for the current shared

memory segments on Oracle Linux. Before proceeding with the execution please note following:

* For ASM instance, it needs to configure ASMM instead of AMM.

* The 'pga_aggregate_target' is outside the SGA and

you should accommodate this while calculating the overall size.

* In case you changes the DB SGA size,

as the new SGA will not fit in the previous HugePages configuration,

it had better disable the whole HugePages,

start the DB with new SGA size and run the script again.

And make sure that:

* Oracle Database instance(s) are up and running

* Oracle Database Automatic Memory Management (AMM) is not setup

(See KB83222)

* The shared memory segments can be listed by command:

# ipcs -m

Press Enter to proceed..."

read

# Check for the kernel version

KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`

# Find out the HugePage size

HPG_SZ=`grep Hugepagesize /proc/meminfo | awk '{print $2}'`

if [ -z "$HPG_SZ" ];then

echo "The hugepages may not be supported in the system where the script is being executed."

exit 1

# Initialize the counter

NUM_PG=0

# Cumulative number of pages required to handle the running shared memory segments

for SEG_BYTES in `ipcs -m | cut -c44-300 | awk '{print $1}' | grep "[0-9][0-9]*"`

MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`

if [ $MIN_PG -gt 0 ]; then

NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`

done

RES_BYTES=`echo "$NUM_PG * $HPG_SZ * 1024" | bc -q`

# An SGA less than 100MB does not make sense

# Bail out if that is the case

if [ $RES_BYTES -lt 100000000 ]; then

echo "***********"

echo "** ERROR **"

echo "***********"

echo "Sorry! There are not enough total of shared memory segments allocated for

HugePages configuration. HugePages can only be used for shared memory segments

that you can list by command:

# ipcs -m

of a size that can match an Oracle Database SGA. Please make sure that:

* Oracle Database instance is up and running

* Oracle Database Automatic Memory Management (AMM) is not configured"

exit 1

# Finish with results

echo "Recommended setting: vm.nr_hugepages = $NUM_PG";

# End

Monday, May 18, 2026

step‑by‑step guide to check CPU sockets, cores, threads, and multithreading (Hyper‑Threading/SMT)

Step‑by‑step guide to check CPU sockets, cores, threads, and multithreading (Hyper‑Threading/SMT)

✅ 1. BASIC UNDERSTANDING (IMPORTANT)

Term	Meaning
Socket	Physical CPU installed on motherboard
Core	Physical processing unit inside CPU
Thread	Logical CPU (via Hyper‑Threading / SMT)
Multithreading Enabled?	If Threads > Cores

👉 Example:

1 socket, 4 cores, 8 threads → Hyper‑Threading ENABLED
1 socket, 4 cores, 4 threads → Hyper‑Threading DISABLED

🐧 2. LINUX – STEP BY STEP

🔹 Step 1: Check CPU details

Run:

lscpu

Example output:

CPU(s):              8
Socket(s):           1
Core(s) per socket:  4
Thread(s) per core:  2

Interpret:

Sockets = 1
Cores = 4
Threads = 8
Thread per core = 2 → Multithreading ENABLED ✅

🔹 Step 2: Check using /proc

cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l

👉 Gives number of sockets

cat /proc/cpuinfo | grep "cpu cores" | uniq

👉 Shows cores per socket

cat /proc/cpuinfo | grep "processor" | wc -l

👉 Total logical CPUs (threads)

🔹 Step 3: Quick single command summary

nproc

👉 gives total threads

🔹 Step 4: Check Hyper‑Threading explicitly

lscpu | grep "Thread"

If:

Thread(s) per core: 2

👉 ✅ Hyper‑Threading ON

If:

Thread(s) per core: 1

👉 ❌ OFF

🪟 3. WINDOWS – STEP BY STEP

🔹 Step 1: Task Manager (GUI method)

Press Ctrl + Shift + Esc
Go to Performance tab → CPU
Look at:

Sockets: X
Cores: Y
Logical processors: Z

👉 Example:

Cores = 4
Logical processors = 8
✅ Multithreading ENABLED

🔹 Step 2: Command Prompt

Run:

wmic cpu get NumberOfCores,NumberOfLogicalProcessors

Output:

NumberOfCores  NumberOfLogicalProcessors
4              8

👉 Threads (LogicalProcessors) > Cores
✅ Hyper‑Threading ON

🔹 Step 3: PowerShell (recommended)

Get-WmiObject Win32_Processor | Select NumberOfCores, NumberOfLogicalProcessors, SocketDesignation

OR modern cmdlet:

Get-CimInstance Win32_Processor | Select NumberOfCores,NumberOfLogicalProcessors

🔹 Step 4: System Information

Run:

msinfo32

Check:

Processor
Logical processors

🔍 4. HOW TO TELL IF MULTITHREADING IS ENABLED

✅ Rule:

If Threads > Cores → ENABLED
If Threads = Cores → DISABLED

Example:

Cores	Threads	Status
4	8	✅ Enabled
8	16	✅ Enabled
4	4	❌ Disabled

🔧 5. BIOS LEVEL CHECK (Important)

Even if OS shows cores/threads, final control is in BIOS.

Steps:

Reboot system
Enter BIOS (F2 / DEL / ESC)
Look for:
- Intel Hyper‑Threading
- AMD SMT (Simultaneous Multithreading)

Settings:

Enabled ✅ → uses threads
Disabled ❌ → only physical cores

✅ 6. QUICK CHEAT SHEET

Linux

lscpu

Windows

wmic cpu get NumberOfCores,NumberOfLogicalProcessors

🚀 FINAL SUMMARY

Platform	Command	What to Check
Linux	`lscpu`	Threads per core
Linux	`/proc/cpuinfo`	cores, sockets
Windows	Task Manager	Logical processors
Windows	`wmic`	cores vs threads

SQL> SELECT stat_name, value

FROM v$osstat

WHERE stat_name IN ('NUM_CPUS','NUM_CPU_CORES','NUM_CPU_SOCKETS');

`` 2 3

STAT_NAME VALUE

------------------------------ ----------

NUM_CPUS 8

NUM_CPU_CORES 8

NUM_CPU_SOCKETS 2

SQL>

Interpretation:

NUM_CPUS = logical CPUs visible to the OS (includes HT threads)

NUM_CPU_CORES = physical cores

If NUM_CPUS > NUM_CPU_CORES → SMT/HT enabled

If NUM_CPUS = NUM_CPU_CORES → SMT/HT disabled

SQL> !cat /proc/cpuinfo | egrep "processor|cpu cores|siblings|physical id"

physical id : 0

siblings : 4

cpu cores : 4

processor : 1

cpu cores = physical cores per socket

siblings = logical CPUs per socket

If siblings > cpu cores → SMT/HT enabled

If siblings == cpu cores → SMT/HT disabled

Friday, May 15, 2026

Compare ELK Stack vs Splunk vs Prometheus + Grafana

🔷 1. What Each Stack Represents

Stack	Components	Focus Area
ELK Stack	Elasticsearch + Logstash + Kibana (+ Beats)	Logging & search
Splunk	Single integrated platform	Logging + analytics + security
Prometheus + Grafana	Prometheus + Alertmanager + Grafana	Metrics & monitoring

🔷 2. Architecture Overview

✅ ELK Stack (Open-source logging stack)

Sources → Beats/Logstash → Elasticsearch → Kibana

Logstash/Beats → Collect logs
Elasticsearch → Store & index logs
Kibana → Visualize logs

👉 Fully open-source (except Elastic licensing changes in newer versions)

✅ Splunk (Enterprise platform)

Sources → Forwarders → Splunk Indexer → Splunk UI

Collects logs via agents
Indexes data internally
Built-in dashboards + query engine

👉 Everything in one ecosystem

✅ Prometheus + Grafana (Monitoring stack)

Applications → Prometheus → Grafana → Alerts

Prometheus → collects metrics (pull model)
Grafana → visualization
Alertmanager → alerts

👉 Works best for real-time system health metrics

🔷 3. Core Differences (Very Important)

🔹 Data Type Focus

Tool	Data Type
Prometheus	Metrics (numbers, time-series)
ELK	Logs (text, JSON, events)
Splunk	Logs + Events + Metrics

🔹 Ease of Use

Tool	Complexity
Splunk	✅ Easiest (plug-and-play)
ELK	⚠️ Moderate (setup + tuning needed)
Prometheus	✅ Easy for metrics, not logs

🔹 Cost

Tool	Cost
Prometheus + Grafana	✅ Free
ELK Stack	✅ Mostly free (some paid features)
Splunk	❌ Very expensive (license based on data volume)

🔹 Query Language

Tool	Query
Prometheus	PromQL
ELK	Lucene / KQL
Splunk	SPL (very powerful)

🔷 4. When to Use What (Real Scenarios)

✅ Use Prometheus + Grafana when:

Monitoring DB performance (CPU, IO, connections)
Kubernetes / microservices environment
Need real-time alerting

👉 Example for you:

Oracle DB metrics
MongoDB performance monitoring
API latency tracking

✅ Use ELK Stack when:

You want centralized logging
Need search + troubleshooting
Want open-source flexibility

👉 Example:

DB audit logs
Application logs
Slow query logs

✅ Use Splunk when:

Enterprise-grade security + compliance (SOX, audit)
Need correlation across logs + events
Budget is not a constraint

👉 Example:

Audit trails (critical for SOX 👀)
Fraud detection
Security monitoring (SIEM)

🔷 5. Feature Comparison

Feature	ELK	Splunk	Prometheus
Log Management	✅	✅	❌
Metrics Monitoring	⚠️ Limited	✅	✅
Visualization	✅ (Kibana)	✅	✅ (Grafana)
Alerting	✅	✅	✅
AI/ML Insights	⚠️ Limited	✅ Strong	⚠️ Basic
Scalability	✅ High	✅ Very High	✅ High
Setup Effort	⚠️ Medium	✅ Easy	✅ Easy

🔷 6. Real-World Architecture (Recommended)

As a Database Architect, best practice is NOT to choose one—combine them:

✅ Modern Observability Stack

                ┌──────────────┐
                │ Applications │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        │                             │
   Metrics                        Logs
        │                             │
Prometheus                    ELK / Splunk
        │                             │
   Grafana                     Kibana / Splunk UI

🔷 7. Practical Recommendation for YOU

Since you're aiming toward Databricks / modern architecture, here’s a strong stack:

✅ Ideal Stack:

Prometheus + Grafana → Monitoring
ELK (or OpenSearch) → Logging (cost-effective)
Splunk (optional) → Only if:
- You need SOX compliance
- Heavy audit requirements

🔷 8. Quick Analogy

Tool	Analogy
Prometheus	Heart rate monitor
Grafana	Dashboard display
ELK	Detailed diary/logbook
Splunk	Intelligent investigator

✅ Final Summary (Simple)

Prometheus + Grafana → Monitoring (numbers)
ELK → Logging (open source)
Splunk → Enterprise logging + security

🔥 Pro Tip (Important for your role)

Given your SOX + DB architecture interest, focus on:

Prometheus → DB health monitoring
ELK/Splunk → audit logging + compliance
Grafana → unified dashboard for both

What and where we use Grafana, Splunk, and Prometheus ?

Grafana, Splunk, and Prometheus are all popular tools used in monitoring, logging, and observability, but they serve different purposes and are often used together rather than being direct replacements.

Here’s a clear breakdown:

🔹 1. High-Level Difference

Tool	Primary Purpose	What it Does Best
Grafana	Visualization & dashboards	Displays data from multiple sources
Prometheus	Metrics collection & monitoring	Stores and queries time-series metrics
Splunk	Log management & analytics	Centralized log search & analysis

🔹 2. Detailed Explanation

✅ Grafana

Role: Visualization layer
Type: Dashboarding tool
Core Function:
Displays data from sources like Prometheus, Elasticsearch, SQL DBs, etc.

Key Features:

Rich dashboards & graphs
Alerts & notifications
Supports multiple data sources
Great UI for business + tech users

Example:

You can use Grafana to:

Show CPU usage graphs
Visualize API latency trends
Create executive dashboards

👉 Important: Grafana does NOT store data itself (except limited cases). It reads from data sources.

✅ Prometheus

Role: Monitoring & metrics collection
Type: Time-series database
Core Function:
Collects numerical metrics from systems and stores them.

Key Features:

Pull-based monitoring (scrapes metrics from endpoints)
Uses PromQL (query language)
Built-in alerting (Alertmanager)
Ideal for Kubernetes / cloud-native setups

Example Metrics:

CPU usage
Memory consumption
Request rates
Error counts

👉 Prometheus is focused on metrics only, not logs.

✅ Splunk

Role: Log management & analytics platform
Type: Enterprise data platform
Core Function:
Collects, indexes, and analyzes machine-generated data (logs).

Key Features:

Powerful log search (SPL – Search Processing Language)
Real-time log monitoring
Security (SIEM), compliance, and audit use cases
Handles structured & unstructured data

Example Logs:

Application logs
Server logs
Security logs
Audit trails

👉 Splunk is not just monitoring—it’s full-scale data analytics + security platform.

🔹 3. Key Differences (Simple View)

Feature	Grafana	Prometheus	Splunk
Stores Data	❌ No (mostly)	✅ Yes	✅ Yes
Data Type	Visual output	Metrics (numbers)	Logs + events
Query Language	Depends on source	PromQL	SPL
Use Case	Dashboards	Monitoring systems	Log analysis & security
Cost	Free/Open Source	Free/Open Source	Paid (expensive)

🔹 4. How They Work Together

A very common real-world setup:

Prometheus → collects metrics
       ↓
Grafana → visualizes metrics

And separately:

Applications → send logs → Splunk → analyze/search logs

👉 In modern architectures:

Prometheus + Grafana = Monitoring stack
Splunk = Logging + Security analytics

🔹 5. When to Use What

Use Prometheus when:

You need system/service monitoring
You work with Kubernetes or microservices
You need metrics-based alerting

Use Grafana when:

You want beautiful dashboards
You need to visualize multiple data sources
You want business + technical insights

Use Splunk when:

You need centralized logging
You need security monitoring (SIEM)
You want deep log analysis and auditing

🔹 6. Easy Analogy

Think of a car dashboard:

Prometheus → collects speed, fuel, engine data
Grafana → shows it neatly on the dashboard
Splunk → records everything that happened (every trip, error, warning)

✅ Final Summary

Grafana → “Show me the data”
Prometheus → “Collect and store metrics”
Splunk → “Analyze logs and events”