Tuesday, May 26, 2026

iostat linux command deep drive to troubleshooting the performance issue

  iostat -xm 2 5 | awk '$1 ~ /^(sd|dm)/ && $NF > 40 {printf "%-10s %s\n",$1,$NF"%"}'

 iostat -xm 2 5 | awk '$NF > 40 {print}'

 iostat -xm 2 5 | awk '/Device/ {print; next}$1 ~ /^(sd|dm)/ && $NF > 90 {print}'


📌 Header Breakdown (Deep Explanation)

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

✅ 1. Device

  • Logical or physical disk name
    • sdX → physical disks
    • dm-X → device mapper (LVM, ASM, multipath)

👉 In your case:

  • dm-* = logical volumes / DB storage layers

✅ 2. rrqm/s (Read Requests Merged per second)

  • Number of read requests merged by OS scheduler

Why merging matters:

  • OS combines adjacent reads to reduce I/O calls

👉 Example:

  • 10 small reads → merged → 1 large read

Interpretation:

  • High value → efficient sequential I/O
  • Zero → either random I/O or already optimized

✅ 3. wrqm/s (Write Requests Merged per second)

  • Same as above but for writes

✅ High value:

  • Good for sequential writes (e.g., redo logs, batch loads)

✅ 4. r/s (Reads per second)

  • Number of read I/O operations per second

Interpretation:

  • High r/s = high IOPS (random access likely)

✅ 5. w/s (Writes per second)

  • Number of write operations per second

👉 Together with r/s:

  • Indicates workload type:
    • OLTP → high r/s + w/s, small IO
    • Analytics → lower r/s but large I/O size

✅ 6. rMB/s (Read throughput in MB/sec)

  • Total data read per second

✅ 7. wMB/s (Write throughput in MB/sec)

  • Total data written per second

🔎 Important:

PatternMeaning
High r/s + low rMB/ssmall random IO
Low r/s + high rMB/slarge sequential IO

✅ 8. avgrq-sz (Average Request Size)

  • Average size of each I/O request (in KB)

Formula:

avgrq-sz = (total sectors read+written) / total I/O ops

Interpretation:

ValueMeaning
< 32 KBrandom IO (OLTP)
64–256 KBmixed
~1024 KB (1MB)sequential scan

✅ 9. avgqu-sz (Average Queue Length)

  • Number of I/O requests waiting in queue

🚨 Critical metric:

ValueImpact
< 1healthy
1–5moderate
10+pressure
20+severe bottleneck

👉 High value means:

  • Disk is overloaded
  • Requests are waiting → latency increase

✅ 10. await (Average Wait Time in ms)

  • Total time for I/O request:
    wait time = queue time + service time
    

🚨 Thresholds:

ValueMeaning
< 5 msexcellent
5–20 msacceptable
20–50 mswarning
> 50 msserious issue

👉 This is the most important latency metric


✅ 11. r_await (Read latency)

  • Avg time for read requests

✅ 12. w_await (Write latency)

  • Avg time for write requests

Why split matters:

  • Helps identify:
    • read-heavy issues (full scan)
    • write bottlenecks (redo/log/file sync)

✅ 13. svctm (Service Time)

  • Time taken by disk to service request
  • Does NOT include queue time

Important:

await ≈ svctm + queue delay

Interpretation:

CaseMeaning
await ≈ svctmno queue bottleneck
await >> svctmqueue contention

👉 This is key for bottleneck detection


✅ 14. %util (Utilization)

  • Percentage of time disk was busy

🚨 Interpretation:

ValueMeaning
< 60%safe
60–80%moderate
80–90%high
> 90%saturated

👉 BUT:

  • Must combine with await + queue

🔥 Important Combined Interpretation

✅ Case 1 (Healthy high usage)

%util = 95%
await = 1 ms
avgqu-sz = 1

✔ Efficient disk


🚨 Case 2 (Bottleneck)

%util = 99%
await = 80 ms
avgqu-sz = 20

❌ Disk saturation + queue buildup


🧠 How You Should Read Header (DBA Cheat Sheet)

Step-by-step analysis:

  1. Check %util

    • 90 → possible saturation

  2. Check avgqu-sz

    • High → queue backlog
  3. Check await

    • Confirms latency impact
  4. Compare await vs svctm

    • Big gap → queue delay
  5. Check avgrq-sz

    • Understand workload type

🎯 Why This Matters for You (Database Architect)

This header directly helps identify:

✅ DB Issues Mapping

MetricDB Problem
High rMB/s + large avgrq-szfull table scan
High r/s, low sizeindex lookup
High w_awaitcommit / redo issues
High avgqu-szstorage contention
High awaitslow queries

✅ Final Summary

  • r/s, w/s → IOPS
  • rMB/s, wMB/s → throughput
  • avgrq-sz → IO size (random vs sequential)
  • avgqu-sz → pressure indicator 🚨
  • await → real latency 🚨
  • %util → saturation signal

Monday, May 25, 2026

Step-by-Step HugePages Configuration (Oracle 19c on Linux)

 

 Step-by-Step HugePages Configuration (Oracle 19c on Linux)


🔹 Step 1: Check Current HugePages Status

grep Huge /proc/meminfo

Key parameters:

  • HugePages_Total
  • HugePages_Free
  • Hugepagesize (usually 2 MB)

🔹 Step 2: Disable Transparent HugePages (THP)

Oracle recommends disabling THP.

Check status:

cat /sys/kernel/mm/transparent_hugepage/enabled

Disable temporarily:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Disable permanently:

Edit:

vi /etc/default/grub

Add:

transparent_hugepage=never

Then apply:

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

🔹 Step 3: Calculate Required HugePages

3.1 Check Oracle SGA size

From SQL:

show parameter sga_target;

or:

show parameter memory_target;


3.2 Use Oracle Script (Recommended)

$ORACLE_HOME/bin/hugepages_settings.sh

If not available, download from Oracle MOS.


3.3 Manual Calculation

Formula:

HugePages = Total_SGA_Size / HugePage_Size

Example:

  • SGA = 64 GB
  • Page size = 2 MB
64 * 1024 MB / 2 MB = 32768 HugePages

🔹 Step 4: Configure HugePages in Kernel

Edit:

vi /etc/sysctl.conf

Add/update:

vm.nr_hugepages=32768

Apply:

sysctl -p


🔹 Step 5: Configure memlock for Oracle User

Edit:

vi /etc/security/limits.conf

Add:

oracle soft memlock 67108864
oracle hard memlock 67108864

👉 Value should match SGA size (in KB)

Example:

64 GB = 67108864 KB

🔹 Step 6: Disable AMM (Very Important)

HugePages does NOT work with AMM.

Check:

show parameter memory_target;

If > 0, disable:

ALTER SYSTEM SET memory_target=0 SCOPE=SPFILE;
ALTER SYSTEM SET memory_max_target=0 SCOPE=SPFILE;

Instead, use ASMM:

ALTER SYSTEM SET sga_target=64G SCOPE=SPFILE;
ALTER SYSTEM SET pga_aggregate_target=8G SCOPE=SPFILE;

Restart DB:

shutdown immediate;
startup;

🔹 Step 7: Restart Server

reboot

🔹 Step 8: Validate HugePages Usage

grep Huge /proc/meminfo

Check:

  • HugePages_Total → should match configured value
  • HugePages_Free → should decrease after DB start
  • HugePages_Rsvd → reserved pages

🔹 Step 9: Verify Oracle is Using HugePages

grep -i huge /proc/meminfo

Also check alert log:

grep HugePages $ORACLE_BASE/diag/rdbms/*/*/trace/alert*.log

Expected message:

Huge Pages allocation successful

✅ Best Practices

✔ Always leave some RAM for OS (not 100% HugePages)
✔ Use static SGA (ASMM)
✔ Set HugePages slightly higher than requirement
✔ Monitor with:

vmstat
free -g

⚠️ Common Mistakes

❌ Using AMM (memory_target)
❌ Not setting memlock
❌ THP not disabled
❌ Underestimating number of HugePages


🚀 Quick Example Summary

For a 64 GB SGA:

SettingValue
HugePage size2 MB
Required pages32768
memlock67108864 KB
vm.nr_hugepages32768




Oracle Script for huge_page setting 

vi hugepages_settings.sh 


#!/bin/bash 
# hugepages_settings.sh 
# Linux bash script to compute values for the 
# recommended HugePages/HugeTLB configuration 
# on Oracle Linux 
# Note: This script does calculation for all shared memory 
# segments available when the script is run, no matter it 
# is an Oracle RDBMS shared memory segment or not. 
# This script is provided by KB151310 from My Oracle Support 
# http://support.oracle.com 

# Welcome text 
echo " 
This script is provided by KB151310 from My Oracle Support 
(http://support.oracle.com) where it is intended to compute values for 
the recommended HugePages/HugeTLB configuration for the current shared 
memory segments on Oracle Linux. Before proceeding with the execution please note following: 
 * For ASM instance, it needs to configure ASMM instead of AMM. 
 * The 'pga_aggregate_target' is outside the SGA and 
   you should accommodate this while calculating the overall size. 
 * In case you changes the DB SGA size, 
   as the new SGA will not fit in the previous HugePages configuration, 
   it had better disable the whole HugePages, 
   start the DB with new SGA size and run the script again. 
And make sure that: 
 * Oracle Database instance(s) are up and running 
 * Oracle Database Automatic Memory Management (AMM) is not setup 
   (See KB83222) 
 * The shared memory segments can be listed by command: 
     # ipcs -m 


Press Enter to proceed..." 

read 

# Check for the kernel version 
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'` 

# Find out the HugePage size 
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk '{print $2}'` 
if [ -z "$HPG_SZ" ];then 
    echo "The hugepages may not be supported in the system where the script is being executed." 
    exit 1 
fi 

# Initialize the counter 
NUM_PG=0 

# Cumulative number of pages required to handle the running shared memory segments 
for SEG_BYTES in `ipcs -m | cut -c44-300 | awk '{print $1}' | grep "[0-9][0-9]*"` 
do 
    MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q` 
    if [ $MIN_PG -gt 0 ]; then 
        NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q` 
    fi 
done 

RES_BYTES=`echo "$NUM_PG * $HPG_SZ * 1024" | bc -q` 

# An SGA less than 100MB does not make sense 
# Bail out if that is the case 
if [ $RES_BYTES -lt 100000000 ]; then 
    echo "***********" 
    echo "** ERROR **" 
    echo "***********" 
    echo "Sorry! There are not enough total of shared memory segments allocated for 
HugePages configuration. HugePages can only be used for shared memory segments 
that you can list by command: 

    # ipcs -m 

of a size that can match an Oracle Database SGA. Please make sure that: 
 * Oracle Database instance is up and running 
 * Oracle Database Automatic Memory Management (AMM) is not configured" 
    exit 1 
fi 

# Finish with results 
    echo "Recommended setting: vm.nr_hugepages = $NUM_PG"; 

# End



Monday, May 18, 2026

step‑by‑step guide to check CPU sockets, cores, threads, and multithreading (Hyper‑Threading/SMT)

Step‑by‑step guide to check CPU sockets, cores, threads, and multithreading (Hyper‑Threading/SMT)

✅ 1. BASIC UNDERSTANDING (IMPORTANT)

TermMeaning
SocketPhysical CPU installed on motherboard
CorePhysical processing unit inside CPU
ThreadLogical CPU (via Hyper‑Threading / SMT)
Multithreading Enabled?If Threads > Cores

👉 Example:

  • 1 socket, 4 cores, 8 threads → Hyper‑Threading ENABLED
  • 1 socket, 4 cores, 4 threads → Hyper‑Threading DISABLED

🐧 2. LINUX – STEP BY STEP

🔹 Step 1: Check CPU details

Run:

lscpu

Example output:

CPU(s):              8
Socket(s):           1
Core(s) per socket:  4
Thread(s) per core:  2

Interpret:

  • Sockets = 1
  • Cores = 4
  • Threads = 8
  • Thread per core = 2 → Multithreading ENABLED ✅

🔹 Step 2: Check using /proc

cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l

👉 Gives number of sockets

cat /proc/cpuinfo | grep "cpu cores" | uniq

👉 Shows cores per socket

cat /proc/cpuinfo | grep "processor" | wc -l

👉 Total logical CPUs (threads)


🔹 Step 3: Quick single command summary

nproc

👉 gives total threads


🔹 Step 4: Check Hyper‑Threading explicitly

lscpu | grep "Thread"

If:

Thread(s) per core: 2

👉 ✅ Hyper‑Threading ON

If:

Thread(s) per core: 1

👉 ❌ OFF


🪟 3. WINDOWS – STEP BY STEP

🔹 Step 1: Task Manager (GUI method)

  1. Press Ctrl + Shift + Esc
  2. Go to Performance tab → CPU
  3. Look at:
Sockets: X
Cores: Y
Logical processors: Z

👉 Example:

  • Cores = 4
  • Logical processors = 8
    ✅ Multithreading ENABLED

🔹 Step 2: Command Prompt

Run:

wmic cpu get NumberOfCores,NumberOfLogicalProcessors

Output:

NumberOfCores  NumberOfLogicalProcessors
4              8

👉 Threads (LogicalProcessors) > Cores
✅ Hyper‑Threading ON


🔹 Step 3: PowerShell (recommended)

Get-WmiObject Win32_Processor | Select NumberOfCores, NumberOfLogicalProcessors, SocketDesignation

OR modern cmdlet:

Get-CimInstance Win32_Processor | Select NumberOfCores,NumberOfLogicalProcessors



🔹 Step 4: System Information

Run:

msinfo32

Check:

  • Processor
  • Logical processors

🔍 4. HOW TO TELL IF MULTITHREADING IS ENABLED

✅ Rule:

If Threads > Cores → ENABLED
If Threads = Cores → DISABLED

Example:

CoresThreadsStatus
48✅ Enabled
816✅ Enabled
44❌ Disabled

🔧 5. BIOS LEVEL CHECK (Important)

Even if OS shows cores/threads, final control is in BIOS.

Steps:

  1. Reboot system
  2. Enter BIOS (F2 / DEL / ESC)
  3. Look for:
    • Intel Hyper‑Threading
    • AMD SMT (Simultaneous Multithreading)

Settings:

  • Enabled ✅ → uses threads
  • Disabled ❌ → only physical cores

✅ 6. QUICK CHEAT SHEET

Linux

lscpu

Windows

wmic cpu get NumberOfCores,NumberOfLogicalProcessors


🚀 FINAL SUMMARY

PlatformCommandWhat to Check
LinuxlscpuThreads per core
Linux/proc/cpuinfocores, sockets
WindowsTask ManagerLogical processors
Windowswmiccores vs threads
    



SQL> SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('NUM_CPUS','NUM_CPU_CORES','NUM_CPU_SOCKETS');
``  2    3
STAT_NAME                           VALUE
------------------------------ ----------
NUM_CPUS                                8
NUM_CPU_CORES                           8
NUM_CPU_SOCKETS                         2

SQL>


Interpretation:

NUM_CPUS = logical CPUs visible to the OS (includes HT threads)
NUM_CPU_CORES = physical cores
If NUM_CPUS > NUM_CPU_CORES → SMT/HT enabled
If NUM_CPUS = NUM_CPU_CORES → SMT/HT disabled



SQL> !cat /proc/cpuinfo | egrep "processor|cpu cores|siblings|physical id" 

physical id     : 0
siblings        : 4
cpu cores       : 4
processor       : 1

cpu cores = physical cores per socket
siblings = logical CPUs per socket

If siblings > cpu cores → SMT/HT enabled
If siblings == cpu cores → SMT/HT disabled




Friday, May 15, 2026

Compare ELK Stack vs Splunk vs Prometheus + Grafana

 

🔷 1. What Each Stack Represents

StackComponentsFocus Area
ELK StackElasticsearch + Logstash + Kibana (+ Beats)Logging & search
SplunkSingle integrated platformLogging + analytics + security
Prometheus + GrafanaPrometheus + Alertmanager + GrafanaMetrics & monitoring

🔷 2. Architecture Overview

✅ ELK Stack (Open-source logging stack)

Sources → Beats/Logstash → Elasticsearch → Kibana
  • Logstash/Beats → Collect logs
  • Elasticsearch → Store & index logs
  • Kibana → Visualize logs

👉 Fully open-source (except Elastic licensing changes in newer versions)


✅ Splunk (Enterprise platform)

Sources → Forwarders → Splunk Indexer → Splunk UI
  • Collects logs via agents
  • Indexes data internally
  • Built-in dashboards + query engine

👉 Everything in one ecosystem


✅ Prometheus + Grafana (Monitoring stack)

Applications → Prometheus → Grafana → Alerts
  • Prometheus → collects metrics (pull model)
  • Grafana → visualization
  • Alertmanager → alerts

👉 Works best for real-time system health metrics


🔷 3. Core Differences (Very Important)

🔹 Data Type Focus

ToolData Type
PrometheusMetrics (numbers, time-series)
ELKLogs (text, JSON, events)
SplunkLogs + Events + Metrics

🔹 Ease of Use

ToolComplexity
Splunk✅ Easiest (plug-and-play)
ELK⚠️ Moderate (setup + tuning needed)
Prometheus✅ Easy for metrics, not logs

🔹 Cost

ToolCost
Prometheus + Grafana✅ Free
ELK Stack✅ Mostly free (some paid features)
Splunk❌ Very expensive (license based on data volume)

🔹 Query Language

ToolQuery
PrometheusPromQL
ELKLucene / KQL
SplunkSPL (very powerful)

🔷 4. When to Use What (Real Scenarios)

✅ Use Prometheus + Grafana when:

  • Monitoring DB performance (CPU, IO, connections)
  • Kubernetes / microservices environment
  • Need real-time alerting

👉 Example for you:

  • Oracle DB metrics
  • MongoDB performance monitoring
  • API latency tracking

✅ Use ELK Stack when:

  • You want centralized logging
  • Need search + troubleshooting
  • Want open-source flexibility

👉 Example:

  • DB audit logs
  • Application logs
  • Slow query logs

✅ Use Splunk when:

  • Enterprise-grade security + compliance (SOX, audit)
  • Need correlation across logs + events
  • Budget is not a constraint

👉 Example:

  • Audit trails (critical for SOX 👀)
  • Fraud detection
  • Security monitoring (SIEM)

🔷 5. Feature Comparison

FeatureELKSplunkPrometheus
Log Management
Metrics Monitoring⚠️ Limited
Visualization✅ (Kibana)✅ (Grafana)
Alerting
AI/ML Insights⚠️ Limited✅ Strong⚠️ Basic
Scalability✅ High✅ Very High✅ High
Setup Effort⚠️ Medium✅ Easy✅ Easy

🔷 6. Real-World Architecture (Recommended)

As a Database Architect, best practice is NOT to choose one—combine them:

✅ Modern Observability Stack

                ┌──────────────┐
                │ Applications │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        │                             │
   Metrics                        Logs
        │                             │
Prometheus                    ELK / Splunk
        │                             │
   Grafana                     Kibana / Splunk UI

🔷 7. Practical Recommendation for YOU

Since you're aiming toward Databricks / modern architecture, here’s a strong stack:

✅ Ideal Stack:

  • Prometheus + Grafana → Monitoring
  • ELK (or OpenSearch) → Logging (cost-effective)
  • Splunk (optional) → Only if:
    • You need SOX compliance
    • Heavy audit requirements

🔷 8. Quick Analogy

ToolAnalogy
PrometheusHeart rate monitor
GrafanaDashboard display
ELKDetailed diary/logbook
SplunkIntelligent investigator

✅ Final Summary (Simple)

  • Prometheus + Grafana → Monitoring (numbers)
  • ELK → Logging (open source)
  • Splunk → Enterprise logging + security

🔥 Pro Tip (Important for your role)

Given your SOX + DB architecture interest, focus on:

  • Prometheus → DB health monitoring
  • ELK/Splunk → audit logging + compliance
  • Grafana → unified dashboard for both

What and where we use Grafana, Splunk, and Prometheus ?

Grafana, Splunk, and Prometheus are all popular tools used in monitoring, logging, and observability, but they serve different purposes and are often used together rather than being direct replacements.

Here’s a clear breakdown:


🔹 1. High-Level Difference

ToolPrimary PurposeWhat it Does Best
GrafanaVisualization & dashboardsDisplays data from multiple sources
PrometheusMetrics collection & monitoringStores and queries time-series metrics
SplunkLog management & analyticsCentralized log search & analysis

🔹 2. Detailed Explanation

✅ Grafana

  • Role: Visualization layer
  • Type: Dashboarding tool
  • Core Function:
    Displays data from sources like Prometheus, Elasticsearch, SQL DBs, etc.

Key Features:

  • Rich dashboards & graphs
  • Alerts & notifications
  • Supports multiple data sources
  • Great UI for business + tech users

Example:

You can use Grafana to:

  • Show CPU usage graphs
  • Visualize API latency trends
  • Create executive dashboards

👉 Important: Grafana does NOT store data itself (except limited cases). It reads from data sources.


✅ Prometheus

  • Role: Monitoring & metrics collection
  • Type: Time-series database
  • Core Function:
    Collects numerical metrics from systems and stores them.

Key Features:

  • Pull-based monitoring (scrapes metrics from endpoints)
  • Uses PromQL (query language)
  • Built-in alerting (Alertmanager)
  • Ideal for Kubernetes / cloud-native setups

Example Metrics:

  • CPU usage
  • Memory consumption
  • Request rates
  • Error counts

👉 Prometheus is focused on metrics only, not logs.


✅ Splunk

  • Role: Log management & analytics platform
  • Type: Enterprise data platform
  • Core Function:
    Collects, indexes, and analyzes machine-generated data (logs).

Key Features:

  • Powerful log search (SPL – Search Processing Language)
  • Real-time log monitoring
  • Security (SIEM), compliance, and audit use cases
  • Handles structured & unstructured data

Example Logs:

  • Application logs
  • Server logs
  • Security logs
  • Audit trails

👉 Splunk is not just monitoring—it’s full-scale data analytics + security platform.


🔹 3. Key Differences (Simple View)

FeatureGrafanaPrometheusSplunk
Stores Data❌ No (mostly)✅ Yes✅ Yes
Data TypeVisual outputMetrics (numbers)Logs + events
Query LanguageDepends on sourcePromQLSPL
Use CaseDashboardsMonitoring systemsLog analysis & security
CostFree/Open SourceFree/Open SourcePaid (expensive)

🔹 4. How They Work Together

A very common real-world setup:

Prometheus → collects metrics
       ↓
Grafana → visualizes metrics

And separately:

Applications → send logs → Splunk → analyze/search logs

👉 In modern architectures:

  • Prometheus + Grafana = Monitoring stack
  • Splunk = Logging + Security analytics

🔹 5. When to Use What

Use Prometheus when:

  • You need system/service monitoring
  • You work with Kubernetes or microservices
  • You need metrics-based alerting

Use Grafana when:

  • You want beautiful dashboards
  • You need to visualize multiple data sources
  • You want business + technical insights

Use Splunk when:

  • You need centralized logging
  • You need security monitoring (SIEM)
  • You want deep log analysis and auditing

🔹 6. Easy Analogy

Think of a car dashboard:

  • Prometheus → collects speed, fuel, engine data
  • Grafana → shows it neatly on the dashboard
  • Splunk → records everything that happened (every trip, error, warning)

✅ Final Summary

  • Grafana → “Show me the data”
  • Prometheus → “Collect and store metrics”
  • Splunk → “Analyze logs and events”

Setup and design database architecture to meet SLA - 99.9 , 99.99 , 99.999

Achieving distinct Service Level Agreements (SLAs) for a database requires scaling redundancy and infrastructure complexity.  The three-nine...