Database Query Optimization Techniques
Your application is slow. Users are complaining. You check the metrics and discover the painful truth: 80% of your response time is waiting for the database.
I’ve been there. Multiple times. And I’ve learned that database optimization isn’t about memorizing SQL tricks - it’s about understanding how databases think, what they’re doing behind the scenes, and how to work with them instead of against them.
Today I’m sharing the database optimization techniques that have saved my projects from performance disasters.
Understanding Query Execution
Before we optimize, we need to understand what’s actually happening when you run a query.
The Query Execution Pipeline
Your SQL Query
↓
Parser (syntax check)
↓
Planner/Optimizer (creates execution plan)
↓
Executor (runs the plan)
↓
Result
The key insight: The database doesn’t just “run” your query. It analyzes it, considers multiple strategies, estimates costs, and chooses what it thinks is the best approach.
Sometimes it’s wrong. Our job is to help it make better decisions.
Reading EXPLAIN Plans
This is the single most important skill for database optimization.
PostgreSQL Example:
EXPLAIN ANALYZE
SELECT users.name, COUNT(posts.id) as post_count
FROM users
LEFT JOIN posts ON users.id = posts.user_id
WHERE users.created_at > '2025-01-01'
GROUP BY users.id, users.name
ORDER BY post_count DESC
LIMIT 10;
Output:
Limit  (cost=52341.24..52341.26 rows=10 width=24) (actual time=423.156..423.159 rows=10 loops=1)
  ->  Sort  (cost=52341.24..52466.37 rows=50050 width=24) (actual time=423.154..423.156 rows=10 loops=1)
        Sort Key: (count(posts.id)) DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=50217.37..50717.87 rows=50050 width=24) (actual time=418.234..421.445 rows=48234 loops=1)
              Group Key: users.id
              ->  Hash Left Join  (cost=12453.23..45217.12 rows=500025 width=16) (actual time=85.234..312.456 rows=500025 loops=1)
                    Hash Cond: (posts.user_id = users.id)
                    ->  Seq Scan on posts  (cost=0.00..18234.25 rows=500025 width=8) (actual time=0.034..98.234 rows=500025 loops=1)
                    ->  Hash  (cost=11234.50..11234.50 rows=50050 width=16) (actual time=85.123..85.123 rows=48234 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 2891kB
                          ->  Seq Scan on users  (cost=0.00..11234.50 rows=50050 width=16) (actual time=0.023..78.234 rows=48234 loops=1)
                                Filter: (created_at > '2025-01-01'::date)
                                Rows Removed by Filter: 12234
Planning Time: 2.345 ms
Execution Time: 423.234 ms
What to look for:
- Sequential Scans - Reading entire table (slow for large tables)
- Actual vs. estimated rows - If they differ wildly, the planner's statistics are stale (see the ANALYZE snippet after this list)
- High cost operations - Where most time is spent
- Rows removed by filter - Wasted work
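If the estimated and actual row counts disagree badly, refresh the planner's statistics before doing anything clever. A minimal check, using the tables from the example above:
ANALYZE users;
ANALYZE posts;
-- Then re-run the plan with buffer statistics to see how much data each node actually touches
EXPLAIN (ANALYZE, BUFFERS)
SELECT users.name, COUNT(posts.id) as post_count
FROM users
LEFT JOIN posts ON users.id = posts.user_id
WHERE users.created_at > '2025-01-01'
GROUP BY users.id, users.name
ORDER BY post_count DESC
LIMIT 10;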
Indexing Strategies
Indexes are your first and most powerful optimization tool. But they’re not magic - you need to use them correctly.
When to Add an Index
Add indexes for:
- Columns in WHERE clauses
- Columns in JOIN conditions
- Columns in ORDER BY
- Columns in GROUP BY
- Columns with high cardinality (many unique values) - see the cardinality check after these lists
Don’t index:
- Small tables (< 1000 rows)
- Columns with low cardinality (few unique values)
- Columns that change frequently in write-heavy tables
- Columns you never query
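Not sure whether a column's cardinality justifies an index? A quick check like this helps - the table and column here (orders.status) are placeholders for your own schema:
SELECT
COUNT(DISTINCT status) AS distinct_values,
COUNT(*) AS total_rows,
ROUND(COUNT(DISTINCT status) * 100.0 / COUNT(*), 2) AS selectivity_pct
FROM orders;
-- Close to 100% means high cardinality (a good B-tree candidate);
-- close to 0% means the index will rarely be worth its write cost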
Index Types
B-Tree Index (Default)
Best for equality and range queries.
-- Single column index
CREATE INDEX idx_users_email ON users(email);
-- Multi-column index (column order matters!)
CREATE INDEX idx_posts_user_date ON posts(user_id, created_at);
-- This index can be used for:
-- WHERE user_id = 123
-- WHERE user_id = 123 AND created_at > '2025-01-01'
-- WHERE user_id = 123 ORDER BY created_at
-- But NOT efficiently for:
-- WHERE created_at > '2025-01-01' (doesn't use first column)
Partial Index
Index only a subset of rows.
-- Only index active users
CREATE INDEX idx_active_users_email ON users(email) WHERE active = true;
-- Only index recent posts (partial-index predicates must use immutable expressions,
-- so NOW() isn't allowed here; use a fixed cutoff and recreate the index periodically)
CREATE INDEX idx_recent_posts ON posts(created_at) WHERE created_at > '2025-01-01';
-- Much smaller index, faster updates, perfect for common queries
Covering Index (INCLUDE)
Include additional columns in the index to avoid table lookups.
-- Index on email, but include name and created_at
CREATE INDEX idx_users_email_covering ON users(email) INCLUDE (name, created_at);
-- This query can be satisfied entirely from the index (no table lookup):
SELECT email, name, created_at FROM users WHERE email = 'user@example.com';
Expression Index
Index computed values.
-- Index lowercase email for case-insensitive searches
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now this is fast:
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
-- Index JSONB field
CREATE INDEX idx_settings_timezone ON users((settings->>'timezone'));
Full-Text Search Index
For text search operations.
-- Add tsvector column
ALTER TABLE articles ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || content)) STORED;
-- Create GIN index
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);
-- Fast full-text search
SELECT * FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'database optimization');
GiST/GIN Index
For complex data types (arrays, JSON, geometry).
-- Index for array containment queries
CREATE INDEX idx_tags_gin ON posts USING GIN(tags);
-- Fast array queries:
SELECT * FROM posts WHERE tags @> ARRAY['postgresql', 'performance'];
-- Index for JSONB
CREATE INDEX idx_metadata_gin ON events USING GIN(metadata);
-- Fast JSON queries:
SELECT * FROM events WHERE metadata @> '{"status": "completed"}';
Index Maintenance
Indexes aren’t set-it-and-forget-it.
Monitor Index Usage:
-- Find unused indexes
SELECT
schemaname,
relname as table_name,
indexrelname as index_name,
idx_scan as index_scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Drop unused indexes:
-- DROP INDEX idx_unused_index;
Monitor Index Health:
-- List index sizes and scan counts (large, rarely scanned indexes are rebuild candidates)
SELECT
schemaname,
relname as table_name,
indexrelname as index_name,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild bloated indexes:
-- REINDEX INDEX CONCURRENTLY idx_bloated_index;
Query Optimization Patterns
The N+1 Query Problem
The most common performance killer.
Bad (N+1 queries):
// Get all users
const users = await db.query('SELECT * FROM users LIMIT 100');
// For each user, get their posts (100 queries!)
for (const user of users) {
user.posts = await db.query('SELECT * FROM posts WHERE user_id = $1', [user.id]);
}
Good (2 queries):
// Get all users
const users = await db.query('SELECT * FROM users LIMIT 100');
const userIds = users.map(u => u.id);
// Get all posts in one query
const posts = await db.query(
'SELECT * FROM posts WHERE user_id = ANY($1)',
[userIds]
);
// Group posts by user
const postsByUser = posts.reduce((acc, post) => {
if (!acc[post.user_id]) acc[post.user_id] = [];
acc[post.user_id].push(post);
return acc;
}, {});
// Attach posts to users
users.forEach(user => {
user.posts = postsByUser[user.id] || [];
});
Best (1 query with JOIN):
SELECT
users.*,
-- COALESCE + FILTER give users with no posts an empty array instead of [null]
COALESCE(
json_agg(
json_build_object(
'id', posts.id,
'title', posts.title,
'created_at', posts.created_at
)
) FILTER (WHERE posts.id IS NOT NULL),
'[]'::json
) as posts
FROM users
LEFT JOIN posts ON users.id = posts.user_id
WHERE users.active = true
GROUP BY users.id
LIMIT 100;
Pagination Done Right
Wrong approach (OFFSET):
-- Gets slower as you paginate deeper
SELECT * FROM posts
ORDER BY created_at DESC
LIMIT 50 OFFSET 10000; -- Has to scan 10,050 rows to skip first 10,000
Right approach (Cursor-based):
-- First page
SELECT * FROM posts
ORDER BY created_at DESC, id DESC
LIMIT 50;
-- Next page (using last row's values)
SELECT * FROM posts
WHERE (created_at, id) < ('2025-10-20 15:30:00', 12345)
ORDER BY created_at DESC, id DESC
LIMIT 50;
-- Always scans only 50 rows, regardless of page depth
Implementation:
async function getPosts(cursor = null, limit = 50) {
let query = 'SELECT * FROM posts';
let params = [];
if (cursor) {
// cursor = base64encode(JSON.stringify({created_at, id}))
const decoded = JSON.parse(Buffer.from(cursor, 'base64').toString());
query += ' WHERE (created_at, id) < ($1, $2)';
params = [decoded.created_at, decoded.id];
}
query += ' ORDER BY created_at DESC, id DESC LIMIT $' + (params.length + 1);
params.push(limit);
const posts = await db.query(query, params);
// Generate next cursor
const nextCursor = posts.length === limit
? Buffer.from(JSON.stringify({
created_at: posts[posts.length - 1].created_at,
id: posts[posts.length - 1].id
})).toString('base64')
: null;
return { posts, nextCursor };
}
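Keyset pagination only stays fast if an index matches the sort order. Assuming the posts table from the queries above, a supporting index looks like this:
-- A plain composite index; PostgreSQL scans it backwards to satisfy the DESC ordering
-- and uses the (created_at, id) row comparison as the scan boundary
CREATE INDEX idx_posts_created_at_id ON posts (created_at, id);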
Counting Efficiently
Slow (accurate but slow):
-- Scans entire table
SELECT COUNT(*) FROM posts WHERE category = 'tech';
Fast (approximate):
-- Use table statistics (fast but approximate)
SELECT reltuples::BIGINT AS estimate
FROM pg_class
WHERE relname = 'posts';
-- For filtered counts, use EXPLAIN:
EXPLAIN SELECT COUNT(*) FROM posts WHERE category = 'tech';
-- Extract row estimate from output
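If you want that estimate programmatically rather than by eyeballing EXPLAIN output, a small helper in this spirit is common. Treat it as a sketch, not production code - it EXECUTEs whatever SQL string you hand it, so only call it with trusted queries:
-- Returns the planner's row estimate for an arbitrary query
CREATE OR REPLACE FUNCTION count_estimate(query text)
RETURNS bigint AS $$
DECLARE
plan json;
BEGIN
EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql;
-- Usage:
SELECT count_estimate('SELECT * FROM posts WHERE category = ''tech''');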
Better (cached count):
-- Maintain counts in a separate table
CREATE TABLE category_stats (
category VARCHAR(50) PRIMARY KEY,
post_count INTEGER DEFAULT 0,
updated_at TIMESTAMP DEFAULT NOW()
);
-- Update via trigger or periodic job
CREATE OR REPLACE FUNCTION update_category_count()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
INSERT INTO category_stats (category, post_count)
VALUES (NEW.category, 1)
ON CONFLICT (category)
DO UPDATE SET post_count = category_stats.post_count + 1;
ELSIF TG_OP = 'DELETE' THEN
UPDATE category_stats
SET post_count = post_count - 1
WHERE category = OLD.category;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER posts_category_count
AFTER INSERT OR DELETE ON posts
FOR EACH ROW EXECUTE FUNCTION update_category_count();
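One caveat: the trigger above only fires on INSERT and DELETE. If posts.category can ever change, you also need an UPDATE path - roughly something like this (trigger name illustrative), plus an ELSIF TG_OP = 'UPDATE' branch in update_category_count() that decrements OLD.category and increments NEW.category:
CREATE TRIGGER posts_category_count_update
AFTER UPDATE OF category ON posts
FOR EACH ROW
WHEN (OLD.category IS DISTINCT FROM NEW.category)
EXECUTE FUNCTION update_category_count();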
Batching Operations
Bad (N queries):
for (let user of users) {
await db.query(
'INSERT INTO audit_log (user_id, action) VALUES ($1, $2)',
[user.id, 'login']
);
}
Good (1 query):
// Fine for trusted numeric IDs, but interpolating values into SQL is an injection risk for arbitrary input
const values = users.map(u => `(${u.id}, 'login')`).join(',');
await db.query(`
INSERT INTO audit_log (user_id, action)
VALUES ${values}
`);
Better (using parameterized batch insert):
// PostgreSQL
const values = users.map((u, i) => `($${i*2+1}, $${i*2+2})`).join(',');
const params = users.flatMap(u => [u.id, 'login']);
await db.query(
`INSERT INTO audit_log (user_id, action) VALUES ${values}`,
params
);
Advanced Optimization Techniques
Query Materialization
Pre-compute expensive queries.
-- Create materialized view
CREATE MATERIALIZED VIEW user_post_stats AS
SELECT
users.id,
users.name,
COUNT(posts.id) as post_count,
MAX(posts.created_at) as last_post_at,
AVG(posts.view_count) as avg_views
FROM users
LEFT JOIN posts ON users.id = posts.user_id
GROUP BY users.id, users.name;
-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX idx_user_post_stats_id ON user_post_stats(id);
-- Index the columns you query
CREATE INDEX idx_user_post_stats_count ON user_post_stats(post_count DESC);
-- Refresh periodically (CONCURRENTLY keeps the view readable during refresh)
REFRESH MATERIALIZED VIEW CONCURRENTLY user_post_stats;
-- Now queries are instant:
SELECT * FROM user_post_stats ORDER BY post_count DESC LIMIT 10;
Automatic Refresh:
-- Create refresh function
CREATE OR REPLACE FUNCTION refresh_user_stats()
RETURNS void AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY user_post_stats;
END;
$$ LANGUAGE plpgsql;
-- Schedule with pg_cron
SELECT cron.schedule('refresh-user-stats', '*/15 * * * *', 'SELECT refresh_user_stats()');
Partial Data Loading
Don’t load what you don’t need.
Bad:
-- Loading huge JSONB field
SELECT * FROM documents WHERE id = 123;
Good:
-- Load only needed fields
SELECT id, title, metadata->>'author' as author FROM documents WHERE id = 123;
-- Load JSONB field separately if needed
SELECT content FROM documents WHERE id = 123;
Connection Pooling
Database connections are expensive to create.
const { Pool } = require('pg');
const pool = new Pool({
host: 'localhost',
database: 'mydb',
user: 'user',
password: 'password',
max: 20, // Maximum pool size
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 2000, // Fail fast if no connection available
});
// Usage
async function getUser(id) {
const client = await pool.connect();
try {
const result = await client.query('SELECT * FROM users WHERE id = $1', [id]);
return result.rows[0];
} finally {
client.release();
}
}
Read Replicas
Distribute read load across replica databases. Keep in mind that replicas lag slightly behind the primary, so read-after-write flows should go to the primary.
const masterPool = new Pool({ /* master config */ });
const replicaPool = new Pool({ /* replica config */ });
async function executeQuery(query, params, options = {}) {
const pool = options.writeable ? masterPool : replicaPool;
return pool.query(query, params);
}
// Reads from replica
const users = await executeQuery('SELECT * FROM users WHERE active = true', []);
// Writes to master
await executeQuery(
'INSERT INTO users (name, email) VALUES ($1, $2)',
['John', 'john@example.com'],
{ writeable: true }
);
Real-World Optimization Case Study
Let me walk through a recent optimization I did for a reporting dashboard.
The Problem
Dashboard query taking 45 seconds:
SELECT
DATE(created_at) as date,
COUNT(DISTINCT user_id) as unique_users,
COUNT(*) as total_events,
COUNT(*) FILTER (WHERE event_type = 'purchase') as purchases,
SUM(amount) FILTER (WHERE event_type = 'purchase') as revenue
FROM events
WHERE created_at BETWEEN '2024-01-01' AND '2025-12-31'
GROUP BY DATE(created_at)
ORDER BY date DESC;
- Table size: 500M rows
- Time: 45 seconds
- Users: angry
Step 1: Analyze the Query
EXPLAIN ANALYZE [query above]
-- Key findings:
-- Sequential scan on 500M rows
-- No index on created_at
-- Computing DATE() on every row
-- Multiple COUNT DISTINCT operations
Step 2: Add Basic Index
CREATE INDEX idx_events_created ON events(created_at);
-- Time: 45s → 32s (28% improvement)
-- Still not good enough
Step 3: Add Computed Date Column
-- Add date column
ALTER TABLE events ADD COLUMN event_date DATE;
-- Populate it (on a 500M-row table, run this in batches to avoid one huge transaction)
UPDATE events SET event_date = DATE(created_at);
-- Keep it updated
CREATE OR REPLACE FUNCTION set_event_date()
RETURNS TRIGGER AS $$
BEGIN
NEW.event_date := DATE(NEW.created_at);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER events_set_date
BEFORE INSERT OR UPDATE ON events
FOR EACH ROW EXECUTE FUNCTION set_event_date();
-- Index the date column
CREATE INDEX idx_events_date ON events(event_date);
-- Time: 32s → 18s (44% improvement from baseline)
-- Better, but still slow
Step 4: Create Materialized View
CREATE MATERIALIZED VIEW daily_event_stats AS
SELECT
event_date as date,
COUNT(DISTINCT user_id) as unique_users,
COUNT(*) as total_events,
COUNT(*) FILTER (WHERE event_type = 'purchase') as purchases,
SUM(amount) FILTER (WHERE event_type = 'purchase') as revenue
FROM events
GROUP BY event_date;
CREATE INDEX idx_daily_stats_date ON daily_event_stats(date DESC);
-- Query the materialized view
SELECT * FROM daily_event_stats
WHERE date BETWEEN '2024-01-01' AND '2025-12-31'
ORDER BY date DESC;
-- Time: 18s → 0.05s (99.9% improvement!)
Step 5: Incremental Updates
Instead of a full refresh, update incrementally. PostgreSQL doesn't allow INSERT or DELETE on a materialized view, so for this step daily_event_stats was rebuilt as a regular summary table (CREATE TABLE ... AS using the same query):
-- Function to update yesterday's stats
CREATE OR REPLACE FUNCTION refresh_daily_stats()
RETURNS void AS $$
BEGIN
-- Remove stats for the most recent day (covers yesterday and today)
DELETE FROM daily_event_stats
WHERE date >= CURRENT_DATE - INTERVAL '1 day';
-- Recompute them from the raw events
INSERT INTO daily_event_stats
SELECT
event_date as date,
COUNT(DISTINCT user_id) as unique_users,
COUNT(*) as total_events,
COUNT(*) FILTER (WHERE event_type = 'purchase') as purchases,
SUM(amount) FILTER (WHERE event_type = 'purchase') as revenue
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY event_date;
END;
$$ LANGUAGE plpgsql;
-- Run every hour
SELECT cron.schedule('refresh-stats', '0 * * * *', 'SELECT refresh_daily_stats()');
The Results
Performance:
- Query time: 45s → 0.05s (99.9% improvement)
- Dashboard load time: 50s → 0.2s
- Database CPU: 85% → 5%
Impact:
- Users actually use the dashboard now
- Can add more reports without performance concerns
- Database handles 10x more concurrent users
Trade-offs:
- Data delayed by up to 1 hour (acceptable for this use case)
- Additional storage for materialized view (minimal cost)
- Slightly more complex maintenance
Database-Specific Tips
PostgreSQL
Use pg_stat_statements:
-- Enable extension
CREATE EXTENSION pg_stat_statements;
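-- Note: pg_stat_statements only collects data once shared_preload_libraries = 'pg_stat_statements'
-- is set in postgresql.conf and the server has been restarted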
-- Find slowest queries
SELECT
round(total_exec_time::numeric, 2) AS total_time,
calls,
round(mean_exec_time::numeric, 2) AS mean,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Vacuum regularly:
-- Check table bloat
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
n_dead_tup,
n_live_tup,
round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;
-- Manual vacuum
VACUUM ANALYZE table_name;
-- Configure autovacuum for high-traffic tables
ALTER TABLE high_traffic_table SET (autovacuum_vacuum_scale_factor = 0.05);
MySQL
Use the Query Cache (MySQL 5.7 only - deprecated in 5.7.20 and removed in MySQL 8.0):
-- Enable query cache
SET GLOBAL query_cache_size = 67108864; -- 64MB
SET GLOBAL query_cache_type = ON;
-- Check cache stats
SHOW STATUS LIKE 'Qcache%';
Optimize Table:
-- Rebuild table and indexes
OPTIMIZE TABLE table_name;
Monitoring and Maintenance
Set up proper monitoring to catch issues early.
Key Metrics to Monitor
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Long-running queries
SELECT
pid,
now() - query_start as duration,
state,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 minutes'
ORDER BY duration DESC;
-- Database size
SELECT
pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS indexes_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
Automated Alerts
// Example monitoring script
async function checkDatabaseHealth() {
// Check for long-running queries
const longQueries = await db.query(`
SELECT count(*) as count
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < now() - interval '1 minute'
`);
if (longQueries.rows[0].count > 10) {
await alerting.send('Too many long-running queries');
}
// Check connection pool
const connections = await db.query(`
SELECT count(*) as count FROM pg_stat_activity
`);
if (connections.rows[0].count > 90) { // max is 100
await alerting.send('Connection pool nearly exhausted');
}
// Check for table bloat
const bloat = await db.query(`
SELECT count(*) as count
FROM pg_stat_user_tables
WHERE n_dead_tup > n_live_tup * 0.2
`);
if (bloat.rows[0].count > 0) {
await alerting.send('Tables need vacuuming');
}
}
// Run every 5 minutes
setInterval(checkDatabaseHealth, 5 * 60 * 1000);
Common Mistakes to Avoid
Using SELECT *
Bad:
SELECT * FROM users WHERE id = 123;
-- Loads all columns, including potentially large BLOBs
Good:
SELECT id, name, email FROM users WHERE id = 123;
-- Loads only needed data
Premature Optimization
Don’t add indexes before you have performance problems. Every index:
- Slows down writes
- Takes up disk space
- Needs maintenance
Start simple, measure, then optimize.
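In PostgreSQL, one way to ground that measurement is to see which tables are actually being read via sequential scans before adding anything:
-- Tables with the most sequentially-read rows are the first places to look
SELECT
relname AS table_name,
seq_scan,
seq_tup_read,
idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;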
Ignoring EXPLAIN
Never deploy a query without understanding its execution plan.
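In production, PostgreSQL's auto_explain module will log plans for slow queries automatically, so you see real execution plans without having to reproduce the query by hand:
-- Per-session example (set shared_preload_libraries = 'auto_explain' in postgresql.conf for cluster-wide logging)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms'; -- log plans for queries slower than 500 ms
SET auto_explain.log_analyze = on;           -- include actual row counts and timings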
Over-Indexing
I’ve seen tables with 20+ indexes. That’s usually a code smell.
If you have that many indexes, consider:
- Are you querying in too many different ways?
- Should you use a different data model?
- Do you need a search engine (Elasticsearch) instead?
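A quick way to spot tables that have accumulated a lot of indexes (the threshold of 10 is arbitrary, and 'public' assumes the default schema):
SELECT
tablename,
COUNT(*) AS index_count
FROM pg_indexes
WHERE schemaname = 'public'
GROUP BY tablename
HAVING COUNT(*) > 10
ORDER BY index_count DESC;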
Your Database Optimization Checklist
For Every Slow Query:
- Run EXPLAIN ANALYZE
- Check for sequential scans
- Verify statistics are up to date
- Add appropriate indexes
- Rewrite query if needed
- Measure improvement
For Production Databases:
- Monitor slow query log
- Set up index usage monitoring
- Configure autovacuum appropriately
- Use connection pooling
- Implement query timeouts (see the statement_timeout example after this list)
- Set up alerts for long queries
- Regular backup and recovery testing
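For the query-timeout item above, statement_timeout is the usual PostgreSQL mechanism; app_user below is a placeholder for your application's role:
-- Cancel any statement that runs longer than 5 seconds for the application role
ALTER ROLE app_user SET statement_timeout = '5s';
-- Or per session / transaction:
SET statement_timeout = '5s';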
The Bottom Line
Database optimization is a continuous process, not a one-time fix. The key principles:
- Measure first - Use EXPLAIN, don’t guess
- Index strategically - Not too few, not too many
- Rewrite queries - Sometimes the problem is the approach
- Cache when appropriate - Materialized views, application caching
- Monitor continuously - Catch issues before users complain
Master these techniques and you’ll never fear database performance problems again.
Part of the Developer Skills series focusing on technical excellence and performance optimization.
What’s your worst database performance story? Have you discovered optimization techniques that saved your project? Let me know - I learn from every developer’s experience!