Database Sharding Strategies for Scale: From Theory to Production
TL;DR
Database sharding is horizontal partitioning - splitting data across multiple database instances to scale beyond single-server limits. It’s powerful but complex. Shard when you’ve exhausted vertical scaling (bigger servers) and read replicas. Choose your shard key carefully - it determines data distribution and query patterns. Common strategies include range-based, hash-based, and geographic sharding. Implement routing logic, handle cross-shard queries, and plan for resharding. Sharding isn’t free - it adds complexity, but when done right, it enables effectively unbounded horizontal scale.
Key Concepts:
- Sharding = horizontal partitioning across multiple databases
- Shard key determines which database stores each record
- Range-based: Shard by value ranges (user IDs 1-1M, 1M-2M)
- Hash-based: Shard by hash of key (consistent hashing)
- Geographic: Shard by location (EU data, US data)
- Trade-offs: Complexity vs scalability
What is Sharding?
Sharding is splitting your database horizontally across multiple servers.
Vertical scaling: Bigger server (more CPU, RAM, storage)
Horizontal scaling: More servers (distribute data across them)
Single Database (No Sharding)
┌─────────────────────────┐
│    PostgreSQL Server    │
│                         │
│  All 100M user records  │
│  All 500M transactions  │
│  All application data   │
└─────────────────────────┘
Limits:
- CPU: Single server’s processing power
- RAM: Limited memory for caching
- Storage: Single server’s disk capacity
- I/O: Single disk’s throughput
Eventually you hit the ceiling.
Sharded Database
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Shard 1   │ │   Shard 2   │ │   Shard 3   │ │   Shard 4   │
│             │ │             │ │             │ │             │
│    Users    │ │    Users    │ │    Users    │ │    Users    │
│    1-25M    │ │   25M-50M   │ │   50M-75M   │ │  75M-100M   │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Benefits:
- 4x CPU capacity
- 4x RAM for caching
- 4x storage capacity
- 4x I/O throughput
Horizontal scale = add more shards.
When Do You Need Sharding?
Sharding is complex. Only do it when necessary.
Exhaust These First
1. Optimize queries
-- Add indexes
CREATE INDEX idx_users_email ON users(email);
-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
2. Vertical scaling
Upgrade to a bigger server:
- 8 cores → 32 cores
- 32GB RAM → 256GB RAM
- HDD → NVMe SSD
Modern servers are powerful. You can get far with vertical scaling.
3. Read replicas
        ┌──────────┐
        │ Primary  │  (Writes)
        └────┬─────┘
             │
      ┌──────┴───────┐
      │              │
┌─────▼────┐   ┌─────▼────┐
│ Replica 1│   │ Replica 2│  (Reads)
└──────────┘   └──────────┘
Scale reads horizontally. Most apps are read-heavy.
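To make this concrete, here's a minimal read/write-splitting sketch at the application level, assuming node-postgres (the pg package) and placeholder hostnames:

// Writes go to the primary, reads rotate across replicas (node-postgres).
// Hostnames are placeholders; credentials come from the environment.
const { Pool } = require('pg');

const primary = new Pool({ host: 'db-primary.example.com', database: 'app' });
const replicas = [
  new Pool({ host: 'db-replica-1.example.com', database: 'app' }),
  new Pool({ host: 'db-replica-2.example.com', database: 'app' })
];

// Writes always hit the primary
function write(sql, params) {
  return primary.query(sql, params);
}

// Reads are spread across replicas (round-robin)
let next = 0;
function read(sql, params) {
  const replica = replicas[next++ % replicas.length];
  return replica.query(sql, params);
}

One caveat: replication lag. A read that must see a write you just made should go to the primary.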
4. Caching
Application → Redis (cache) → Database
Cache hot data. For read-heavy workloads this can cut database load by 80% or more.
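A typical cache-aside sketch, assuming an ioredis-style Redis client and the same kind of query helper used in later snippets; the key format and TTL are illustrative:

// Cache-aside: check Redis, fall back to the database, then populate the
// cache with a TTL. redis and db are assumed client objects.
async function getUser(userId) {
  const cacheKey = `user:${userId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached); // Cache hit

  // Cache miss: load from the database and cache it for 5 minutes
  const rows = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  const user = rows[0];
  if (user) await redis.set(cacheKey, JSON.stringify(user), 'EX', 300);
  return user;
}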
5. Partitioning (single database)
-- PostgreSQL partitioning
CREATE TABLE orders (
id BIGSERIAL,
created_at TIMESTAMP,
...
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2025_q1 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
Partition within single database first.
Shard When
You need sharding when:
1. Data exceeds single server capacity
100M+ users
Billions of records
Multi-terabyte datasets
2. Write traffic exceeds single server
100K+ writes/second
High concurrent transactions
Write-heavy workload
3. Geographic distribution required
EU users → EU database (GDPR compliance)
US users → US database (latency)
Asia users → Asia database (performance)
4. Tenant isolation
Enterprise customers get dedicated shard
Security/compliance requirements
Don’t shard prematurely. It’s a one-way door. Hard to un-shard.
Sharding Strategies
Choosing the right strategy is critical.
1. Range-Based Sharding
Partition by value ranges.
Example: User ID ranges
Shard 1: user_id 1 to 25,000,000
Shard 2: user_id 25,000,001 to 50,000,000
Shard 3: user_id 50,000,001 to 75,000,000
Shard 4: user_id 75,000,001 to 100,000,000
Routing logic:
function getShardForUser(userId) {
  if (userId <= 25_000_000) return 'shard1';
  if (userId <= 50_000_000) return 'shard2';
  if (userId <= 75_000_000) return 'shard3';
  return 'shard4';
}
Pros:
- Simple to understand
- Easy to add new ranges (new shard for 100M-125M)
- Range queries stay on single shard
- Good for time-series data
Cons:
- Uneven distribution if data isn’t uniform
- Hotspots (newest users on latest shard get all writes)
- Requires planning ranges upfront
Best for:
- Time-series data (logs, events, transactions)
- Auto-incrementing IDs
- Sequential data
Example: Time-based sharding
Shard 1: transactions from 2023
Shard 2: transactions from 2024
Shard 3: transactions from 2025
Recent data (hot) on separate shard from historical (cold).
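The routing logic for time-based shards is just a lookup from timestamp to shard. A sketch, assuming one shard per calendar year and a hypothetical getConnection helper:

// Route a transaction to the shard for its calendar year.
// Shard names ('shard-2023', ...) and getConnection() are assumptions.
const TIME_SHARDS = {
  2023: 'shard-2023',
  2024: 'shard-2024',
  2025: 'shard-2025'
};

function getShardForTransaction(createdAt) {
  const year = new Date(createdAt).getUTCFullYear();
  const shard = TIME_SHARDS[year];
  if (!shard) throw new Error(`No shard configured for year ${year}`);
  return shard;
}

// Usage: new writes land on the hot (current-year) shard
// const conn = getConnection(getShardForTransaction(Date.now()));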
2. Hash-Based Sharding
Partition by hash of key.
Example: Hash user ID
function getShardForUser(userId) {
  const hash = murmurhash(userId);
  const shardIndex = hash % NUM_SHARDS;
  return `shard${shardIndex + 1}`;
}
Distribution:
User 123  → hash(123) % 4 = 2  → Shard 3
User 456  → hash(456) % 4 = 1  → Shard 2
User 789  → hash(789) % 4 = 3  → Shard 4
User 1000 → hash(1000) % 4 = 0 → Shard 1
Pros:
- Even distribution (no hotspots)
- Deterministic (same key always goes to same shard)
- No pre-planning of ranges
Cons:
- Resharding is painful (changing number of shards rehashes everything)
- Range queries span all shards
- Related data might be split
Best for:
- Evenly distributed keys
- Point lookups (get by ID)
- Avoiding hotspots
Consistent Hashing (Better):
// Use consistent hashing to minimize data movement when adding shards
const ring = new ConsistentHashRing(['shard1', 'shard2', 'shard3']);

function getShardForUser(userId) {
  return ring.getNode(userId);
}

// Adding shard4 only moves ~25% of data (1/4), not ~75% as with modulo rehashing
ring.addNode('shard4');
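ConsistentHashRing above is a stand-in rather than a specific library. A minimal sketch of what such a ring might look like (single position per shard, 32-bit FNV-1a hash; real rings add many virtual nodes per shard for smoother balance):

// Each shard gets a position on a 32-bit ring; a key belongs to the first
// shard clockwise from its own hash. Adding a shard only claims the keys
// between its position and its predecessor's.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class ConsistentHashRing {
  constructor(nodes = []) {
    this.ring = []; // [{ position, node }] sorted by position
    nodes.forEach(node => this.addNode(node));
  }

  addNode(node) {
    this.ring.push({ position: fnv1a(node), node });
    this.ring.sort((a, b) => a.position - b.position);
  }

  removeNode(node) {
    this.ring = this.ring.filter(entry => entry.node !== node);
  }

  getNode(key) {
    const position = fnv1a(String(key));
    // First node at or after the key's position, wrapping around to the start
    const entry = this.ring.find(e => e.position >= position) || this.ring[0];
    return entry.node;
  }
}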
3. Geographic Sharding
Partition by location.
Example: User location
Shard US-East: Users in North America
Shard EU-West: Users in Europe
Shard AP-South: Users in Asia-Pacific
Routing:
function getShardForUser(userId) {
  const user = getUserMetadata(userId);
  if (user.region === 'us') return 'shard-us-east';
  if (user.region === 'eu') return 'shard-eu-west';
  if (user.region === 'ap') return 'shard-ap-south';
  throw new Error(`No shard mapped for region: ${user.region}`);
}
Pros:
- Low latency (data near users)
- Compliance (GDPR, data sovereignty)
- Geographic fault isolation
Cons:
- Uneven distribution (more US users than EU)
- Cross-region queries are slow/complex
- Users who move between regions need their data migrated (see the sketch below)
Best for:
- Global applications
- Compliance requirements
- Latency-sensitive apps
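When a user does move regions, the migration is essentially copy, flip the routing metadata, then delete. The sketch below assumes every table is keyed by user_id, plus hypothetical getShardConnection and updateUserRegion helpers; a real migration also has to handle writes that arrive mid-copy (for example with a short per-user write freeze):

// Move one user's rows from their old region shard to the new one.
const USER_TABLES = ['users', 'orders', 'sessions']; // Illustrative list

async function migrateUserToRegion(userId, fromShard, toShard) {
  const source = getShardConnection(fromShard);
  const target = getShardConnection(toShard);

  // 1. Copy the user's rows to the new region
  for (const table of USER_TABLES) {
    const rows = await source.query(
      `SELECT * FROM ${table} WHERE user_id = $1`, [userId]
    );
    for (const row of rows) {
      await target.insert(table, row);
    }
  }

  // 2. Flip routing so new reads and writes go to the new shard
  await updateUserRegion(userId, toShard);

  // 3. Clean up the old copies
  for (const table of USER_TABLES) {
    await source.query(`DELETE FROM ${table} WHERE user_id = $1`, [userId]);
  }
}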
4. Entity/Tenant-Based Sharding
Partition by entity or tenant.
Example: Enterprise SaaS
Shard 1: Customers A-M
Shard 2: Customers N-Z
Shard 3: Large enterprise customers
Routing:
function getShardForCustomer(customerId) {
  const customer = getCustomer(customerId);
  if (customer.tier === 'enterprise') {
    return customer.dedicatedShard; // 'shard-acme-corp'
  }
  const firstLetter = customer.name[0].toUpperCase();
  if (firstLetter <= 'M') return 'shard1';
  return 'shard2';
}
Pros:
- Tenant isolation (security, compliance)
- Per-tenant optimization
- Easy to migrate individual tenants
Cons:
- Uneven load if tenants vary in size
- More complex routing
Best for:
- Multi-tenant SaaS
- B2B applications
- Customers requiring isolation
5. Hybrid Sharding
Combine strategies.
Example: Geographic + Hash
function getShard(userId, region) {
  // First shard by region
  let shards;
  if (region === 'us') shards = ['us-shard1', 'us-shard2'];
  else if (region === 'eu') shards = ['eu-shard1', 'eu-shard2'];
  else throw new Error(`No shards configured for region: ${region}`);
  // Then hash within the region
  const hash = murmurhash(userId);
  const index = hash % shards.length;
  return shards[index];
}
Best for: Complex requirements needing multiple dimensions.
Choosing a Shard Key
The shard key is the most critical decision.
Criteria for Good Shard Key
1. High cardinality: many unique values.
✅ user_id (millions of unique values)
✅ email (unique per user)
❌ status (only 'active' or 'inactive')
❌ country (only ~200 values)
2. Even distribution: data spread evenly across shards.
✅ UUID (random distribution)
✅ Hash of email
❌ Auto-increment ID (all new records on one shard)
❌ Timestamp (all recent data on one shard)
(Unless you're using range-based sharding intentionally, as with time-based shards.)
3. Query patterns align: most queries use the shard key.
-- Good: Query includes shard key
SELECT * FROM orders WHERE user_id = 123;
-- Can route to single shard
-- Bad: Query doesn't include shard key
SELECT * FROM orders WHERE status = 'pending';
-- Must query ALL shards
4. Immutable: the key doesn’t change.
✅ user_id (never changes)
✅ email (rarely changes, manageable)
❌ username (users change it)
❌ location (users move)
5. Accessible: the key is available when writing.
// Good: Have user_id when creating order
async function createOrder(userId, orderData) {
  const shard = getShardForUser(userId);
  await shard.insert('orders', { user_id: userId, ...orderData });
}

// Bad: Don't have shard key
function createGlobalEvent(eventData) {
  // Which shard? Event isn't tied to a user
}
Common Shard Keys
user_id: Most common. Aligns with query patterns (get user’s data).
tenant_id: Multi-tenant apps. Each customer’s data together.
geographic_region: Global apps. Data near users.
timestamp: Time-series data. Recent data hot, old data cold.
entity_id: Specific entity (order_id for orders, product_id for products).
Implementation Architecture
Routing Layer
Application needs to know which shard to query.
1. Application-level routing
class ShardedDatabase {
  constructor(shards) {
    this.shards = shards; // { shard1: connection, shard2: connection, ... }
  }

  getShardForUser(userId) {
    const hash = murmurhash(userId);
    const index = hash % Object.keys(this.shards).length;
    return this.shards[`shard${index + 1}`]; // Shard names are 1-based: shard1..shardN
  }

  async getUserById(userId) {
    const shard = this.getShardForUser(userId);
    return await shard.query('SELECT * FROM users WHERE id = $1', [userId]);
  }

  async getAllActiveUsers() {
    // Query all shards in parallel
    const results = await Promise.all(
      Object.values(this.shards).map(shard =>
        shard.query('SELECT * FROM users WHERE status = $1', ['active'])
      )
    );
    return results.flat();
  }
}
2. Proxy layer (Vitess, ProxySQL, etc.)
             Application
                  │
                  ▼
             ┌─────────┐
             │ Vitess  │  (Routing proxy)
             └────┬────┘
                  │
   ┌─────────┬────┴────┬─────────┐
   ▼         ▼         ▼         ▼
Shard1    Shard2    Shard3    Shard4
Application sends SQL to proxy. Proxy routes to correct shard.
Benefits:
- Centralized routing logic
- Connection pooling
- Query rewriting
- Transparent to application
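Because the proxy speaks the database's wire protocol (Vitess's VTGate, for example, speaks the MySQL protocol), the application connects to it like a single database. A sketch using the mysql2 client; host, port, user, and database names are placeholders:

// The app sees one endpoint; the proxy owns the shard map and the routing.
const mysql = require('mysql2/promise');

async function main() {
  const conn = await mysql.createConnection({
    host: 'vtgate.internal', // Proxy endpoint (placeholder)
    port: 3306,
    user: 'app',
    database: 'commerce'
  });

  // Looks like a normal query; the proxy sends it to the right shard(s)
  const [rows] = await conn.execute(
    'SELECT * FROM users WHERE user_id = ?', [123]
  );
  console.log(rows);
}

main();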
3. Database-native sharding (Citus for PostgreSQL)
-- Citus shards automatically
SELECT create_distributed_table('users', 'user_id');
-- Queries automatically routed
SELECT * FROM users WHERE user_id = 123; -- Single shard
SELECT * FROM users WHERE status = 'active'; -- All shards
Metadata Store
Track which shard contains what.
Shard map:
const shardMap = {
  'shard1': { range: [1, 25_000_000], host: 'db1.example.com' },
  'shard2': { range: [25_000_001, 50_000_000], host: 'db2.example.com' },
  'shard3': { range: [50_000_001, 75_000_000], host: 'db3.example.com' },
  'shard4': { range: [75_000_001, 100_000_000], host: 'db4.example.com' }
};
Store in:
- Configuration file (simple, needs restart to update)
- Redis/Consul (dynamic, highly available)
- Database (centralized, single source of truth)
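A sketch of the Redis option: load the shard map at startup and refresh it periodically, so resharding only means updating one key instead of redeploying. The key name, interval, and map shape (matching the shardMap above) are illustrative:

// Cache the shard map in memory and refresh it from Redis in the background.
const SHARD_MAP_KEY = 'config:shard-map';
let shardMap = {};

async function refreshShardMap(redis) {
  const raw = await redis.get(SHARD_MAP_KEY);
  if (raw) shardMap = JSON.parse(raw);
}

function getShardForUserId(userId) {
  for (const [name, entry] of Object.entries(shardMap)) {
    const [min, max] = entry.range;
    if (userId >= min && userId <= max) return { name, ...entry };
  }
  throw new Error(`No shard covers user_id ${userId}`);
}

// At startup, then every 30 seconds:
// await refreshShardMap(redis);
// setInterval(() => refreshShardMap(redis), 30_000);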
Data Distribution
Initial load:
# Distribute existing data to shards
# (illustrative only; for datasets this size use bulk export/import per shard)
for user_id in $(seq 1 100000000); do
  shard=$(calculate_shard $user_id)
  migrate_user_to_shard $user_id $shard
done
Ongoing writes:
async function createUser(userData) {
  const userId = generateUserId();
  const shard = getShardForUser(userId);
  await shard.insert('users', {
    id: userId,
    ...userData
  });
}
Cross-Shard Operations
The hard part of sharding.
Cross-Shard Queries
Problem:
-- Query spans all shards
SELECT * FROM users WHERE status = 'active' ORDER BY created_at DESC LIMIT 100;
Solution: Fan-out and merge
async function getActiveUsers() {
  // Query all shards in parallel; each shard returns its own top 100
  const shardResults = await Promise.all(
    shards.map(shard =>
      shard.query(
        'SELECT * FROM users WHERE status = $1 ORDER BY created_at DESC LIMIT 100',
        ['active']
      )
    )
  );
  // Merge, re-sort, and take the global top 100
  const allResults = shardResults.flat();
  allResults.sort((a, b) => b.created_at - a.created_at);
  return allResults.slice(0, 100);
}
Performance impact: Query time = slowest shard + merge time.
Cross-Shard Joins
Problem:
-- Users and orders on different shards
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed';
Solutions:
1. Denormalize (recommended)
// Store user name with the order
{
  order_id: 123,
  user_id: 456,
  user_name: 'John Doe', // Denormalized
  total: 99.99
}
2. Application-level join
// Get orders
const orders = await getAllShards('orders', { status: 'completed' });
// Get user data for each order
const userIds = [...new Set(orders.map(o => o.user_id))];
const users = await getUsersByIds(userIds); // Might hit multiple shards
// Join in application
const results = orders.map(order => ({
...order,
user_name: users.find(u => u.id === order.user_id).name
}));
3. Shard together
Shard 1: Users 1-25M + Their Orders
Shard 2: Users 25M-50M + Their Orders
Related data stays together, so joins happen within a single shard.
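In code, "shard together" just means routing child rows by the parent's shard key: orders are placed using user_id rather than order_id, so the earlier getShardForUser works for both tables and the join runs on one shard:

// Orders follow their user, so this join never leaves the shard.
async function getUserWithOrders(userId) {
  const shard = getShardForUser(userId);
  return await shard.query(
    `SELECT u.name, o.total
       FROM users u
       JOIN orders o ON o.user_id = u.id
      WHERE u.id = $1`,
    [userId]
  );
}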
Cross-Shard Transactions
Problem: ACID transactions across shards are hard/impossible.
Avoid if possible:
-- Bad: Transaction spans shards
BEGIN TRANSACTION;
UPDATE shard1.accounts SET balance = balance - 100 WHERE user_id = 1;
UPDATE shard2.accounts SET balance = balance + 100 WHERE user_id = 2;
COMMIT; -- Can't guarantee atomicity across shards!
Solutions:
1. Avoid cross-shard transactions
Shard so that transactions stay within a single shard:
Shard by account_id
Transfer within account → single shard
Transfer between accounts → may span shards
2. Two-phase commit (complex, slow)
Coordinator:
1. Prepare phase: Ask all shards if they can commit
2. Commit phase: If all say yes, commit; otherwise, rollback
Distributed transactions are slow and complex. Avoid.
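For reference, a bare-bones coordinator sketch. The prepare/commit/rollback calls are assumed shard-level helpers (in PostgreSQL they would map to PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED), and real 2PC also needs durable coordinator state, timeouts, and crash recovery:

// Two-phase commit, heavily simplified.
async function twoPhaseCommit(participants, txnId) {
  // Phase 1: ask every shard to prepare (vote yes/no)
  const votes = await Promise.all(
    participants.map(shard =>
      shard.prepare(txnId).then(() => true).catch(() => false)
    )
  );

  if (votes.every(Boolean)) {
    // Phase 2a: unanimous yes, commit everywhere
    await Promise.all(participants.map(shard => shard.commit(txnId)));
    return 'committed';
  }

  // Phase 2b: any no vote, roll back everywhere
  await Promise.all(participants.map(shard => shard.rollback(txnId)));
  return 'rolled back';
}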
3. Eventual consistency / Sagas
// Saga pattern
// Saga pattern
async function transferMoney(fromUser, toUser, amount) {
  // Step 1: Debit the sender
  await debitAccount(fromUser, amount);
  try {
    // Step 2: Credit the recipient
    await creditAccount(toUser, amount);
  } catch (error) {
    // Compensating transaction: refund the sender
    await creditAccount(fromUser, amount);
    throw error;
  }
}
Eventually consistent. Not ACID but acceptable for many use cases.
Resharding
Adding/removing shards or rebalancing data.
When to Reshard
- Shard is full (disk space)
- Shard is hot (too much load)
- Adding capacity (more shards)
- Rebalancing (uneven distribution)
Resharding Strategies
1. Stop-the-world (simple, downtime)
1. Stop writes
2. Redistribute data
3. Update shard map
4. Resume writes
Only for small datasets or maintenance windows.
2. Dual-write (zero downtime)
1. Start writing to both old and new shards
2. Backfill data from old to new shards
3. Verify data consistency
4. Switch reads to new shards
5. Stop writing to old shards
6. Decommission old shards
Process:
// Phase 1: Dual write
async function createUser(userData) {
  const userId = generateUserId();
  // Write to old shard
  const oldShard = getOldShard(userId);
  await oldShard.insert('users', { id: userId, ...userData });
  // Write to new shard
  const newShard = getNewShard(userId);
  await newShard.insert('users', { id: userId, ...userData });
}

// Phase 2: Backfill (batch this in practice; don't load every row at once)
async function backfillData() {
  for (const oldShard of oldShards) {
    const users = await oldShard.query('SELECT * FROM users');
    for (const user of users) {
      const newShard = getNewShard(user.id);
      await newShard.insert('users', user);
    }
  }
}

// Phase 3: Switch reads
async function getUser(userId) {
  const newShard = getNewShard(userId);
  return await newShard.query('SELECT * FROM users WHERE id = $1', [userId]);
}

// Phase 4: Stop old writes, decommission
3. Consistent hashing (minimal data movement)
// Adding shard4 to a 3-shard setup
// With consistent hashing only ~25% of keys move (the new shard's share);
// naive modulo rehashing would move roughly 75% of them
const ring = new ConsistentHashRing(['shard1', 'shard2', 'shard3']);
ring.addNode('shard4'); // Minimal redistribution
Monitoring and Operations
Key Metrics
Per-shard metrics:
- Query rate (QPS per shard)
- Latency (p50, p95, p99)
- CPU utilization
- Disk usage
- Connection count
- Replication lag (if replicas)
Identify:
- Hot shards (getting too much traffic)
- Slow shards (high latency)
- Full shards (disk space)
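A small sketch for collecting a couple of these per shard, assuming PostgreSQL shards and the same query helper used in earlier snippets (pg_database_size and pg_stat_activity are standard PostgreSQL; export the numbers to whatever dashboard you already use):

// Poll each shard for database size and connection count.
// QPS and latency percentiles usually come from the driver or a metrics agent.
async function collectShardMetrics(shards) {
  const metrics = {};
  for (const [name, shard] of Object.entries(shards)) {
    const [size] = await shard.query(
      'SELECT pg_database_size(current_database()) AS bytes'
    );
    const [conns] = await shard.query(
      'SELECT count(*) AS connections FROM pg_stat_activity'
    );
    metrics[name] = {
      sizeBytes: Number(size.bytes),
      connections: Number(conns.connections)
    };
  }
  return metrics;
}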
Balancing Load
Techniques:
- Add read replicas to hot shards
- Split hot shard into two
- Move large tenants to dedicated shards
Backup and Recovery
Per-shard backups:
# Backup each shard
pg_dump shard1 > shard1_backup.sql
pg_dump shard2 > shard2_backup.sql
pg_dump shard3 > shard3_backup.sql
Point-in-time recovery:
- Each shard can restore independently
- Coordinate restore times across shards for consistency
Schema Migrations
Challenge: Apply schema changes across all shards.
Process:
# 1. Test migration on one shard
psql shard1 < migration.sql
# 2. If successful, roll out to all shards
for shard in shard1 shard2 shard3 shard4; do
psql $shard < migration.sql
done
Tools:
- Flyway, Liquibase for versioned migrations
- Run migrations in parallel for speed
- Have rollback plan
Real-World Example: Instagram
Instagram shards PostgreSQL by user_id.
Architecture (simplified):
User 1-10M → Shard 1
User 10M-20M → Shard 2
User 20M-30M → Shard 3
...
Shard key: user_id
Why:
- Most queries are user-scoped (get my feed, my photos)
- User data stays together
- No cross-shard joins for user operations
Query examples:
-- Single shard
SELECT * FROM photos WHERE user_id = 123;
-- Single shard
SELECT * FROM followers WHERE follower_id = 123;
-- Fan-out (rare)
SELECT * FROM photos WHERE tag = 'sunset';
-- Must query all shards
Benefits:
- Scales to billions of users
- Add shards as growth continues
- Each shard handles ~10M users
Challenges:
- Cross-user queries are expensive
- Resharding required as user base grows
- Application routing logic complexity
Alternatives to Sharding
Before you shard, consider:
1. NewSQL Databases (Auto-sharding)
CockroachDB, Google Spanner, YugabyteDB
Distributed SQL databases that shard automatically.
-- Looks like regular SQL
CREATE TABLE users (id UUID PRIMARY KEY, name TEXT);
INSERT INTO users VALUES (gen_random_uuid(), 'John Doe');
-- Automatically distributed across nodes
-- No manual sharding logic
Benefits:
- Distributed without application changes
- ACID transactions across nodes
- Automatic rebalancing
Trade-offs:
- Different from PostgreSQL/MySQL
- Migration effort
- Operational complexity
2. NoSQL (Built-in sharding)
MongoDB, Cassandra, DynamoDB
Designed for horizontal scaling.
// MongoDB sharding
sh.enableSharding("mydb");
sh.shardCollection("mydb.users", { "user_id": 1 });
// Automatically distributes data
Benefits:
- Sharding is built-in
- Scales easily
- High write throughput
Trade-offs:
- Different query model
- Eventual consistency (in some cases)
- No joins (in some cases)
3. Hybrid (Shard-per-service)
Microservices, each with own database.
User Service → User Database
Order Service → Order Database
Payment Service → Payment Database
Benefits:
- Service isolation
- Independent scaling
- Each service can shard independently if needed
Trade-offs:
- Cross-service queries complex
- Distributed transactions difficult
Checklist: Should You Shard?
Before sharding, have you:
- Optimized queries and added indexes?
- Upgraded to larger server (vertical scaling)?
- Added read replicas for read-heavy load?
- Implemented caching (Redis, Memcached)?
- Partitioned tables within single database?
- Profiled and identified actual bottleneck?
Shard when:
- Data size exceeds single server capacity
- Write throughput exceeds single server
- Geographic distribution required
- Tenant isolation required
Before sharding, plan:
- Shard key chosen (high cardinality, even distribution, aligns with queries)
- Sharding strategy decided (range, hash, geographic, entity)
- Routing architecture designed (app-level, proxy, or DB-native)
- Cross-shard query strategy (fan-out, denormalization, avoid)
- Resharding plan (how to add/remove shards)
- Monitoring and observability (per-shard metrics)
- Team has expertise to operate sharded system
Conclusion
Sharding is powerful but not free.
When to shard:
- You’ve exhausted other options
- You need to scale beyond single-server limits
- You have clear, stable shard key
How to shard:
- Choose shard key carefully (immutable, high cardinality, aligns with queries)
- Pick sharding strategy (range, hash, geographic, entity)
- Implement routing layer (app, proxy, or DB-native)
- Plan for cross-shard operations (avoid when possible)
- Design for resharding (you will need it)
Trade-offs:
- Complexity: Routing, cross-shard queries, transactions
- Flexibility: Harder to change shard key
- Operations: More databases to monitor, backup, migrate
Alternatives:
- NewSQL (auto-sharding)
- NoSQL (built-in sharding)
- Microservices (natural sharding boundary)
The right approach: Shard when you need to. Not before. Plan carefully. Start simple. Scale intentionally.
Build systems that scale. But only add complexity when the benefits justify the costs.
Resources:
- Vitess - MySQL Sharding
- Citus - PostgreSQL Sharding
- Consistent Hashing
- Sharding Strategies at Notion
“Shard when you must. But exhaust simpler options first.”