
Database Sharding Strategies for Scale: From Theory to Production

Ryan Dahlberg
November 8, 2025 · 17 min read

TL;DR

Database sharding is horizontal partitioning - splitting data across multiple database instances to scale beyond single-server limits. It’s powerful but complex. Shard only when you’ve exhausted vertical scaling (bigger servers), read replicas, and caching. Choose your shard key carefully - it determines data distribution and query patterns. Common strategies include range-based, hash-based, and geographic sharding. Implement routing logic, handle cross-shard queries, and plan for resharding. Sharding isn’t free - it adds complexity, but when done right, it enables effectively unlimited horizontal scale.

Key Concepts:

  • Sharding = horizontal partitioning across multiple databases
  • Shard key determines which database stores each record
  • Range-based: Shard by value ranges (user IDs 1-1M, 1M-2M)
  • Hash-based: Shard by hash of key (consistent hashing)
  • Geographic: Shard by location (EU data, US data)
  • Trade-offs: Complexity vs scalability

What is Sharding?

Sharding is splitting your database horizontally across multiple servers.

Vertical scaling: Bigger server (more CPU, RAM, storage)
Horizontal scaling: More servers (distribute data across them)

Single Database (No Sharding)

┌─────────────────────────┐
│   PostgreSQL Server     │
│                         │
│  All 100M user records  │
│  All 500M transactions  │
│  All application data   │
└─────────────────────────┘

Limits:

  • CPU: Single server’s processing power
  • RAM: Limited memory for caching
  • Storage: Single server’s disk capacity
  • I/O: Single disk’s throughput

Eventually you hit the ceiling.

Sharded Database

┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Shard 1    │  │  Shard 2    │  │  Shard 3    │  │  Shard 4    │
│             │  │             │  │             │  │             │
│  Users      │  │  Users      │  │  Users      │  │  Users      │
│  1-25M      │  │  25M-50M    │  │  50M-75M    │  │  75M-100M   │
└─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘

Benefits:

  • 4x CPU capacity
  • 4x RAM for caching
  • 4x storage capacity
  • 4x I/O throughput

Horizontal scale = add more shards.


When Do You Need Sharding?

Sharding is complex. Only do it when necessary.

Exhaust These First

1. Optimize queries

-- Add indexes
CREATE INDEX idx_users_email ON users(email);

-- Analyze slow queries
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';

2. Vertical scaling

Upgrade to a bigger server:

  • 8 cores → 32 cores
  • 32GB RAM → 256GB RAM
  • HDD → NVMe SSD

Modern servers are powerful. You can get far with vertical scaling.

3. Read replicas

         ┌──────────┐
         │  Primary │  (Writes)
         └────┬─────┘

      ┌───────┴───────┐
      │               │
┌─────▼────┐    ┌─────▼────┐
│ Replica 1│    │ Replica 2│  (Reads)
└──────────┘    └──────────┘

Scale reads horizontally. Most apps are read-heavy.
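
In application code this usually means sending writes to the primary and spreading reads across replicas. A minimal sketch, where connect() is a placeholder for your database client’s connection factory and any client with a query(sql, params) API would work:

const primary  = connect('db-primary.example.com');
const replicas = [
  connect('db-replica1.example.com'),
  connect('db-replica2.example.com')
];

let next = 0;
function readConnection() {
  // Simple round-robin across replicas
  next = (next + 1) % replicas.length;
  return replicas[next];
}

async function createUser(user) {
  // Writes always go to the primary
  return primary.query('INSERT INTO users (id, email) VALUES ($1, $2)', [user.id, user.email]);
}

async function getUserProfile(userId) {
  // Reads that can tolerate a little replication lag go to replicas
  return readConnection().query('SELECT * FROM users WHERE id = $1', [userId]);
}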

4. Caching

Application → Redis (cache) → Database

Cache hot data. For read-heavy workloads, a good cache layer can absorb 80%+ of database reads.
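
The usual pattern is cache-aside: check the cache, fall back to the database on a miss, then populate the cache with a TTL. A sketch using ioredis (db.query is a placeholder for your database client):

const Redis = require('ioredis');
const redis = new Redis();

async function getUser(userId) {
  const cacheKey = `user:${userId}`;

  // 1. Try the cache first
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Cache miss: fall back to the database
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);

  // 3. Populate the cache with a TTL so stale entries expire
  await redis.set(cacheKey, JSON.stringify(user), 'EX', 300);
  return user;
}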

5. Partitioning (single database)

-- PostgreSQL partitioning
CREATE TABLE orders (
  id BIGSERIAL,
  created_at TIMESTAMP,
  ...
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2025_q1 PARTITION OF orders
  FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');

Partition within single database first.

Shard When

You need sharding when:

1. Data exceeds single server capacity

100M+ users
Billions of records
Multi-terabyte datasets

2. Write traffic exceeds single server

100K+ writes/second
High concurrent transactions
Write-heavy workload

3. Geographic distribution required

EU users → EU database (GDPR compliance)
US users → US database (latency)
Asia users → Asia database (performance)

4. Tenant isolation

Enterprise customers get dedicated shard
Security/compliance requirements

Don’t shard prematurely. It’s a one-way door. Hard to un-shard.


Sharding Strategies

Choosing the right strategy is critical.

1. Range-Based Sharding

Partition by value ranges.

Example: User ID ranges

Shard 1: user_id 1 to 25,000,000
Shard 2: user_id 25,000,001 to 50,000,000
Shard 3: user_id 50,000,001 to 75,000,000
Shard 4: user_id 75,000,001 to 100,000,000

Routing logic:

function getShardForUser(userId) {
  if (userId <= 25_000_000) return 'shard1';
  if (userId <= 50_000_000) return 'shard2';
  if (userId <= 75_000_000) return 'shard3';
  return 'shard4';
}

Pros:

  • Simple to understand
  • Easy to add new ranges (new shard for 100M-125M)
  • Range queries stay on single shard
  • Good for time-series data

Cons:

  • Uneven distribution if data isn’t uniform
  • Hotspots (newest users on latest shard get all writes)
  • Requires planning ranges upfront

Best for:

  • Time-series data (logs, events, transactions)
  • Auto-incrementing IDs
  • Sequential data

Example: Time-based sharding

Shard 1: transactions from 2023
Shard 2: transactions from 2024
Shard 3: transactions from 2025

Recent data (hot) on separate shard from historical (cold).
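
Routing is the same idea as ID ranges, just keyed on the timestamp (a sketch; shard names are illustrative):

function getShardForTransaction(createdAt) {
  const year = new Date(createdAt).getFullYear();
  if (year <= 2023) return 'shard-2023';
  if (year === 2024) return 'shard-2024';
  return 'shard-2025'; // current "hot" shard
}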

2. Hash-Based Sharding

Partition by hash of key.

Example: Hash user ID

function getShardForUser(userId) {
  const hash = murmurhash(userId);
  const shardIndex = hash % NUM_SHARDS;
  return `shard${shardIndex + 1}`;
}

Distribution:

User 123   → hash(123) % 4 = 2 → Shard 3
User 456   → hash(456) % 4 = 1 → Shard 2
User 789   → hash(789) % 4 = 3 → Shard 4
User 1000  → hash(1000) % 4 = 0 → Shard 1

Pros:

  • Even distribution (no hotspots)
  • Deterministic (same key always goes to same shard)
  • No pre-planning of ranges

Cons:

  • Resharding is painful (changing number of shards rehashes everything)
  • Range queries span all shards
  • Related data might be split

Best for:

  • Evenly distributed keys
  • Point lookups (get by ID)
  • Avoiding hotspots

Consistent Hashing (Better):

// Use consistent hashing to minimize data movement when adding shards
const ring = new ConsistentHashRing(['shard1', 'shard2', 'shard3']);

function getShardForUser(userId) {
  return ring.getNode(userId);
}

// Adding shard4 only moves ~25% of data (1/4), not ~75%
ring.addNode('shard4');
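
ConsistentHashRing above isn’t a specific library - a minimal ring can be built from any uniform hash function and a sorted list of virtual nodes. A sketch, reusing murmurhash as the assumed string-to-integer hash:

class ConsistentHashRing {
  constructor(nodes, virtualNodes = 100) {
    this.virtualNodes = virtualNodes;
    this.ring = []; // sorted array of { hash, node }
    nodes.forEach(node => this.addNode(node));
  }

  addNode(node) {
    // Place many virtual points per node for a smoother distribution
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.push({ hash: murmurhash(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.hash - b.hash);
  }

  removeNode(node) {
    this.ring = this.ring.filter(entry => entry.node !== node);
  }

  getNode(key) {
    const h = murmurhash(String(key));
    // First virtual point clockwise from the key's hash (wrap to the start)
    const entry = this.ring.find(e => e.hash >= h) || this.ring[0];
    return entry.node;
  }
}

The linear scan in getNode is fine for a handful of shards; a production ring would binary-search the sorted hashes.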

3. Geographic Sharding

Partition by location.

Example: User location

Shard US-East:  Users in North America
Shard EU-West:  Users in Europe
Shard AP-South: Users in Asia-Pacific

Routing:

function getShardForUser(userId) {
  const user = getUserMetadata(userId);

  if (user.region === 'us') return 'shard-us-east';
  if (user.region === 'eu') return 'shard-eu-west';
  if (user.region === 'ap') return 'shard-ap-south';
  throw new Error(`No shard mapped for region: ${user.region}`);
}

Pros:

  • Low latency (data near users)
  • Compliance (GDPR, data sovereignty)
  • Geographic fault isolation

Cons:

  • Uneven distribution (more US users than EU)
  • Cross-region queries are slow/complex
  • Users moving between regions require migrating their data

Best for:

  • Global applications
  • Compliance requirements
  • Latency-sensitive apps

4. Entity/Tenant-Based Sharding

Partition by entity or tenant.

Example: Enterprise SaaS

Shard 1: Customers A-M
Shard 2: Customers N-Z
Shard 3: Large enterprise customers

Routing:

function getShardForCustomer(customerId) {
  const customer = getCustomer(customerId);

  if (customer.tier === 'enterprise') {
    return customer.dedicatedShard; // 'shard-acme-corp'
  }

  const firstLetter = customer.name[0].toUpperCase();
  if (firstLetter <= 'M') return 'shard1';
  return 'shard2';
}

Pros:

  • Tenant isolation (security, compliance)
  • Per-tenant optimization
  • Easy to migrate individual tenants

Cons:

  • Uneven load if tenants vary in size
  • More complex routing

Best for:

  • Multi-tenant SaaS
  • B2B applications
  • Customers requiring isolation

5. Hybrid Sharding

Combine strategies.

Example: Geographic + Hash

function getShard(userId, region) {
  // First shard by region
  let shards;
  if (region === 'us') shards = ['us-shard1', 'us-shard2'];
  if (region === 'eu') shards = ['eu-shard1', 'eu-shard2'];
  if (!shards) throw new Error(`Unknown region: ${region}`);

  // Then hash within region
  const hash = murmurhash(userId);
  const index = hash % shards.length;
  return shards[index];
}

Best for: Complex requirements needing multiple dimensions.


Choosing a Shard Key

The shard key is the most critical decision.

Criteria for Good Shard Key

1. High cardinality

Many unique values.

✅ user_id (millions of unique values)
✅ email (unique per user)
❌ status (only 'active' or 'inactive')
❌ country (only ~200 values)

2. Even distribution

Data spread evenly across shards.

✅ UUID (random distribution)
✅ Hash of email
❌ Auto-increment ID (all new records on one shard)
❌ Timestamp (all recent data on one shard)

Unless you’re intentionally using range-based sharding, where that skew is the point (e.g., time-series data).

3. Query patterns align

Most queries use the shard key.

-- Good: Query includes shard key
SELECT * FROM orders WHERE user_id = 123;
-- Can route to single shard

-- Bad: Query doesn't include shard key
SELECT * FROM orders WHERE status = 'pending';
-- Must query ALL shards

4. Immutable

Key doesn’t change.

✅ user_id (never changes)
✅ email (rarely changes, manageable)
❌ username (users change it)
❌ location (users move)

5. Accessible

Key is available when writing.

// Good: Have user_id when creating order
async function createOrder(userId, orderData) {
  const shard = getShardForUser(userId);
  await shard.insert('orders', { user_id: userId, ...orderData });
}

// Bad: Don't have shard key
function createGlobalEvent(eventData) {
  // Which shard? Event isn't tied to user
}

Common Shard Keys

user_id: Most common. Aligns with query patterns (get user’s data).

tenant_id: Multi-tenant apps. Each customer’s data together.

geographic_region: Global apps. Data near users.

timestamp: Time-series data. Recent data hot, old data cold.

entity_id: Specific entity (order_id for orders, product_id for products).


Implementation Architecture

Routing Layer

Application needs to know which shard to query.

1. Application-level routing

class ShardedDatabase {
  constructor(shards) {
    this.shards = shards; // { shard1: connection, shard2: connection }
  }

  getShardForUser(userId) {
    const hash = murmurhash(userId);
    const index = hash % Object.keys(this.shards).length;
    return this.shards[`shard${index + 1}`]; // shards are named shard1..shardN
  }

  async getUserById(userId) {
    const shard = this.getShardForUser(userId);
    return await shard.query('SELECT * FROM users WHERE id = $1', [userId]);
  }

  async getAllActiveUsers() {
    // Query all shards in parallel
    const results = await Promise.all(
      Object.values(this.shards).map(shard =>
        shard.query('SELECT * FROM users WHERE status = $1', ['active'])
      )
    );
    return results.flat();
  }
}

2. Proxy layer (Vitess, ProxySQL, etc.)

Application
     │
     ▼
┌─────────┐
│ Vitess  │  (Routing proxy)
└────┬────┘
     │
  ┌──┴───┬───────┬────────┐
  ▼      ▼       ▼        ▼
Shard1 Shard2 Shard3  Shard4

Application sends SQL to proxy. Proxy routes to correct shard.

Benefits:

  • Centralized routing logic
  • Connection pooling
  • Query rewriting
  • Transparent to application

3. Database-native sharding (Citus for PostgreSQL)

-- Citus shards automatically
SELECT create_distributed_table('users', 'user_id');

-- Queries automatically routed
SELECT * FROM users WHERE user_id = 123;  -- Single shard
SELECT * FROM users WHERE status = 'active';  -- All shards

Metadata Store

Track which shard contains what.

Shard map:

const shardMap = {
  'shard1': { range: [1, 25_000_000], host: 'db1.example.com' },
  'shard2': { range: [25_000_001, 50_000_000], host: 'db2.example.com' },
  'shard3': { range: [50_000_001, 75_000_000], host: 'db3.example.com' },
  'shard4': { range: [75_000_001, 100_000_000], host: 'db4.example.com' }
};

Store in:

  • Configuration file (simple, needs restart to update)
  • Redis/Consul (dynamic, highly available)
  • Database (centralized, single source of truth)
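
Whichever store you use, lookups against the map are simple range scans (a sketch using the shardMap above):

function getShardForUser(userId) {
  for (const [name, { range, host }] of Object.entries(shardMap)) {
    const [min, max] = range;
    if (userId >= min && userId <= max) {
      return { name, host };
    }
  }
  throw new Error(`No shard covers user_id ${userId}`);
}

// getShardForUser(30_000_000) → { name: 'shard2', host: 'db2.example.com' }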

Data Distribution

Initial load:

# Distribute existing data to shards
for user_id in $(seq 1 100000000); do
  shard=$(calculate_shard $user_id)
  migrate_user_to_shard $user_id $shard
done

Ongoing writes:

async function createUser(userData) {
  const userId = generateUserId();
  const shard = getShardForUser(userId);

  await shard.insert('users', {
    id: userId,
    ...userData
  });
}

Cross-Shard Operations

The hard part of sharding.

Cross-Shard Queries

Problem:

-- Query spans all shards
SELECT * FROM users WHERE status = 'active' ORDER BY created_at LIMIT 100;

Solution: Fan-out and merge

async function getActiveUsers() {
  // Query all shards in parallel
  const shardResults = await Promise.all(
    shards.map(shard =>
      shard.query('SELECT * FROM users WHERE status = $1 ORDER BY created_at LIMIT 100', ['active'])
    )
  );

  // Merge and re-sort
  const allResults = shardResults.flat();
  allResults.sort((a, b) => b.created_at - a.created_at);
  return allResults.slice(0, 100);
}

Performance impact: Query time = slowest shard + merge time.

Cross-Shard Joins

Problem:

-- Users and orders on different shards
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed';

Solutions:

1. Denormalize (recommended)

// Store user name with order
{
  order_id: 123,
  user_id: 456,
  user_name: 'John Doe',  // Denormalized
  total: 99.99
}

2. Application-level join

// Get orders
const orders = await getAllShards('orders', { status: 'completed' });

// Get user data for each order
const userIds = [...new Set(orders.map(o => o.user_id))];
const users = await getUsersByIds(userIds);  // Might hit multiple shards

// Join in application
const results = orders.map(order => ({
  ...order,
  user_name: users.find(u => u.id === order.user_id).name
}));

3. Shard together

Shard 1: Users 1-25M + Their Orders
Shard 2: Users 25M-50M + Their Orders

Related data stays together. Joins within shard.
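
Co-location falls out naturally if orders are routed by the owning user’s ID rather than the order’s own ID. A sketch reusing the routing helpers above (shard.insert and shard.query follow the same assumed client API as earlier examples):

async function createOrder(order) {
  // Route by user_id, not order_id, so a user's orders land on the user's shard
  const shard = getShardForUser(order.user_id);
  await shard.insert('orders', order);
}

async function getUserWithOrders(userId) {
  const shard = getShardForUser(userId);
  // The join stays inside one shard because both tables are sharded on user_id
  return shard.query(
    'SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id WHERE u.id = $1',
    [userId]
  );
}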

Cross-Shard Transactions

Problem: ACID transactions across shards are hard/impossible.

Avoid if possible:

// Bad: Transaction spans shards
BEGIN TRANSACTION;
  UPDATE shard1.accounts SET balance = balance - 100 WHERE user_id = 1;
  UPDATE shard2.accounts SET balance = balance + 100 WHERE user_id = 2;
COMMIT;  // Can't guarantee atomicity across shards!

Solutions:

1. Avoid cross-shard transactions. Shard so that most transactions stay within one shard:

Shard by account_id
Operations on a single account → single shard
Transfers between accounts → may still span shards

2. Two-phase commit (complex, slow)

Coordinator:
  1. Prepare phase: Ask all shards if they can commit
  2. Commit phase: If all say yes, commit; otherwise, rollback

Distributed transactions are slow and complex. Avoid.
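
For illustration only, a toy coordinator might look like the sketch below. The begin/prepare/commit/rollback calls are assumed methods on a shard transaction handle (e.g. backed by PostgreSQL’s PREPARE TRANSACTION), and a real coordinator also needs a durable log to recover from crashes:

async function twoPhaseCommit(operations) {
  // operations: [{ shard, run }] where run(tx) issues that shard's local writes
  const prepared = [];

  try {
    // Phase 1: prepare - every shard must vote yes before anything commits
    for (const { shard, run } of operations) {
      const tx = await shard.begin();
      await run(tx);
      await tx.prepare();
      prepared.push(tx);
    }

    // Phase 2: commit - only reached if all prepares succeeded
    for (const tx of prepared) {
      await tx.commit();
    }
  } catch (error) {
    // Any failure before commit: roll back everything that was prepared
    for (const tx of prepared) {
      await tx.rollback();
    }
    throw error;
  }
}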

3. Eventual consistency / Sagas

// Saga pattern
async function transferMoney(fromUser, toUser, amount) {
  let debited = false;
  try {
    // Step 1: Debit from user
    await debitAccount(fromUser, amount);
    debited = true;

    // Step 2: Credit to user
    await creditAccount(toUser, amount);
  } catch (error) {
    // Compensating transaction: only undo the debit if it actually happened
    if (debited) {
      await creditAccount(fromUser, amount);  // Rollback
    }
    throw error;
  }
}

Eventually consistent. Not ACID but acceptable for many use cases.


Resharding

Adding/removing shards or rebalancing data.

When to Reshard

  • Shard is full (disk space)
  • Shard is hot (too much load)
  • Adding capacity (more shards)
  • Rebalancing (uneven distribution)

Resharding Strategies

1. Stop-the-world (simple, downtime)

1. Stop writes
2. Redistribute data
3. Update shard map
4. Resume writes

Only for small datasets or maintenance windows.

2. Dual-write (zero downtime)

1. Start writing to both old and new shards
2. Backfill data from old to new shards
3. Verify data consistency
4. Switch reads to new shards
5. Stop writing to old shards
6. Decommission old shards

Process:

// Phase 1: Dual write
async function createUser(userData) {
  const userId = generateUserId();

  // Write to old shard
  const oldShard = getOldShard(userId);
  await oldShard.insert('users', { id: userId, ...userData });

  // Write to new shard
  const newShard = getNewShard(userId);
  await newShard.insert('users', { id: userId, ...userData });
}

// Phase 2: Backfill
async function backfillData() {
  // Read from each old shard and copy rows to their new home
  for (const oldShard of oldShards) {
    const users = await oldShard.query('SELECT * FROM users');

    for (const user of users) {
      const newShard = getNewShard(user.id);
      await newShard.insert('users', user);
    }
  }
}

// Phase 3: Switch reads
async function getUser(userId) {
  const newShard = getNewShard(userId);
  return await newShard.query('SELECT * FROM users WHERE id = $1', [userId]);
}

// Phase 4: Stop old writes, decommission

3. Consistent hashing (minimal data movement)

// Adding shard4 to a 3-shard setup
// Only ~25% of data moves (1/4), instead of the ~75% a naive hash % N rehash would move
const ring = new ConsistentHashRing(['shard1', 'shard2', 'shard3']);
ring.addNode('shard4');  // Minimal redistribution

Monitoring and Operations

Key Metrics

Per-shard metrics:

- Query rate (QPS per shard)
- Latency (p50, p95, p99)
- CPU utilization
- Disk usage
- Connection count
- Replication lag (if replicas)

Identify:

  • Hot shards (getting too much traffic)
  • Slow shards (high latency)
  • Full shards (disk space)
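
A periodic spot check can be scripted against each shard to catch these early. A sketch (pg_database_size and pg_stat_activity are standard PostgreSQL; shard connections are assumed to return rows from query as in the earlier examples):

async function shardHealth(shards) {
  const report = [];

  for (const [name, conn] of Object.entries(shards)) {
    const [size] = await conn.query(
      'SELECT pg_database_size(current_database()) AS bytes'
    );
    const [activity] = await conn.query(
      'SELECT count(*) AS connections FROM pg_stat_activity'
    );

    report.push({
      shard: name,
      sizeGb: Number(size.bytes) / 1e9,
      connections: Number(activity.connections)
    });
  }

  return report; // feed into dashboards/alerts to flag hot or full shards
}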

Balancing Load

Techniques:

  • Add read replicas to hot shards
  • Split hot shard into two
  • Move large tenants to dedicated shards

Backup and Recovery

Per-shard backups:

# Backup each shard
pg_dump shard1 > shard1_backup.sql
pg_dump shard2 > shard2_backup.sql
pg_dump shard3 > shard3_backup.sql

Point-in-time recovery:

  • Each shard can restore independently
  • Coordinate restore times across shards for consistency

Schema Migrations

Challenge: Apply schema changes across all shards.

Process:

# 1. Test migration on one shard
psql shard1 < migration.sql

# 2. If successful, roll out to the remaining shards
for shard in shard2 shard3 shard4; do
  psql $shard < migration.sql
done

Tools:

  • Flyway, Liquibase for versioned migrations
  • Run migrations in parallel for speed
  • Have rollback plan

Real-World Example: Instagram

Instagram shards PostgreSQL by user_id.

Architecture (simplified):

User 1-10M    → Shard 1
User 10M-20M  → Shard 2
User 20M-30M  → Shard 3
...

Shard key: user_id

Why:

  • Most queries are user-scoped (get my feed, my photos)
  • User data stays together
  • No cross-shard joins for user operations

Query examples:

-- Single shard
SELECT * FROM photos WHERE user_id = 123;

-- Single shard
SELECT * FROM followers WHERE follower_id = 123;

-- Fan-out (rare)
SELECT * FROM photos WHERE tag = 'sunset';
-- Must query all shards

Benefits:

  • Scales to billions of users
  • Add shards as growth continues
  • Each shard handles ~10M users

Challenges:

  • Cross-user queries are expensive
  • Resharding required as user base grows
  • Application routing logic complexity

Alternatives to Sharding

Before you shard, consider:

1. NewSQL Databases (Auto-sharding)

CockroachDB, Google Spanner, YugabyteDB

Distributed SQL databases that shard automatically.

-- Looks like regular SQL
CREATE TABLE users (id UUID PRIMARY KEY, name TEXT);
INSERT INTO users VALUES (gen_random_uuid(), 'John Doe');

-- Automatically distributed across nodes
-- No manual sharding logic

Benefits:

  • Distributed without application changes
  • ACID transactions across nodes
  • Automatic rebalancing

Trade-offs:

  • Different from PostgreSQL/MySQL
  • Migration effort
  • Operational complexity

2. NoSQL (Built-in sharding)

MongoDB, Cassandra, DynamoDB

Designed for horizontal scaling.

// MongoDB sharding
sh.enableSharding("mydb");
sh.shardCollection("mydb.users", { "user_id": 1 });

// Automatically distributes data

Benefits:

  • Sharding is built-in
  • Scales easily
  • High write throughput

Trade-offs:

  • Different query model
  • Eventual consistency (in some cases)
  • No joins (in some cases)

3. Hybrid (Shard-per-service)

Microservices, each with own database.

User Service    → User Database
Order Service   → Order Database
Payment Service → Payment Database

Benefits:

  • Service isolation
  • Independent scaling
  • Each service can shard independently if needed

Trade-offs:

  • Cross-service queries complex
  • Distributed transactions difficult

Checklist: Should You Shard?

Before sharding, have you:

  • Optimized queries and added indexes?
  • Upgraded to larger server (vertical scaling)?
  • Added read replicas for read-heavy load?
  • Implemented caching (Redis, Memcached)?
  • Partitioned tables within single database?
  • Profiled and identified actual bottleneck?

Shard when:

  • Data size exceeds single server capacity
  • Write throughput exceeds single server
  • Geographic distribution required
  • Tenant isolation required

Before sharding, plan:

  • Shard key chosen (high cardinality, even distribution, aligns with queries)
  • Sharding strategy decided (range, hash, geographic, entity)
  • Routing architecture designed (app-level, proxy, or DB-native)
  • Cross-shard query strategy (fan-out, denormalization, avoid)
  • Resharding plan (how to add/remove shards)
  • Monitoring and observability (per-shard metrics)
  • Team has expertise to operate sharded system

Conclusion

Sharding is powerful but not free.

When to shard:

  • You’ve exhausted other options
  • You need to scale beyond single-server limits
  • You have clear, stable shard key

How to shard:

  • Choose shard key carefully (immutable, high cardinality, aligns with queries)
  • Pick sharding strategy (range, hash, geographic, entity)
  • Implement routing layer (app, proxy, or DB-native)
  • Plan for cross-shard operations (avoid when possible)
  • Design for resharding (you will need it)

Trade-offs:

  • Complexity: Routing, cross-shard queries, transactions
  • Flexibility: Harder to change shard key
  • Operations: More databases to monitor, backup, migrate

Alternatives:

  • NewSQL (auto-sharding)
  • NoSQL (built-in sharding)
  • Microservices (natural sharding boundary)

The right approach: Shard when you need to. Not before. Plan carefully. Start simple. Scale intentionally.

Build systems that scale. But only add complexity when the benefits justify the costs.



“Shard when you must. But exhaust simpler options first.”

#Databases #Sharding #Scalability #Architecture #PostgreSQL #MySQL #Engineering