How We Reduced Infrastructure Costs by 60%
Six months ago, our infrastructure costs hit $150K per month. We served 2 million users with solid performance and reliability, but the economics weren’t sustainable. Leadership asked us to cut costs without compromising user experience.
This case study details how we reduced costs by 60% to $60K monthly while actually improving performance and reliability. The journey involved systematic measurement, architectural changes, and cultural shifts in how we thought about infrastructure spending.
Starting Point: The Cost Problem
Our infrastructure ran on AWS with a typical microservices architecture:
- 200+ EC2 instances across development, staging, and production
- RDS PostgreSQL with read replicas
- ElastiCache Redis for caching and sessions
- S3 for object storage
- CloudFront CDN for static assets
- Application Load Balancers
- Various AWS services (SQS, SNS, Lambda, etc.)
Monthly breakdown:
- Compute (EC2): $75K
- Database (RDS): $32K
- Cache (ElastiCache): $18K
- Data transfer: $15K
- Storage (S3, EBS): $7K
- Other services: $3K
We had grown organically without cost optimization focus. Engineers provisioned resources based on perceived needs rather than actual usage. Over-provisioning was common and considered “safe.”
Phase 1: Measurement and Visibility
You can’t optimize what you don’t measure. Our first step was understanding actual resource utilization.
Cost Allocation Tags
We implemented comprehensive tagging:
- Environment: production | staging | development
- Service: api | worker | frontend | etc.
- Team: platform | product | data
- Project: feature-name
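In practice the tags came from our provisioning tooling, but as a rough illustration, applying the scheme to a single instance with boto3 might look like this (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

# Apply the cost-allocation tag scheme to one instance (placeholder ID).
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[
        {"Key": "Environment", "Value": "production"},
        {"Key": "Service", "Value": "api"},
        {"Key": "Team", "Value": "platform"},
        {"Key": "Project", "Value": "feature-name"},
    ],
)
```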
Tags enabled cost analysis by service, team, and environment. This immediately revealed surprises:
- Staging environment cost 40% as much as production
- Three services accounted for 60% of compute costs
- Development environments ran 24/7 despite only being used during work hours
Monitoring Resource Utilization
We deployed detailed monitoring:
- CPU utilization per instance
- Memory usage patterns
- Network I/O
- Disk I/O and usage
- Database query patterns
- Cache hit rates
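Most of this came from the monitoring tools listed at the end of the post; for EC2, a CloudWatch query of this kind is one simple way to get a per-instance CPU average. A rough sketch (the instance ID is a placeholder):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average CPU over the last 14 days for one instance (placeholder ID).
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)
points = resp["Datapoints"]
avg_cpu = sum(p["Average"] for p in points) / max(len(points), 1)
print(f"14-day average CPU: {avg_cpu:.1f}%")
```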
The data showed systematic over-provisioning:
- Average CPU utilization: 18%
- Average memory utilization: 35%
- Many instances had <5% CPU usage
- Database instances provisioned for 10x actual load
Cost Anomaly Detection
We set up automated anomaly detection:
If cost increases >20% day-over-day:
    Alert team leads
    Include breakdown by service
    Compare to historical baseline
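The core day-over-day check is straightforward to sketch against the Cost Explorer API. This simplified version omits the per-service breakdown and uses a print in place of a real alert:

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")  # Cost Explorer

def daily_costs(days=8):
    """Total unblended cost per day for roughly the last `days` days."""
    end = date.today()
    start = end - timedelta(days=days)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return [float(r["Total"]["UnblendedCost"]["Amount"])
            for r in resp["ResultsByTime"]]

# Compare the two most recent complete days.
costs = daily_costs()
previous, latest = costs[-2], costs[-1]
if previous > 0 and latest / previous > 1.20:
    print(f"Cost up {latest / previous - 1:.0%} day-over-day")  # replace with a real alert
```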
This prevented cost surprises and caught issues like:
- Accidentally leaving large EC2 instances running
- Data transfer spikes from misconfigurations
- Zombie resources that should have been terminated
Phase 2: Low-Hanging Fruit
With visibility established, we tackled obvious waste.
Development and Staging Environments
Development environments only needed to run during work hours. We implemented:
Automated scheduling:
- Start at 8 AM local time
- Stop at 8 PM local time
- Keep stopped over weekends
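One simple way to implement this is a pair of scheduled jobs (for example, Lambda functions on an EventBridge schedule) keyed off the Environment tag. A sketch of the "stop" half, with the "start" job being symmetrical:

```python
import boto3

ec2 = boto3.client("ec2")

def stop_dev_instances(event=None, context=None):
    """Stop every running instance tagged Environment=development (pagination omitted)."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["development"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```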
Result: $18K monthly savings from development environments alone.
For staging, we implemented:
- On-demand staging environments created per feature branch
- Automatic teardown after PR merge
- Reduced standing staging infrastructure by 70%
Result: $12K monthly savings from staging optimization.
Right-Sizing Instances
We analyzed actual resource usage and right-sized instances:
API service:
- Was: m5.2xlarge (8 vCPU, 32GB RAM)
- Usage: 15% CPU, 8GB RAM
- Now: m5.large (2 vCPU, 8GB RAM)
- Savings per instance: 75%
We applied right-sizing systematically:
- Reduced instance sizes for 80% of services
- Increased only 3 services that were undersized
- Added more smaller instances where needed for redundancy
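AWS Compute Optimizer (listed under Tools below) supplied most of the candidates; pulling its recommendations programmatically is only a few lines. A hedged sketch; check the response fields against the current boto3 documentation:

```python
import boto3

optimizer = boto3.client("compute-optimizer")

# List current vs. top recommended instance type for each analyzed instance.
resp = optimizer.get_ec2_instance_recommendations()
for rec in resp["instanceRecommendations"]:
    best = min(rec["recommendationOptions"], key=lambda o: o.get("rank", 1))
    print(rec["currentInstanceType"], rec["finding"], "->", best["instanceType"])
```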
Result: $28K monthly savings from right-sizing.
Reserved Instances and Savings Plans
For predictable workloads, we purchased reserved capacity:
- 1-year reserved instances for production databases
- Compute savings plans for baseline compute needs
- Spot instances for fault-tolerant batch workloads
Result: $8K monthly savings from commitment-based pricing.
Storage Optimization
S3 held 50TB of data, much of it rarely accessed.
We implemented lifecycle policies:
- Age 0-30 days: Standard (frequent access)
- Age 31-90 days: Infrequent Access
- Age 91-365 days: Glacier Instant Retrieval
- Age 365+ days: Glacier Deep Archive
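Expressed as an S3 lifecycle configuration, the tiering policy looks roughly like this (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-assets-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-by-age",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "Transitions": [
                {"Days": 31, "StorageClass": "STANDARD_IA"},
                {"Days": 91, "StorageClass": "GLACIER_IR"},
                {"Days": 366, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```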
Deleted unused data:
- Old build artifacts
- Temporary files never cleaned up
- Redundant backups
- Test data in production
Result: $3K monthly savings from storage optimization.
Phase 3: Architectural Changes
Low-hanging fruit gave us 46% in savings ($69K per month). Further optimization required architectural changes.
Database Optimization
Our largest RDS instance cost $15K monthly.
Analysis revealed:
- Read queries dominated (95% reads, 5% writes)
- Many queries fetched unnecessary columns
- N+1 query patterns common
- No query caching layer
Optimization approach:
Added read-through cache:
import json

def get_user(user_id):
    # Check cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss - query the database, then cache the result for an hour
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user
This pattern, applied to hot data paths, reduced database load by 70%.
Query optimization:
- Fixed N+1 patterns with proper joins (see the sketch below)
- Added database indexes for common queries
- Selected only the columns we needed instead of SELECT *
- Batched queries where possible
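To make the N+1 fix concrete, here is a self-contained sketch using an in-memory SQLite database; our production queries ran against PostgreSQL, but the shape of the change is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Before: N+1 pattern - one query for users, then one query per user.
def order_totals_n_plus_one():
    users = conn.execute("SELECT id, name FROM users").fetchall()
    return {name: [t for (t,) in conn.execute(
                "SELECT total FROM orders WHERE user_id = ?", (uid,))]
            for uid, name in users}

# After: a single join, fetching only the columns we need.
def order_totals_joined():
    totals = {}
    rows = conn.execute(
        "SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id")
    for name, total in rows:
        totals.setdefault(name, []).append(total)
    return totals

assert order_totals_n_plus_one() == order_totals_joined()
```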
Connection pooling:
- Reduced max connections from 500 to 100
- Improved connection reuse
- Eliminated connection storms during traffic spikes
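The pooling itself lives in whatever database layer you use. As one hedged example, with SQLAlchemy (not mandated by anything above, just a common choice) a bounded pool looks like this; the DSN is a placeholder:

```python
from sqlalchemy import create_engine, text

# Bounded pool: at most pool_size + max_overflow connections per process.
engine = create_engine(
    "postgresql+psycopg2://app:secret@db-host:5432/app",  # placeholder DSN
    pool_size=10,
    max_overflow=10,
    pool_pre_ping=True,   # drop dead connections before handing them out
    pool_recycle=1800,    # recycle connections every 30 minutes
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```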
Results:
- Reduced RDS instance from db.r5.4xlarge to db.r5.xlarge
- Eliminated two read replicas (cache absorbed read traffic)
- Improved query response time by 40%
- Savings: $18K monthly
CDN and Caching Strategy
We paid significant data transfer costs for assets that could be cached:
Before:
- CloudFront TTL: 5 minutes
- Many assets not cached at all
- No cache invalidation strategy
After:
- Static assets: 1 year TTL with versioned filenames
- API responses: 1 hour TTL where appropriate
- Proper cache headers on all responses
- Automated cache invalidation on deployment
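For the versioned static assets, the long TTL is just a Cache-Control header set at upload time. A sketch with boto3 (bucket name and filename are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# A content hash in the filename makes the object effectively immutable,
# so a one-year TTL is safe.
s3.upload_file(
    "dist/app.3f2a9c.js",          # placeholder build artifact
    "example-static-assets",       # placeholder bucket behind CloudFront
    "assets/app.3f2a9c.js",
    ExtraArgs={
        "ContentType": "application/javascript",
        "CacheControl": "public, max-age=31536000, immutable",
    },
)
```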
Results:
- Origin requests dropped 85%
- Data transfer costs decreased 60%
- Page load time improved 30%
- Savings: $9K monthly
Compute Architecture Changes
Our microservices architecture had inefficiencies:
Service consolidation: Some microservices were over-engineered. We consolidated:
- 5 simple microservices → 1 service with multiple modules
- Reduced inter-service communication overhead
- Simplified deployment and monitoring
- Fewer EC2 instances needed
Serverless migration: For specific workloads, Lambda was more cost-effective:
- Batch processing jobs
- Scheduled tasks
- Webhook handlers
- Infrequent administrative tasks
Migrating these reduced EC2 instance count by 20.
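These workloads fit Lambda well because they are bursty and stateless. A webhook handler, for instance, reduces to a few lines behind an API Gateway proxy integration (this is a generic sketch, not our actual handler):

```python
import json

def handler(event, context):
    """Generic webhook receiver for an API Gateway proxy integration."""
    payload = json.loads(event.get("body") or "{}")
    # Hand the payload off for asynchronous processing (e.g., push to SQS) here.
    return {
        "statusCode": 202,
        "body": json.dumps({"received": True, "keys": sorted(payload)}),
    }
```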
Horizontal autoscaling tuning: Our autoscaling was conservative:
- Minimum instances: 3
- Scale up threshold: 60% CPU
New configuration:
- Minimum instances: 2
- Scale up threshold: 70% CPU
- Faster scale-down after traffic drops
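One way to express the new thresholds in AWS terms is a target-tracking policy plus a lower minimum on the Auto Scaling group; the group name below is a placeholder:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Track 70% average CPU instead of scaling at 60%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-asg",  # placeholder group name
    PolicyName="target-cpu-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 70.0,
    },
)

# Drop the floor from 3 instances to 2.
autoscaling.update_auto_scaling_group(AutoScalingGroupName="api-asg", MinSize=2)
```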
Results:
- Reduced average instance count by 30%
- Improved resource utilization to 50%+ CPU average
- Maintained p99 latency SLA
- Savings: $12K monthly
Phase 4: Data Transfer Optimization
Data transfer costs were high due to inefficient patterns.
Cross-AZ Transfer Elimination
Services communicated across availability zones unnecessarily:
Before:
- Services randomly placed across 3 AZs
- Every service call potentially crossed AZ boundaries
- Cross-AZ transfer: $0.01/GB each way
After:
- Services in same AZ communicate within AZ
- Cross-AZ only for redundancy, not normal traffic
- Reduced cross-AZ transfer by 80%
Savings: $4K monthly
Egress Optimization
We paid for outbound data transfer that could be avoided:
Implemented:
- Compression for all API responses (gzip)
- Image optimization (WebP format, proper sizing)
- Aggressive CDN caching
- Binary protocol for internal services (vs. JSON)
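The gzip win is easy to demonstrate on a typical JSON payload; the exact ratio depends on the data, but repetitive JSON compresses very well:

```python
import gzip
import json

payload = json.dumps(
    {"items": [{"id": i, "status": "active", "region": "us-east-1"} for i in range(1000)]}
).encode("utf-8")

compressed = gzip.compress(payload)
print(f"{len(payload)} bytes -> {len(compressed)} bytes "
      f"({1 - len(compressed) / len(payload):.0%} smaller)")
# The HTTP response then carries 'Content-Encoding: gzip' so clients decompress it.
```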
Results:
- Average response size reduced 60%
- Egress costs decreased 50%
- Savings: $3K monthly
Phase 5: Cultural Changes
Technical optimizations got us to $60K monthly. Sustaining this required cultural shifts.
Cost Ownership
We assigned cost ownership to engineering teams:
- Each team had a monthly cost budget
- Teams received weekly cost reports
- Cost reduction counted in performance reviews
This made cost everyone’s responsibility, not just the platform team’s.
Cost Consideration in Design
We updated our design review process:
New services required:
- Estimated monthly cost
- Justification for resource choices
- Consideration of serverless alternatives
- Auto-scaling strategy
This prevented new services from repeating old mistakes.
Reserved Capacity Planning
We established quarterly planning:
- Review actual usage patterns
- Purchase reserved instances for predictable load
- Right-size existing reservations
- Forecast capacity needs
This maximized savings plan benefits while remaining flexible.
Cost Optimization KPIs
We tracked cost efficiency metrics:
- Cost per user: Total infrastructure cost / monthly active users
- Cost per transaction: Infrastructure cost / API requests
- Waste percentage: Unused capacity / total capacity
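With the post's own numbers, the first metric works out to a few cents per user. A worked example:

```python
monthly_cost = 60_000            # post-optimization spend, USD/month
monthly_active_users = 2_000_000

cost_per_user = monthly_cost / monthly_active_users
print(f"${cost_per_user:.3f} per monthly active user")  # $0.030
```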
These metrics showed whether we were improving efficiency over time.
Results and Impact
After six months of optimization:
Cost reduction:
- Starting: $150K/month
- Ending: $60K/month
- Reduction: 60%
- Annual savings: $1.08M
Cost breakdown after optimization:
- Compute (EC2): $25K (was $75K)
- Database (RDS): $14K (was $32K)
- Cache (ElastiCache): $10K (was $18K)
- Data transfer: $6K (was $15K)
- Storage (S3, EBS): $4K (was $7K)
- Other services: $1K (was $3K)
Performance improvements:
- p99 latency: 450ms → 320ms (29% improvement)
- Cache hit rate: 40% → 85%
- Database query time: 180ms → 110ms (39% improvement)
- Page load time: 2.1s → 1.5s (29% improvement)
Reliability improvements:
- Monthly incidents: 3.5 → 1.2 (66% reduction)
- MTTR: 32 min → 18 min (44% improvement)
- Uptime: 99.8% → 99.95%
Lessons Learned
Start with Visibility
We couldn’t optimize without understanding where money went. Comprehensive tagging and monitoring were essential.
Low-Hanging Fruit First
Development environment scheduling and right-sizing gave us quick wins that built momentum. Start with obvious waste.
Measure Everything
We tracked cost, performance, and reliability throughout. This prevented optimizations that saved money but hurt user experience.
Architecture Matters
The biggest savings came from architectural changes: caching strategy, database optimization, and service consolidation.
Culture is Critical
Technical changes were necessary but insufficient. Cost awareness had to become part of engineering culture.
Optimization is Ongoing
Costs drift upward without continued attention. We established quarterly reviews to maintain efficiency.
Common Pitfalls to Avoid
Optimizing Too Early
Premature optimization wastes time. Wait until costs are large enough to justify the effort.
Sacrificing Reliability for Cost
We never compromised redundancy, monitoring, or backup strategies. Reliability pays for itself.
One-Time Optimization
Without ongoing attention, costs creep back up. Build optimization into regular processes.
Ignoring Developer Experience
Development environment scheduling saved money but initially frustrated developers. We adjusted based on feedback to balance cost and experience.
Recommendations for Others
If you’re facing similar cost challenges:
1. Establish Baseline Metrics
Before optimizing:
- Tag all resources comprehensively
- Monitor utilization for 2+ weeks
- Document current costs by service and team
- Measure performance and reliability baselines
2. Create a Cost Optimization Roadmap
Prioritize based on:
- Potential savings (high impact first)
- Implementation difficulty (quick wins early)
- Risk level (low-risk changes first)
3. Implement Gradually
Don’t change everything at once:
- Make one change at a time
- Measure impact before proceeding
- Be ready to roll back if issues arise
4. Communicate Transparently
Keep stakeholders informed:
- Share weekly progress updates
- Document savings achieved
- Explain trade-offs and decisions
- Celebrate wins with the team
5. Build for Sustainability
Make cost optimization part of normal operations:
- Include cost in design reviews
- Add cost metrics to dashboards
- Review costs in team retrospectives
- Share cost optimization knowledge
Tools We Used
Cost monitoring:
- AWS Cost Explorer for analysis
- CloudHealth for multi-cloud visibility
- Custom dashboards in Grafana
Resource monitoring:
- Datadog for metrics and APM
- CloudWatch for AWS-specific metrics
- Custom scripts for utilization analysis
Optimization:
- AWS Compute Optimizer for right-sizing recommendations
- AWS Trusted Advisor for best practice checks
- Custom automation for scheduling
Conclusion
Reducing infrastructure costs by 60% while improving performance was challenging but achievable. The key was systematic measurement, architectural improvements, and cultural change.
Cost optimization isn’t a one-time project - it’s an ongoing practice. By building cost awareness into engineering culture and regular processes, we maintained efficiency while continuing to grow.
For teams facing similar challenges, the path is clear: measure everything, start with obvious waste, make data-driven decisions, and build sustainable practices. The results speak for themselves.
Part of our Case Studies series sharing real-world experiences building and operating production systems.