How We Reduced Infrastructure Costs by 60%
Six months ago, our infrastructure costs hit $150K per month. We served 2 million users with solid performance and reliability, but the economics weren’t sustainable. Leadership asked us to cut costs without compromising user experience.
This case study details how we reduced costs by 60% to $60K monthly while actually improving performance and reliability. The journey involved systematic measurement, architectural changes, and cultural shifts in how we thought about infrastructure spending.
Starting Point: The Cost Problem
Our infrastructure ran on AWS with a typical microservices architecture:
- 200+ EC2 instances across development, staging, and production
- RDS PostgreSQL with read replicas
- ElastiCache Redis for caching and sessions
- S3 for object storage
- CloudFront CDN for static assets
- Application Load Balancers
- Various AWS services (SQS, SNS, Lambda, etc.)
Monthly breakdown:
- Compute (EC2): $75K
- Database (RDS): $32K
- Cache (ElastiCache): $18K
- Data transfer: $15K
- Storage (S3, EBS): $7K
- Other services: $3K
We had grown organically without cost optimization focus. Engineers provisioned resources based on perceived needs rather than actual usage. Over-provisioning was common and considered “safe.”
Phase 1: Measurement and Visibility
You can’t optimize what you don’t measure. Our first step was understanding actual resource utilization.
Cost Allocation Tags
We implemented comprehensive tagging:
- Environment: production | staging | development
- Service: api | worker | frontend | etc.
- Team: platform | product | data
- Project: feature-name
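In practice the tags came from our provisioning tooling, but as a rough illustration, applying the scheme to a single instance with boto3 might look like this (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

# Apply the cost-allocation tag scheme to one instance (placeholder ID).
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[
        {"Key": "Environment", "Value": "production"},
        {"Key": "Service", "Value": "api"},
        {"Key": "Team", "Value": "platform"},
        {"Key": "Project", "Value": "feature-name"},
    ],
)
```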
Tags enabled cost analysis by service, team, and environment. This immediately revealed surprises:
- Staging environment cost 40% as much as production
- Three services accounted for 60% of compute costs
- Development environments ran 24/7 despite only being used during work hours
Monitoring Resource Utilization
We deployed detailed monitoring:
- CPU utilization per instance
- Memory usage patterns
- Network I/O
- Disk I/O and usage
- Database query patterns
- Cache hit rates
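Most of this came from the monitoring tools listed at the end of the post; for EC2, a CloudWatch query of this kind is one simple way to get a per-instance CPU average. A rough sketch (the instance ID is a placeholder):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average CPU over the last 14 days for one instance (placeholder ID).
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)
points = resp["Datapoints"]
avg_cpu = sum(p["Average"] for p in points) / max(len(points), 1)
print(f"14-day average CPU: {avg_cpu:.1f}%")
```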
The data showed systematic over-provisioning:
- Average CPU utilization: 18%
- Average memory utilization: 35%
- Many instances had <5% CPU usage
- Database instances provisioned for 10x actual load
Cost Anomaly Detection
We set up automated anomaly detection:
If cost increases >20% day-over-day:
    Alert team leads
    Include breakdown by service
    Compare to historical baseline
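The core day-over-day check is straightforward to sketch against the Cost Explorer API. This simplified version omits the per-service breakdown and uses a print in place of a real alert:

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")  # Cost Explorer

def daily_costs(days=8):
    """Total unblended cost per day for roughly the last `days` days."""
    end = date.today()
    start = end - timedelta(days=days)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return [float(r["Total"]["UnblendedCost"]["Amount"])
            for r in resp["ResultsByTime"]]

# Compare the two most recent complete days.
costs = daily_costs()
previous, latest = costs[-2], costs[-1]
if previous > 0 and latest / previous > 1.20:
    print(f"Cost up {latest / previous - 1:.0%} day-over-day")  # replace with a real alert
```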
This prevented cost surprises and caught issues like:
- Accidentally leaving large EC2 instances running
- Data transfer spikes from misconfigurations
- Zombie resources that should have been terminated
Phase 2: Low-Hanging Fruit
With visibility established, we tackled obvious waste.
Development and Staging Environments
Development environments only needed to run during work hours. We implemented:
Automated scheduling:
- Start at 8 AM local time
- Stop at 8 PM local time
- Keep stopped over weekends
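One simple way to implement this is a pair of scheduled jobs (for example, Lambda functions on an EventBridge schedule) keyed off the Environment tag. A sketch of the "stop" half, with the "start" job being symmetrical:

```python
import boto3

ec2 = boto3.client("ec2")

def stop_dev_instances(event=None, context=None):
    """Stop every running instance tagged Environment=development (pagination omitted)."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["development"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```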
Result: $18K monthly savings from development environments alone.
For staging, we implemented:
- On-demand staging environments created per feature branch
- Automatic teardown after PR merge
- Reduced standing staging infrastructure by 70%
Result: $12K monthly savings from staging optimization.
Right-Sizing Instances
We analyzed actual resource usage and right-sized instances:
API service:
- Was: m5.2xlarge (8 vCPU, 32GB RAM)
- Usage: 15% CPU, 8GB RAM
- Now: m5.large (2 vCPU, 8GB RAM)
- Savings per instance: 75%
We applied right-sizing systematically:
- Reduced instance sizes for 80% of services
- Increased only 3 services that were undersized
- Added more smaller instances where needed for redundancy
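AWS Compute Optimizer (listed under Tools below) supplied most of the candidates; pulling its recommendations programmatically is only a few lines. A hedged sketch; check the response fields against the current boto3 documentation:

```python
import boto3

optimizer = boto3.client("compute-optimizer")

# List current vs. top recommended instance type for each analyzed instance.
resp = optimizer.get_ec2_instance_recommendations()
for rec in resp["instanceRecommendations"]:
    best = min(rec["recommendationOptions"], key=lambda o: o.get("rank", 1))
    print(rec["currentInstanceType"], rec["finding"], "->", best["instanceType"])
```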
Result: $28K monthly savings from right-sizing.
Reserved Instances and Savings Plans
For predictable workloads, we purchased reserved capacity:
- 1-year reserved instances for production databases
- Compute savings plans for baseline compute needs
- Spot instances for fault-tolerant batch workloads
Result: $8K monthly savings from commitment-based pricing.
Storage Optimization
S3 held 50TB of data, much of it rarely accessed.
We implemented lifecycle policies:
- Age 0-30 days: Standard (frequent access)
- Age 31-90 days: Infrequent Access
- Age 91-365 days: Glacier Instant Retrieval
- Age 365+ days: Glacier Deep Archive
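Expressed as an S3 lifecycle configuration, the tiering policy looks roughly like this (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-assets-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-by-age",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "Transitions": [
                {"Days": 31, "StorageClass": "STANDARD_IA"},
                {"Days": 91, "StorageClass": "GLACIER_IR"},
                {"Days": 366, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```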
Deleted unused data:
- Old build artifacts
- Temporary files never cleaned up
- Redundant backups
- Test data in production
Result: $3K monthly savings from storage optimization.
Phase 3: Architectural Changes
Low-hanging fruit gave us 46% in savings ($69K per month). Further optimization required architectural changes.
Database Optimization
Our largest RDS instance cost $15K monthly.
Analysis revealed:
- Read queries dominated (95% reads, 5% writes)
- Many queries fetched unnecessary columns
- N+1 query patterns common
- No query caching layer
Optimization approach:
Added read-through cache:
import json

def get_user(user_id):
    # Check cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache miss - query the database, then cache the result for an hour
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user
This pattern, applied to hot data paths, reduced database load by 70%.
Query optimization:
- Fixed N+1 patterns with proper joins (see the sketch below)
- Added database indexes for common queries
- Selected only the columns we needed instead of SELECT *
- Batched queries where possible
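To make the N+1 fix concrete, here is a self-contained sketch using an in-memory SQLite database; our production queries ran against PostgreSQL, but the shape of the change is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Before: N+1 pattern - one query for users, then one query per user.
def order_totals_n_plus_one():
    users = conn.execute("SELECT id, name FROM users").fetchall()
    return {name: [t for (t,) in conn.execute(
                "SELECT total FROM orders WHERE user_id = ?", (uid,))]
            for uid, name in users}

# After: a single join, fetching only the columns we need.
def order_totals_joined():
    totals = {}
    rows = conn.execute(
        "SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id")
    for name, total in rows:
        totals.setdefault(name, []).append(total)
    return totals

assert order_totals_n_plus_one() == order_totals_joined()
```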
Connection pooling:
- Reduced max connections from 500 to 100
- Improved connection reuse
- Eliminated connection storms during traffic spikes
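The pooling itself lives in whatever database layer you use. As one hedged example, with SQLAlchemy (not mandated by anything above, just a common choice) a bounded pool looks like this; the DSN is a placeholder:

```python
from sqlalchemy import create_engine, text

# Bounded pool: at most pool_size + max_overflow connections per process.
engine = create_engine(
    "postgresql+psycopg2://app:secret@db-host:5432/app",  # placeholder DSN
    pool_size=10,
    max_overflow=10,
    pool_pre_ping=True,   # drop dead connections before handing them out
    pool_recycle=1800,    # recycle connections every 30 minutes
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```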
Results:
- Reduced RDS instance from db.r5.4xlarge to db.r5.xlarge
- Eliminated two read replicas (cache absorbed read traffic)
- Improved query response time by 40%
- Savings: $18K monthly
CDN and Caching Strategy
We paid significant data transfer costs for assets that could be cached:
Before:
- CloudFront TTL: 5 minutes
- Many assets not cached at all
- No cache invalidation strategy
After:
- Static assets: 1 year TTL with versioned filenames
- API responses: 1 hour TTL where appropriate
- Proper cache headers on all responses
- Automated cache invalidation on deployment
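For the versioned static assets, the long TTL is just a Cache-Control header set at upload time. A sketch with boto3 (bucket name and filename are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# A content hash in the filename makes the object effectively immutable,
# so a one-year TTL is safe.
s3.upload_file(
    "dist/app.3f2a9c.js",          # placeholder build artifact
    "example-static-assets",       # placeholder bucket behind CloudFront
    "assets/app.3f2a9c.js",
    ExtraArgs={
        "ContentType": "application/javascript",
        "CacheControl": "public, max-age=31536000, immutable",
    },
)
```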
Results:
- Origin requests dropped 85%
- Data transfer costs decreased 60%
- Page load time improved 30%
- Savings: $9K monthly
Compute Architecture Changes
Our microservices architecture had inefficiencies:
Service consolidation: Some microservices were over-engineered. We consolidated:
- 5 simple microservices → 1 service with multiple modules
- Reduced inter-service communication overhead
- Simplified deployment and monitoring
- Fewer EC2 instances needed
Serverless migration: For specific workloads, Lambda was more cost-effective:
- Batch processing jobs
- Scheduled tasks
- Webhook handlers
- Infrequent administrative tasks
Migrating these reduced EC2 instance count by 20.
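These workloads fit Lambda well because they are bursty and stateless. A webhook handler, for instance, reduces to a few lines behind an API Gateway proxy integration (this is a generic sketch, not our actual handler):

```python
import json

def handler(event, context):
    """Generic webhook receiver for an API Gateway proxy integration."""
    payload = json.loads(event.get("body") or "{}")
    # Hand the payload off for asynchronous processing (e.g., push to SQS) here.
    return {
        "statusCode": 202,
        "body": json.dumps({"received": True, "keys": sorted(payload)}),
    }
```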
Horizontal autoscaling tuning: Our autoscaling was conservative:
- Minimum instances: 3
- Scale up threshold: 60% CPU
New configuration:
- Minimum instances: 2
- Scale up threshold: 70% CPU
- Faster scale-down after traffic drops
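One way to express the new thresholds in AWS terms is a target-tracking policy plus a lower minimum on the Auto Scaling group; the group name below is a placeholder:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Track 70% average CPU instead of scaling at 60%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-asg",  # placeholder group name
    PolicyName="target-cpu-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 70.0,
    },
)

# Drop the floor from 3 instances to 2.
autoscaling.update_auto_scaling_group(AutoScalingGroupName="api-asg", MinSize=2)
```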
Results:
- Reduced average instance count by 30%
- Improved resource utilization to 50%+ CPU average
- Maintained p99 latency SLA
- Savings: $12K monthly
Phase 4: Data Transfer Optimization
Data transfer costs were high due to inefficient patterns.
Cross-AZ Transfer Elimination
Services communicated across availability zones unnecessarily:
Before:
- Services randomly placed across 3 AZs
- Every service call potentially crossed AZ boundaries
- Cross-AZ transfer: $0.01/GB each way
After:
- Services in same AZ communicate within AZ
- Cross-AZ only for redundancy, not normal traffic
- Reduced cross-AZ transfer by 80%
Savings: $4K monthly
Egress Optimization
We paid for outbound data transfer that could be avoided:
Implemented:
- Compression for all API responses (gzip)
- Image optimization (WebP format, proper sizing)
- Aggressive CDN caching
- Binary protocol for internal services (vs. JSON)
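The gzip win is easy to demonstrate on a typical JSON payload; the exact ratio depends on the data, but repetitive JSON compresses very well:

```python
import gzip
import json

payload = json.dumps(
    {"items": [{"id": i, "status": "active", "region": "us-east-1"} for i in range(1000)]}
).encode("utf-8")

compressed = gzip.compress(payload)
print(f"{len(payload)} bytes -> {len(compressed)} bytes "
      f"({1 - len(compressed) / len(payload):.0%} smaller)")
# The HTTP response then carries 'Content-Encoding: gzip' so clients decompress it.
```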
Results:
- Average response size reduced 60%
- Egress costs decreased 50%
- Savings: $3K monthly
Phase 5: Cultural Changes
Technical optimizations got us to $60K monthly. Sustaining this required cultural shifts.
Cost Ownership
We assigned cost ownership to engineering teams:
- Each team had a monthly cost budget
- Teams received weekly cost reports
- Cost reduction counted in performance reviews
This made cost everyone’s responsibility, not just the platform team’s.
Cost Consideration in Design
We updated our design review process:
New services required:
- Estimated monthly cost
- Justification for resource choices
- Consideration of serverless alternatives
- Auto-scaling strategy
This prevented new services from repeating old mistakes.
Reserved Capacity Planning
We established quarterly planning:
- Review actual usage patterns
- Purchase reserved instances for predictable load
- Right-size existing reservations
- Forecast capacity needs
This maximized savings plan benefits while remaining flexible.
Cost Optimization KPIs
We tracked cost efficiency metrics:
- Cost per user: Total infrastructure cost / monthly active users
- Cost per transaction: Infrastructure cost / API requests
- Waste percentage: Unused capacity / total capacity
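With the post's own numbers, the first metric works out to a few cents per user. A worked example:

```python
monthly_cost = 60_000            # post-optimization spend, USD/month
monthly_active_users = 2_000_000

cost_per_user = monthly_cost / monthly_active_users
print(f"${cost_per_user:.3f} per monthly active user")  # $0.030
```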
These metrics showed whether we were improving efficiency over time.
Results and Impact
After six months of optimization:
Cost reduction:
- Starting: $150K/month
- Ending: $60K/month
- Reduction: 60%
- Annual savings: $1.08M
Cost breakdown after optimization:
- Compute (EC2): $25K (was $75K)
- Database (RDS): $14K (was $32K)
- Cache (ElastiCache): $10K (was $18K)
- Data transfer: $6K (was $15K)
- Storage (S3, EBS): $4K (was $7K)
- Other services: $1K (was $3K)
Performance improvements:
- p99 latency: 450ms → 320ms (29% improvement)
- Cache hit rate: 40% → 85%
- Database query time: 180ms → 110ms (39% improvement)
- Page load time: 2.1s → 1.5s (29% improvement)
Reliability improvements:
- Monthly incidents: 3.5 → 1.2 (66% reduction)
- MTTR: 32 min → 18 min (44% improvement)
- Uptime: 99.8% → 99.95%
Lessons Learned
Start with Visibility
We couldn’t optimize without understanding where money went. Comprehensive tagging and monitoring were essential.
Low-Hanging Fruit First
Development environment scheduling and right-sizing gave us quick wins that built momentum. Start with obvious waste.
Measure Everything
We tracked cost, performance, and reliability throughout. This prevented optimizations that saved money but hurt user experience.
Architecture Matters
The biggest savings came from architectural changes: caching strategy, database optimization, and service consolidation.
Culture is Critical
Technical changes were necessary but insufficient. Cost awareness had to become part of engineering culture.
Optimization is Ongoing
Costs drift upward without continued attention. We established quarterly reviews to maintain efficiency.
Common Pitfalls to Avoid
Optimizing Too Early
Premature optimization wastes time. Wait until costs are large enough to justify the effort.
Sacrificing Reliability for Cost
We never compromised redundancy, monitoring, or backup strategies. Reliability pays for itself.
One-Time Optimization
Without ongoing attention, costs creep back up. Build optimization into regular processes.
Ignoring Developer Experience
Development environment scheduling saved money but initially frustrated developers. We adjusted based on feedback to balance cost and experience.
Recommendations for Others
If you’re facing similar cost challenges:
1. Establish Baseline Metrics
Before optimizing:
- Tag all resources comprehensively
- Monitor utilization for 2+ weeks
- Document current costs by service and team
- Measure performance and reliability baselines
2. Create a Cost Optimization Roadmap
Prioritize based on:
- Potential savings (high impact first)
- Implementation difficulty (quick wins early)
- Risk level (low-risk changes first)
3. Implement Gradually
Don’t change everything at once:
- Make one change at a time
- Measure impact before proceeding
- Be ready to roll back if issues arise
4. Communicate Transparently
Keep stakeholders informed:
- Share weekly progress updates
- Document savings achieved
- Explain trade-offs and decisions
- Celebrate wins with the team
5. Build for Sustainability
Make cost optimization part of normal operations:
- Include cost in design reviews
- Add cost metrics to dashboards
- Review costs in team retrospectives
- Share cost optimization knowledge
Tools We Used
Cost monitoring:
- AWS Cost Explorer for analysis
- CloudHealth for multi-cloud visibility
- Custom dashboards in Grafana
Resource monitoring:
- Datadog for metrics and APM
- CloudWatch for AWS-specific metrics
- Custom scripts for utilization analysis
Optimization:
- AWS Compute Optimizer for right-sizing recommendations
- AWS Trusted Advisor for best practice checks
- Custom automation for scheduling
Conclusion
Reducing infrastructure costs by 60% while improving performance was challenging but achievable. The key was systematic measurement, architectural improvements, and cultural change.
Cost optimization isn’t a one-time project - it’s an ongoing practice. By building cost awareness into engineering culture and regular processes, we maintained efficiency while continuing to grow.
For teams facing similar challenges, the path is clear: measure everything, start with obvious waste, make data-driven decisions, and build sustainable practices. The results speak for themselves.
Part of our Case Studies series sharing real-world experiences building and operating production systems.