How We Reduced Infrastructure Costs by 60%

Ryan Dahlberg
October 5, 2025 · 11 min read

Six months ago, our infrastructure costs hit $150K per month. We served 2 million users with solid performance and reliability, but the economics weren’t sustainable. Leadership asked us to cut costs without compromising user experience.

This case study details how we reduced costs by 60% to $60K monthly while actually improving performance and reliability. The journey involved systematic measurement, architectural changes, and cultural shifts in how we thought about infrastructure spending.

Starting Point: The Cost Problem

Our infrastructure ran on AWS with a typical microservices architecture:

  • 200+ EC2 instances across development, staging, and production
  • RDS PostgreSQL with read replicas
  • ElastiCache Redis for caching and sessions
  • S3 for object storage
  • CloudFront CDN for static assets
  • Application Load Balancers
  • Various AWS services (SQS, SNS, Lambda, etc.)

Monthly breakdown:

  • Compute (EC2): $75K
  • Database (RDS): $32K
  • Cache (ElastiCache): $18K
  • Data transfer: $15K
  • Storage (S3, EBS): $7K
  • Other services: $3K

We had grown organically without cost optimization focus. Engineers provisioned resources based on perceived needs rather than actual usage. Over-provisioning was common and considered “safe.”

Phase 1: Measurement and Visibility

You can’t optimize what you don’t measure. Our first step was understanding actual resource utilization.

Cost Allocation Tags

We implemented comprehensive tagging:

Environment: production | staging | development
Service: api | worker | frontend | etc.
Team: platform | product | data
Project: feature-name

Tags enabled cost analysis by service, team, and environment. This immediately revealed surprises:

  • Staging environment cost 40% as much as production
  • Three services accounted for 60% of compute costs
  • Development environments ran 24/7 despite only being used during work hours
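
For illustration, here is a minimal sketch of how a tag-based cost breakdown like this can be pulled with the Cost Explorer API via boto3. The tag keys match the scheme above; the date range and output formatting are placeholders rather than our exact tooling:

import boto3

# Cost Explorer is served from us-east-1 regardless of where workloads run
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-03-01", "End": "2025-04-01"},  # illustrative month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[
        {"Type": "TAG", "Key": "Team"},
        {"Type": "TAG", "Key": "Environment"},
    ],
)

# Print each (team, environment) group with its monthly cost
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(group["Keys"], f"${cost:,.2f}")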

Monitoring Resource Utilization

We deployed detailed monitoring:

  • CPU utilization per instance
  • Memory usage patterns
  • Network I/O
  • Disk I/O and usage
  • Database query patterns
  • Cache hit rates

The data showed systematic over-provisioning:

  • Average CPU utilization: 18%
  • Average memory utilization: 35%
  • Many instances had <5% CPU usage
  • Database instances provisioned for 10x actual load

Cost Anomaly Detection

We set up automated anomaly detection:

If cost increases >20% day-over-day:
  Alert team leads
  Include breakdown by service
  Compare to historical baseline

This prevented cost surprises and caught issues like:

  • Accidentally leaving large EC2 instances running
  • Data transfer spikes from misconfigurations
  • Zombie resources that should have been terminated
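
A minimal sketch of the day-over-day check itself, assuming daily-granularity Cost Explorer data grouped by the Service tag; the 20% threshold mirrors the rule above, and printing stands in for whatever alerting hook you use:

import boto3
from datetime import date, timedelta

ce = boto3.client("ce", region_name="us-east-1")

def costs_by_service(day_count=2):
    """Per-service cost for the last `day_count` full days, newest last."""
    end = date.today()
    start = end - timedelta(days=day_count)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "Service"}],
    )
    days = []
    for result in resp["ResultsByTime"]:
        days.append({
            g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
            for g in result["Groups"]
        })
    return days

previous, current = costs_by_service()
for service, cost in current.items():
    baseline = previous.get(service, 0.0)
    if baseline and cost > baseline * 1.2:  # >20% day-over-day increase
        print(f"ALERT: {service} rose from ${baseline:,.2f} to ${cost:,.2f}")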

Phase 2: Low-Hanging Fruit

With visibility established, we tackled obvious waste.

Development and Staging Environments

Development environments only needed to run during work hours. We implemented:

Automated scheduling:

  • Start at 8 AM local time
  • Stop at 8 PM local time
  • Keep stopped over weekends

Result: $18K monthly savings from development environments alone.
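
A sketch of the scheduling automation, assuming instances carry the Environment tag from our tagging scheme and the function is invoked on a schedule (cron, EventBridge, or similar):

import boto3

ec2 = boto3.client("ec2")

def set_dev_instances(action):
    """Start or stop every instance tagged Environment=development."""
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(
        Filters=[{"Name": "tag:Environment", "Values": ["development"]}]
    ):
        for reservation in page["Reservations"]:
            instance_ids += [i["InstanceId"] for i in reservation["Instances"]]

    if not instance_ids:
        return
    if action == "start":
        ec2.start_instances(InstanceIds=instance_ids)
    else:
        ec2.stop_instances(InstanceIds=instance_ids)

# Scheduled externally: set_dev_instances("start") at 8 AM, ("stop") at 8 PM, weekdays only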

For staging, we implemented:

  • On-demand staging environments created per feature branch
  • Automatic teardown after PR merge
  • Reduced standing staging infrastructure by 70%

Result: $12K monthly savings from staging optimization.

Right-Sizing Instances

We analyzed actual resource usage and right-sized instances:

API service:

  • Was: m5.2xlarge (8 vCPU, 32GB RAM)
  • Usage: 15% CPU, 8GB RAM
  • Now: m5.large (2 vCPU, 8GB RAM)
  • Savings per instance: 75%

We applied right-sizing systematically:

  • Reduced instance sizes for 80% of services
  • Increased only 3 services that were undersized
  • Added more smaller instances where needed for redundancy

Result: $28K monthly savings from right-sizing.
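
A rough sketch of the utilization analysis behind these decisions: pull two weeks of CPU data from CloudWatch and flag low-utilization instances. The instance ID and the 20% threshold are illustrative:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

def average_cpu(instance_id, days=14):
    """Average CPU utilization for an instance over the last `days` days."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,  # hourly datapoints
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

# Flag candidates for a smaller instance type
if average_cpu("i-0123456789abcdef0") < 20:  # illustrative instance ID and threshold
    print("candidate for right-sizing")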

Reserved Instances and Savings Plans

For predictable workloads, we purchased reserved capacity:

  • 1-year reserved instances for production databases
  • Compute savings plans for baseline compute needs
  • Spot instances for fault-tolerant batch workloads

Result: $8K monthly savings from commitment-based pricing.

Storage Optimization

S3 held 50TB of data, much of it rarely accessed:

Implemented lifecycle policies:

Age 0-30 days: Standard (frequent access)
Age 31-90 days: Infrequent Access
Age 91-365 days: Glacier Instant Retrieval
Age 365+ days: Glacier Deep Archive
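
Expressed as a boto3 lifecycle configuration, the tiers above look roughly like this (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tiered-archival",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to all objects
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)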

Deleted unused data:

  • Old build artifacts
  • Temporary files never cleaned up
  • Redundant backups
  • Test data in production

Result: $3K monthly savings from storage optimization.

Phase 3: Architectural Changes

The low-hanging fruit saved us $69K per month, roughly 46% of the original bill, bringing us to $81K monthly. Further optimization required architectural changes.

Database Optimization

Our largest RDS instance cost $15K monthly.

Analysis revealed:

  • Read queries dominated (95% reads, 5% writes)
  • Many queries fetched unnecessary columns
  • N+1 query patterns common
  • No query caching layer

Optimization approach:

Added read-through cache:

import json
import redis

cache = redis.Redis()  # shared cache client

def get_user(user_id):
    # Check cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Cache miss - query database (db is the application's query helper)
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    cache.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

Applied to hot data paths, this pattern reduced database load by 70%.

Query optimization:

  • Fixed N+1 patterns with proper joins (sketched below)
  • Added database indexes for common queries
  • Selected specific columns instead of SELECT *
  • Batched queries where possible
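
A sketch of the N+1 fix, in the same style as the caching snippet above. As before, db is the application's query helper, and the orders/users schema is illustrative:

# Before: one extra query per order (N+1)
orders = db.query("SELECT id, user_id, total FROM orders WHERE status = ?", "open")
for order in orders:
    order["user"] = db.query("SELECT id, name FROM users WHERE id = ?", order["user_id"])

# After: a single joined query that also fetches only the columns we need
orders = db.query(
    """
    SELECT o.id, o.total, u.id AS user_id, u.name AS user_name
    FROM orders o
    JOIN users u ON u.id = o.user_id
    WHERE o.status = ?
    """,
    "open",
)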

Connection pooling:

  • Reduced max connections from 500 to 100
  • Improved connection reuse
  • Eliminated connection storms during traffic spikes

Results:

  • Reduced RDS instance from db.r5.4xlarge to db.r5.xlarge
  • Eliminated two read replicas (cache absorbed read traffic)
  • Improved query response time by 40%
  • Savings: $18K monthly

CDN and Caching Strategy

We paid significant data transfer costs for assets that could be cached:

Before:

  • CloudFront TTL: 5 minutes
  • Many assets not cached at all
  • No cache invalidation strategy

After:

  • Static assets: 1 year TTL with versioned filenames
  • API responses: 1 hour TTL where appropriate
  • Proper cache headers on all responses
  • Automated cache invalidation on deployment
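
On the application side, the cache headers looked roughly like this Flask-style sketch; the routes, TTL values, and load_catalog helper are illustrative rather than our exact code:

from flask import Flask, jsonify, send_from_directory

app = Flask(__name__)

def load_catalog():
    # Illustrative stand-in for the real data access layer
    return [{"id": 1, "name": "example"}]

@app.route("/static-assets/<path:filename>")
def static_asset(filename):
    # Versioned filenames (e.g. app.3f2a1c.js) are safe to cache for a year
    response = send_from_directory("static", filename)
    response.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return response

@app.route("/api/catalog")
def catalog():
    # Read-mostly API response: allow CDN/edge caching for an hour
    response = jsonify(load_catalog())
    response.headers["Cache-Control"] = "public, max-age=3600"
    return response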

Results:

  • Origin requests dropped 85%
  • Data transfer costs decreased 60%
  • Page load time improved 30%
  • Savings: $9K monthly

Compute Architecture Changes

Our microservices architecture had inefficiencies:

Service consolidation: Some microservices were over-engineered. We consolidated:

  • 5 simple microservices → 1 service with multiple modules
  • Reduced inter-service communication overhead
  • Simplified deployment and monitoring
  • Fewer EC2 instances needed

Serverless migration: For specific workloads, Lambda was more cost-effective:

  • Batch processing jobs
  • Scheduled tasks
  • Webhook handlers
  • Infrequent administrative tasks

Migrating these reduced EC2 instance count by 20.

Horizontal autoscaling tuning: Our autoscaling was conservative:

  • Minimum instances: 3
  • Scale up threshold: 60% CPU

New configuration:

  • Minimum instances: 2
  • Scale up threshold: 70% CPU
  • Faster scale-down after traffic drops
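
One way to express that configuration with boto3, as a sketch rather than our exact setup: the Auto Scaling group name is a placeholder, and the 70% figure becomes a target-tracking target:

import boto3

autoscaling = boto3.client("autoscaling")

# Lower the instance floor for the group
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="api-service",  # illustrative group name
    MinSize=2,
)

# Track average CPU at 70%; target tracking handles both scale-out and scale-in
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-service",
    PolicyName="cpu-target-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,
    },
)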

Results:

  • Reduced average instance count by 30%
  • Improved resource utilization to 50%+ CPU average
  • Maintained p99 latency SLA
  • Savings: $12K monthly

Phase 4: Data Transfer Optimization

Data transfer costs were high due to inefficient patterns.

Cross-AZ Transfer Elimination

Services communicated across availability zones unnecessarily:

Before:

  • Services randomly placed across 3 AZs
  • Every service call potentially crossed AZ boundaries
  • Cross-AZ transfer: $0.01/GB each way

After:

  • Services in same AZ communicate within AZ
  • Cross-AZ only for redundancy, not normal traffic
  • Reduced cross-AZ transfer by 80%

Savings: $4K monthly

Egress Optimization

We paid for outbound data transfer that could be avoided:

Implemented:

  • Compression for all API responses (gzip)
  • Image optimization (WebP format, proper sizing)
  • Aggressive CDN caching
  • Binary protocol for internal services (vs. JSON)
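
A quick illustration of what compression alone buys on a repetitive JSON payload, using only the Python standard library (exact ratios vary by payload):

import gzip
import json

# A repetitive JSON payload, typical of list endpoints
payload = json.dumps(
    [{"id": i, "status": "active", "plan": "pro"} for i in range(1000)]
).encode()

compressed = gzip.compress(payload)
print(f"raw: {len(payload)} bytes, gzipped: {len(compressed)} bytes "
      f"({100 * (1 - len(compressed) / len(payload)):.0f}% smaller)")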

Results:

  • Average response size reduced 60%
  • Egress costs decreased 50%
  • Savings: $3K monthly

Phase 5: Cultural Changes

Technical optimizations got us to $60K monthly. Sustaining this required cultural shifts.

Cost Ownership

We assigned cost ownership to engineering teams:

  • Each team had a monthly cost budget
  • Teams received weekly cost reports
  • Cost reduction counted in performance reviews

This made cost everyone’s responsibility, not just the platform team’s.

Cost Consideration in Design

We updated our design review process:

New services required:

  • Estimated monthly cost
  • Justification for resource choices
  • Consideration of serverless alternatives
  • Auto-scaling strategy

This prevented new services from repeating old mistakes.

Reserved Capacity Planning

We established quarterly planning:

  • Review actual usage patterns
  • Purchase reserved instances for predictable load
  • Right-size existing reservations
  • Forecast capacity needs

This maximized savings plan benefits while remaining flexible.

Cost Optimization KPIs

We tracked cost efficiency metrics:

  • Cost per user: Total infrastructure cost / monthly active users
  • Cost per transaction: Infrastructure cost / API requests
  • Waste percentage: Unused capacity / total capacity

These metrics showed whether we were improving efficiency over time.
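
A tiny sketch of how these ratios can be computed; the request count and waste figures below are illustrative:

def cost_kpis(monthly_cost, monthly_active_users, api_requests, unused_capacity, total_capacity):
    """Return the cost-efficiency metrics described above."""
    return {
        "cost_per_user": monthly_cost / monthly_active_users,
        "cost_per_transaction": monthly_cost / api_requests,
        "waste_pct": 100 * unused_capacity / total_capacity,
    }

# Illustrative figures: $60K/month, 2M MAU, 500M API requests, 35% unused capacity
print(cost_kpis(60_000, 2_000_000, 500_000_000, 35, 100))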

Results and Impact

After six months of optimization:

Cost reduction:

  • Starting: $150K/month
  • Ending: $60K/month
  • Reduction: 60%
  • Annual savings: $1.08M

Cost breakdown after optimization:

  • Compute (EC2): $25K (was $75K)
  • Database (RDS): $14K (was $32K)
  • Cache (ElastiCache): $10K (was $18K)
  • Data transfer: $6K (was $15K)
  • Storage (S3, EBS): $4K (was $7K)
  • Other services: $1K (was $3K)

Performance improvements:

  • p99 latency: 450ms → 320ms (29% improvement)
  • Cache hit rate: 40% → 85%
  • Database query time: 180ms → 110ms (39% improvement)
  • Page load time: 2.1s → 1.5s (29% improvement)

Reliability improvements:

  • Monthly incidents: 3.5 → 1.2 (66% reduction)
  • MTTR: 32 min → 18 min (44% improvement)
  • Uptime: 99.8% → 99.95%

Lessons Learned

Start with Visibility

We couldn’t optimize without understanding where money went. Comprehensive tagging and monitoring were essential.

Low-Hanging Fruit First

Development environment scheduling and right-sizing gave us quick wins that built momentum. Start with obvious waste.

Measure Everything

We tracked cost, performance, and reliability throughout. This prevented optimizations that saved money but hurt user experience.

Architecture Matters

The biggest savings came from architectural changes: caching strategy, database optimization, and service consolidation.

Culture is Critical

Technical changes were necessary but insufficient. Cost awareness had to become part of engineering culture.

Optimization is Ongoing

Costs drift upward without continued attention. We established quarterly reviews to maintain efficiency.

Common Pitfalls to Avoid

Optimizing Too Early

Premature optimization wastes time. Wait until costs justify optimization effort.

Sacrificing Reliability for Cost

We never compromised redundancy, monitoring, or backup strategies. Reliability pays for itself.

One-Time Optimization

Without ongoing attention, costs creep back up. Build optimization into regular processes.

Ignoring Developer Experience

Development environment scheduling saved money but initially frustrated developers. We adjusted based on feedback to balance cost and experience.

Recommendations for Others

If you’re facing similar cost challenges:

1. Establish Baseline Metrics

Before optimizing:

  • Tag all resources comprehensively
  • Monitor utilization for 2+ weeks
  • Document current costs by service and team
  • Measure performance and reliability baselines

2. Create a Cost Optimization Roadmap

Prioritize based on:

  • Potential savings (high impact first)
  • Implementation difficulty (quick wins early)
  • Risk level (low-risk changes first)

3. Implement Gradually

Don’t change everything at once:

  • Make one change at a time
  • Measure impact before proceeding
  • Be ready to roll back if issues arise

4. Communicate Transparently

Keep stakeholders informed:

  • Share weekly progress updates
  • Document savings achieved
  • Explain trade-offs and decisions
  • Celebrate wins with the team

5. Build for Sustainability

Make cost optimization part of normal operations:

  • Include cost in design reviews
  • Add cost metrics to dashboards
  • Review costs in team retrospectives
  • Share cost optimization knowledge

Tools We Used

Cost monitoring:

  • AWS Cost Explorer for analysis
  • CloudHealth for multi-cloud visibility
  • Custom dashboards in Grafana

Resource monitoring:

  • Datadog for metrics and APM
  • CloudWatch for AWS-specific metrics
  • Custom scripts for utilization analysis

Optimization:

  • AWS Compute Optimizer for right-sizing recommendations
  • AWS Trusted Advisor for best practice checks
  • Custom automation for scheduling

Conclusion

Reducing infrastructure costs by 60% while improving performance was challenging but achievable. The key was systematic measurement, architectural improvements, and cultural change.

Cost optimization isn’t a one-time project - it’s an ongoing practice. By building cost awareness into engineering culture and regular processes, we maintained efficiency while continuing to grow.

For teams facing similar challenges, the path is clear: measure everything, start with obvious waste, make data-driven decisions, and build sustainable practices. The results speak for themselves.


Part of our Case Studies series sharing real-world experiences building and operating production systems.

#Case Studies #Cost Optimization #Cloud Infrastructure #AWS #Performance