AI Infrastructure: Insights from Production
Alex Rodriguez leads the ML Infrastructure team at a company serving millions of AI-powered predictions daily. His team built the platform that trains, deploys, and serves dozens of machine learning models powering product features from recommendations to content moderation.
I spoke with Alex about the realities of running AI systems in production - the technical challenges, cost considerations, and lessons learned from two years of operating ML infrastructure at scale.
The Production Reality Check
Ryan: What’s the biggest gap between ML research and ML production?
Alex: The gap is enormous. Research focuses on model accuracy - getting an extra 2% on a benchmark. Production focuses on reliability, latency, cost, and maintainability. These concerns barely exist in research.
A model that’s 95% accurate but takes 2 seconds per prediction is useless for our use cases. We need 90% accuracy at 50ms latency. The trade-off is obvious, but that’s never the conversation in research papers.
Also, research assumes clean data, stable inputs, and single model serving. Production has messy data, adversarial inputs, model versioning, gradual rollouts, A/B testing, and monitoring. It’s a completely different problem space.
What surprised you most about production ML?
How much is just software engineering. Model training is maybe 10% of the work. The other 90% is data pipelines, feature engineering, serving infrastructure, monitoring, and operational excellence.
We spend more time on data quality than model architecture. More time on latency optimization than hyperparameter tuning. More time on cost management than squeezing out accuracy points.
If you’re coming from ML research, you need to become a good software engineer first. The ML skills matter, but they’re not sufficient.
Model Serving Architecture
Walk me through your model serving architecture.
We run on Kubernetes with a custom model serving layer:
Model registry (see the sketch after this list):
- Central storage for trained models
- Version tracking and metadata
- Model lineage and reproducibility
- Approval workflow for production deployment
Serving infrastructure:
- Kubernetes-based deployment
- Horizontal autoscaling based on request rate
- GPU and CPU instances depending on model needs
- Load balancing across replicas
API layer:
- REST and gRPC endpoints
- Request validation and preprocessing
- Model routing (which model for which request)
- Response caching
- Rate limiting
Monitoring:
- Latency (p50, p95, p99)
- Error rates
- Model prediction distribution
- Feature distribution drift
- Cost per prediction
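To make the registry concrete, here is a minimal Python sketch of what a registry entry and approval gate could look like. The field names and helper functions are hypothetical, chosen for illustration rather than taken from Alex's actual system.

```python
# Minimal sketch of a model registry entry; fields and helpers are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)  # frozen: a registered model version is immutable
class ModelVersion:
    name: str                    # e.g. "recommendation-ranker"
    version: str                 # semantic version, e.g. "2.3.1"
    artifact_uri: str            # where the serialized model lives
    training_data_snapshot: str  # lineage: which data snapshot produced it
    code_commit: str             # lineage: which code version trained it
    hyperparameters: tuple       # frozen-friendly, e.g. (("lr", 0.001), ("depth", 6))
    approved_for_prod: bool = False
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# The registry itself is little more than a lookup keyed by (name, version),
# with the approval flag gating what the serving layer may load.
registry: dict = {}

def register(model: ModelVersion) -> None:
    registry[(model.name, model.version)] = model

def latest_approved(name: str) -> Optional[ModelVersion]:
    candidates = [m for (n, _), m in registry.items()
                  if n == name and m.approved_for_prod]
    return max(candidates, key=lambda m: m.created_at) if candidates else None
```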
How do you handle model versioning and rollouts?
This is critical for production ML. Our approach:
Versioning strategy:
- Every model gets a unique version (semantic versioning)
- Models are immutable once deployed
- Metadata tracks training data, code version, hyperparameters
Deployment process:
- Model trained and registered
- Deployed to staging with synthetic traffic
- Shadow mode in production (predictions logged, not served)
- Canary deployment (5% traffic)
- Gradual rollout (25%, 50%, 100%)
- Old version kept for quick rollback
Rollback strategy:
- Instant rollback to previous version via traffic routing
- Keep last 3 versions deployed for rapid rollback
- Automated rollback on error rate spikes
This process takes 2-3 days for a full rollout, but it prevents disasters.
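As a rough illustration of the canary and rollback mechanics described above, here is a small Python sketch of weighted traffic routing with an automated rollback trigger. The class name, weights, and thresholds are illustrative, not the team's actual implementation.

```python
# Sketch of weighted routing between a stable and a canary model version,
# with automated rollback on an error-rate spike. Thresholds are illustrative.
import random

class VersionRouter:
    def __init__(self, stable: str, canary: str, canary_weight: float = 0.05):
        self.stable = stable              # e.g. "2.3.0", serving most traffic
        self.canary = canary              # e.g. "2.4.0", being rolled out
        self.canary_weight = canary_weight
        self.errors = {stable: 0, canary: 0}
        self.requests = {stable: 0, canary: 0}

    def pick_version(self) -> str:
        return self.canary if random.random() < self.canary_weight else self.stable

    def record(self, version: str, ok: bool) -> None:
        self.requests[version] += 1
        if not ok:
            self.errors[version] += 1
        self._maybe_rollback()

    def _maybe_rollback(self, threshold: float = 0.02, min_requests: int = 500) -> None:
        # Automated rollback: if the canary's error rate exceeds the threshold,
        # route all traffic back to the stable version.
        n = self.requests[self.canary]
        if n >= min_requests and self.errors[self.canary] / n > threshold:
            self.canary_weight = 0.0

    def promote(self, weight: float) -> None:
        # Gradual rollout: 0.05 -> 0.25 -> 0.50 -> 1.0
        self.canary_weight = weight
```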
Inference Optimization
What techniques do you use to optimize inference performance?
Latency is our biggest constraint. We’ve used every trick:
Model optimization:
- Quantization: INT8 inference instead of FP32 (4x speedup)
- Pruning: Remove unnecessary model weights
- Distillation: Smaller student models trained from large teachers
- Architecture search: Find efficient architectures for our constraints
Serving optimization:
- Batching: Process multiple predictions together (a sketch follows below)
- Caching: Cache predictions for identical inputs
- Request coalescing: Deduplicate concurrent identical requests
- Feature precomputation: Compute expensive features offline
Infrastructure optimization:
- GPU sharing: Multiple models per GPU
- Model compilation: TensorRT, ONNX Runtime optimizations
- Smart routing: Route requests to nearest/fastest instance
Trade-offs: Every optimization has costs. Quantization reduces accuracy slightly. Batching increases latency for individual requests. Caching can serve stale predictions.
You have to measure the impact and decide what’s acceptable for your use case.
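To show what the batching idea looks like in code, here is a minimal dynamic micro-batcher sketch: requests queue up and are flushed either when the batch fills or when a short timeout expires. `predict_batch`, the batch size, and the timeout are stand-ins, not the team's actual serving code.

```python
# Minimal dynamic micro-batching sketch: flush when the batch is full or a
# short wait expires, whichever comes first. All parameters are illustrative.
import asyncio

class MicroBatcher:
    def __init__(self, predict_batch, max_batch: int = 32, max_wait_ms: float = 5.0):
        self.predict_batch = predict_batch   # stand-in for a batched inference call
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features):
        # Each caller awaits its own future; the batch loop fulfills it.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut

    async def run(self):
        while True:
            batch = [await self.queue.get()]  # block until at least one request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [features for features, _ in batch]
            outputs = self.predict_batch(inputs)  # one batched inference call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
```

A single `run()` task drains the queue while many concurrent `predict()` callers await their individual results, which is how batching raises throughput at the cost of a few milliseconds of added per-request latency.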
Can you give a concrete example?
Sure. Our recommendation model initially took 300ms per prediction. That’s way too slow for a user-facing feature.
What we did:
Step 1: Model distillation
- Trained smaller model (1/10th the size)
- 300ms → 120ms
- Accuracy: 94% → 91% (acceptable trade-off)
Step 2: Quantization
- INT8 inference
- 120ms → 45ms
- Negligible accuracy impact
Step 3: Feature caching
- Precompute user features (they change slowly)
- Only compute item features on demand
- 45ms → 25ms
Step 4: Request batching
- Process 32 predictions at once
- Throughput increased 10x
- p99 latency: 50ms (acceptable)
Result:
- 300ms → 25ms average, 50ms p99
- Throughput increased 40x
- Cost per prediction dropped 80%
- Accuracy: 94% → 91% (still above our threshold)
This took 3 months of engineering, but it made the feature viable.
Cost Management
AI inference is expensive. How do you manage costs?
Cost is a huge concern. At scale, inference costs can exceed training costs by 100x.
Our strategies:
Right-sizing infrastructure:
- Autoscaling based on traffic patterns
- Different instance types for different models
- Spot instances for non-critical workloads
- Reserved capacity for baseline traffic
Model efficiency:
- Smaller models where accuracy permits
- Quantization and optimization
- Sharing embeddings across models
- Knowledge distillation
Caching aggressively:
- Cache predictions for identical inputs
- TTL based on how quickly data changes
- 60% cache hit rate saves significant compute
Smart routing:
- Cheap models for easy cases
- Expensive models only when necessary
- Cascade approach (fast model → slow model if confidence low; see the sketch after this list)
Cost monitoring:
- Cost per prediction tracked per model
- Alerts on cost anomalies
- Regular cost optimization reviews
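The cascade approach is simple enough to sketch in a few lines. The model objects, confidence threshold, and return values here are placeholders for illustration:

```python
# Cascade sketch: try the cheap model first, fall back to the expensive model
# only when confidence is low. Threshold and model objects are placeholders.
def cascade_predict(features, cheap_model, expensive_model,
                    confidence_threshold: float = 0.85):
    label, confidence = cheap_model.predict(features)   # fast, low-cost path
    if confidence >= confidence_threshold:
        return label, confidence, "cheap"
    # Low confidence: pay for the expensive model on this minority of requests.
    label, confidence = expensive_model.predict(features)
    return label, confidence, "expensive"
```

In practice the threshold would be tuned so the expensive model only handles the fraction of traffic where its extra accuracy actually pays for its cost.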
What does your cost structure look like?
Rough breakdown for our inference costs:
- Compute (GPUs/CPUs): 60%
- Data transfer: 15%
- Storage (models, features, logs): 10%
- Monitoring and logging: 10%
- Other (API gateway, load balancers): 5%
GPU compute dominates. Any reduction there has an outsized impact.
We also track cost per prediction per model:
- Simple models: $0.0001 per prediction
- Medium models: $0.001 per prediction
- Large models: $0.01 per prediction
This informs which models we use for which use cases.
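For readers who want to reproduce numbers like these, the back-of-the-envelope arithmetic is straightforward. The rates and utilization figure below are illustrative, not Alex's actual pricing:

```python
# Back-of-the-envelope cost-per-prediction arithmetic (illustrative numbers):
# an instance's hourly price divided by the predictions it serves per hour
# at a sustainable utilization level.
def cost_per_prediction(instance_cost_per_hour: float,
                        predictions_per_second: float,
                        utilization: float = 0.6) -> float:
    predictions_per_hour = predictions_per_second * 3600 * utilization
    return instance_cost_per_hour / predictions_per_hour

# e.g. a $3/hour GPU instance serving 500 predictions/sec at 60% utilization:
# 3 / (500 * 3600 * 0.6) ≈ $0.0000028 per prediction, before data transfer,
# storage, and monitoring overhead are added on top.
```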
Data and Feature Management
How do you handle feature engineering at scale?
Feature engineering is often the bottleneck. Our approach:
Feature store:
- Central repository for features
- Real-time and batch features
- Versioning and lineage tracking
- Point-in-time correct retrieval (see the sketch after this list)
Feature computation:
- Batch features: Spark jobs, updated daily/hourly
- Real-time features: Stream processing (Flink)
- On-demand features: Computed at request time
- Feature caching for expensive computations
Feature reuse:
- Features shared across models
- Reduces duplication and drift
- Improves consistency
Monitoring:
- Feature distribution monitoring
- Drift detection
- Null/missing value tracking
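Point-in-time correct retrieval is the part of the feature store that most often trips people up, so here is a minimal sketch of the idea, assuming feature values arrive as (timestamp, value) pairs:

```python
# Point-in-time lookup sketch: for a training example at event_time, only
# feature values computed at or before that time are eligible, which
# prevents label leakage from the future.
from bisect import bisect_right

def point_in_time_lookup(feature_history: list, event_time: float):
    """feature_history is a list of (computed_at, value) sorted by computed_at."""
    timestamps = [t for t, _ in feature_history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None                        # no value existed yet at event_time
    return feature_history[idx - 1][1]     # latest value not newer than the event

# e.g. history [(100, 0.2), (200, 0.5)] with event_time 150 returns 0.2,
# even though 0.5 is the "current" value at serving time.
```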
What about data quality issues?
Data quality is the silent killer of ML systems. Models are only as good as their data.
Common issues we see:
Missing data:
- Upstream service outage → missing features
- Model needs to handle gracefully
- We use fallback values and confidence degradation
Distribution shift:
- User behavior changes
- Product changes alter data characteristics
- Regular retraining mitigates but doesn’t eliminate
Data errors:
- Bugs in upstream systems
- Schema changes breaking pipelines
- Type mismatches
Adversarial inputs:
- Users gaming the system
- Bots and spam
- Requires ongoing adversarial training
Our approach:
- Comprehensive data validation
- Monitoring for distribution drift
- Automated data quality checks
- Regular model retraining
- Human review of predictions for critical use cases
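A data validation layer does not need to be elaborate to catch most of the issues listed above. Here is a small sketch of per-feature checks; the schema format, feature names, and thresholds are made up for illustration:

```python
# Sketch of per-feature validation run before features reach a model.
# Schema format and bounds are illustrative.
def validate_features(row: dict, schema: dict) -> list:
    """schema maps feature name -> (expected_type, (min, max) or None)."""
    problems = []
    for name, (expected_type, bounds) in schema.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing")  # missing data -> fallback value path
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}, "
                            f"got {type(value).__name__}")
            continue
        if bounds is not None:
            lo, hi = bounds
            if not (lo <= value <= hi):
                problems.append(f"{name}: {value} outside [{lo}, {hi}]")
    return problems

# Example: validate_features({"age_days": -3}, {"age_days": (int, (0, 36500))})
# returns ["age_days: -3 outside [0, 36500]"].
```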
Monitoring and Debugging
How do you monitor ML systems in production?
Monitoring ML systems is different from monitoring traditional software:
Traditional metrics:
- Latency (p50, p95, p99)
- Error rate
- Request rate
- Resource utilization
ML-specific metrics:
- Prediction distribution
- Feature distribution
- Model confidence scores
- Accuracy on labeled samples
- Data drift scores
Business metrics:
- User engagement with predictions
- Conversion rates
- Revenue impact
- User satisfaction
We alert on anomalies in all three categories.
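One widely used drift score is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against what the model sees in production. This is a generic formulation, not necessarily the exact metric Alex's team computes:

```python
# Population Stability Index between a baseline (training-time) distribution
# and the live serving distribution, both expressed as per-bucket fractions.
import math

def psi(baseline_fractions: list, live_fractions: list, eps: float = 1e-6) -> float:
    """Inputs are per-bucket fractions over the same buckets, each summing to ~1."""
    score = 0.0
    for expected, actual in zip(baseline_fractions, live_fractions):
        e, a = max(expected, eps), max(actual, eps)   # avoid log(0)
        score += (a - e) * math.log(a / e)
    return score

# A common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift,
# > 0.25 significant shift worth investigating.
```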
How do you debug ML systems when things go wrong?
Debugging ML is harder than debugging traditional code:
No stack traces: Model silently produces bad predictions
Non-deterministic: Same inputs can produce different outputs (dropout, sampling)
Emergent behavior: Issues arise from data patterns, not code bugs
Our debugging approach:
Step 1: Identify the issue
- Monitoring alerts or user reports
- Gather examples of bad predictions
- Quantify the impact
Step 2: Isolate the cause
- Is it the model, features, or infrastructure?
- Check feature distributions
- Review prediction patterns
- Look for data issues
Step 3: Reproduce
- Recreate the issue in staging
- Verify hypotheses
- Test potential fixes
Step 4: Fix
- Might be model retrain
- Might be feature fix
- Might be data quality issue
- Might be infrastructure problem
Step 5: Prevent recurrence
- Add monitoring for this failure mode
- Improve data validation
- Document the incident
Can you give an example of a challenging debugging session?
We had a content moderation model that suddenly started flagging innocent content at 3x the normal rate.
Investigation:
First hypothesis: Model bug
- Checked: Model version hadn’t changed
- Not the model
Second hypothesis: Data drift
- Checked: Feature distributions looked normal
- Not obvious drift
Third hypothesis: Infrastructure issue
- Checked: Latency, error rate normal
- Not infrastructure
Actual cause: Upstream service changed emoji encoding. Our feature extraction broke for posts with certain emojis. Model saw garbled text, triggered moderation.
Fix: Updated feature extraction to handle new encoding. Added validation to catch encoding issues.
Prevention:
- Monitoring for feature extraction errors
- Validation of upstream data formats
- Integration tests with upstream services
This took 6 hours to debug. The lesson: ML bugs can hide in unexpected places.
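The encoding validation added after this incident could be as simple as flagging payloads whose decoding produced Unicode replacement characters before they reach feature extraction. This is a sketch of the idea, not the actual fix:

```python
# Sketch of an encoding sanity check: flag text whose decoding produced
# replacement characters instead of silently feeding garbled text to the model.
REPLACEMENT_CHAR = "\ufffd"

def looks_garbled(text: str, max_replacement_ratio: float = 0.01) -> bool:
    if not text:
        return False
    return text.count(REPLACEMENT_CHAR) / len(text) > max_replacement_ratio

def safe_decode(raw: bytes) -> str:
    decoded = raw.decode("utf-8", errors="replace")
    if looks_garbled(decoded):
        # Surface an alert/metric rather than passing garbled text downstream.
        raise ValueError("suspicious encoding in upstream payload")
    return decoded
```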
Training and Retraining
How do you approach model retraining?
Models degrade over time as data distributions shift. Regular retraining is essential.
Our strategy:
Frequency:
- Critical models: Weekly
- Important models: Monthly
- Stable models: Quarterly
Process:
- Gather new training data
- Train new model version
- Evaluate against holdout set
- Compare to production model (see the gate sketch after this list)
- Deploy via standard rollout process
Automated vs. manual:
- Non-critical models: Fully automated
- Critical models: Human approval required
- All models: Automated monitoring for issues
Challenges:
- Training data labeling
- Label quality and consistency
- Concept drift in labels
- Bias in new data
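The "compare to production model" step is usually a small gate function. Here is a sketch, with hypothetical metric names and tolerances:

```python
# Promotion gate sketch: a candidate must improve accuracy without regressing
# latency beyond a tolerance before it enters the rollout pipeline.
def should_promote(candidate_metrics: dict, production_metrics: dict,
                   min_gain: float = 0.0,
                   max_latency_regression_ms: float = 5.0) -> bool:
    accuracy_gain = candidate_metrics["accuracy"] - production_metrics["accuracy"]
    latency_regression = (candidate_metrics["p99_latency_ms"]
                          - production_metrics["p99_latency_ms"])
    return accuracy_gain >= min_gain and latency_regression <= max_latency_regression_ms

# e.g. a candidate at 91.4% accuracy / 48ms p99 vs. production at 91.1% / 47ms
# passes; a candidate that gains 0.3% accuracy but adds 20ms of p99 latency does not.
```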
How do you handle training infrastructure?
Training is compute-intensive. Our approach:
Infrastructure:
- Kubernetes-based training platform
- GPU clusters (V100s and A100s)
- Spot instances for cost savings
- Distributed training for large models
Orchestration:
- Kubeflow for pipeline management
- MLflow for experiment tracking
- DVC for data versioning
- Weights & Biases for monitoring
Cost optimization:
- Spot instances (70% cost savings)
- Training during off-peak hours
- Efficient hyperparameter search
- Early stopping for poor models
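Early stopping is the simplest of these savings to show in code: stop a run once validation loss has failed to improve for a set number of epochs. The patience and delta values below are illustrative:

```python
# Minimal early-stopping sketch for cutting off training runs that stop
# improving, one of the cheaper ways to save GPU hours.
class EarlyStopper:
    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, validation_loss: float) -> bool:
        if validation_loss < self.best - self.min_delta:
            self.best = validation_loss    # meaningful improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```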
Lessons Learned
What’s your biggest lesson from two years of production ML?
Start simple. Don’t use deep learning when logistic regression works. Don’t use GPUs when CPUs are fine. Don’t build custom infrastructure when existing tools work.
Every layer of complexity is an ongoing maintenance burden. Every dependency is a potential failure point. The simplest solution that meets requirements is usually best.
We’ve replaced complex deep learning models with simpler models multiple times. The complex model was about 2% more accurate but 10x more expensive and 5x harder to maintain. Not worth it.
What would you do differently if starting over?
Invest in infrastructure earlier. We initially focused on model accuracy, treating infrastructure as an afterthought. This created technical debt that took months to pay down.
If starting over, I’d:
- Build proper feature store from day one
- Invest in monitoring and observability early
- Establish data quality checks immediately
- Create standard deployment pipelines upfront
- Set up cost tracking from the beginning
The cost of retrofitting these is much higher than building them correctly initially.
What advice do you have for companies starting ML in production?
Don’t start with ML. Seriously. Solve the problem without ML first. Many problems don’t need ML - rules, heuristics, or simple statistics often work fine.
If you need ML:
- Start with the simplest model that could work
- Measure business impact, not just accuracy
- Invest in data quality and pipelines
- Build monitoring and observability
- Plan for retraining and versioning
- Budget for ongoing costs
Common mistakes to avoid:
- Using latest research models in production
- Neglecting data quality
- Underestimating inference costs
- Skipping proper monitoring
- Not planning for model degradation
The Future of ML Infrastructure
Where is ML infrastructure headed?
Specialization: General-purpose infrastructure is giving way to specialized systems for different ML workloads. Image models need different infrastructure than LLMs.
Edge deployment: More inference moving to edge devices. Privacy, latency, and cost drive this. Infrastructure must support edge deployment.
Automated MLOps: Manual processes become automated. AutoML extends to AutoMLOps - automated training, deployment, monitoring, and retraining.
Cost optimization: As ML becomes ubiquitous, cost pressure increases. Infrastructure that optimizes cost while maintaining quality wins.
Observability: Understanding model behavior becomes more sophisticated. Explainability and interpretability tools become standard.
What are you most excited about?
I’m excited about infrastructure that makes ML accessible to more engineers. Right now, production ML requires specialized expertise. Better abstractions will democratize it.
Imagine deploying a model as easily as deploying a REST API. Monitoring, scaling, versioning, and cost optimization all handled automatically. Engineers focus on business logic, not infrastructure.
We’re not there yet, but we’re getting closer. That’s the future I’m working toward.
Final Thoughts
Any closing advice?
Production ML is more engineering than ML. Focus on fundamentals: data quality, monitoring, cost management, and operational excellence. The fancy model matters less than you think.
Be humble. Production will humble you quickly. Models that work great offline fail in production. The best accuracy on a benchmark doesn’t mean the best business outcome.
And invest in your infrastructure. It’s the foundation everything else is built on. Cutting corners here creates technical debt that compounds over time.
Thank you to Alex Rodriguez for sharing insights from running ML infrastructure at scale. His experience reflects the reality of production ML systems beyond research papers and demos.