AI Infrastructure: Insights from Production
Alex Rodriguez leads the ML Infrastructure team at a company serving millions of AI-powered predictions daily. His team built the platform that trains, deploys, and serves dozens of machine learning models powering product features from recommendations to content moderation.
I spoke with Alex about the realities of running AI systems in production - the technical challenges, cost considerations, and lessons learned from two years of operating ML infrastructure at scale.
The Production Reality Check
Ryan: What’s the biggest gap between ML research and ML production?
Alex: The gap is enormous. Research focuses on model accuracy - getting an extra 2% on a benchmark. Production focuses on reliability, latency, cost, and maintainability. These concerns barely exist in research.
A model that’s 95% accurate but takes 2 seconds per prediction is useless for our use cases. We need 90% accuracy at 50ms latency. The trade-off is obvious, but that’s never the conversation in research papers.
Also, research assumes clean data, stable inputs, and single model serving. Production has messy data, adversarial inputs, model versioning, gradual rollouts, A/B testing, and monitoring. It’s a completely different problem space.
What surprised you most about production ML?
How much is just software engineering. Model training is maybe 10% of the work. The other 90% is data pipelines, feature engineering, serving infrastructure, monitoring, and operational excellence.
We spend more time on data quality than model architecture. More time on latency optimization than hyperparameter tuning. More time on cost management than squeezing out accuracy points.
If you’re coming from ML research, you need to become a good software engineer first. The ML skills matter, but they’re not sufficient.
Model Serving Architecture
Walk me through your model serving architecture.
We run on Kubernetes with a custom model serving layer:
Model registry (see the sketch after this list):
- Central storage for trained models
- Version tracking and metadata
- Model lineage and reproducibility
- Approval workflow for production deployment
Serving infrastructure:
- Kubernetes-based deployment
- Horizontal autoscaling based on request rate
- GPU and CPU instances depending on model needs
- Load balancing across replicas
API layer:
- REST and gRPC endpoints
- Request validation and preprocessing
- Model routing (which model for which request)
- Response caching
- Rate limiting
Monitoring:
- Latency (p50, p95, p99)
- Error rates
- Model prediction distribution
- Feature distribution drift
- Cost per prediction
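To make the registry concrete, here is a minimal Python sketch of what a registry entry and approval gate could look like. The field names and helper functions are hypothetical, chosen for illustration rather than taken from Alex's actual system.

```python
# Minimal sketch of a model registry entry; fields and helpers are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)  # frozen: a registered model version is immutable
class ModelVersion:
    name: str                    # e.g. "recommendation-ranker"
    version: str                 # semantic version, e.g. "2.3.1"
    artifact_uri: str            # where the serialized model lives
    training_data_snapshot: str  # lineage: which data snapshot produced it
    code_commit: str             # lineage: which code version trained it
    hyperparameters: tuple       # frozen-friendly, e.g. (("lr", 0.001), ("depth", 6))
    approved_for_prod: bool = False
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# The registry itself is little more than a lookup keyed by (name, version),
# with the approval flag gating what the serving layer may load.
registry: dict = {}

def register(model: ModelVersion) -> None:
    registry[(model.name, model.version)] = model

def latest_approved(name: str) -> Optional[ModelVersion]:
    candidates = [m for (n, _), m in registry.items()
                  if n == name and m.approved_for_prod]
    return max(candidates, key=lambda m: m.created_at) if candidates else None
```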
How do you handle model versioning and rollouts?
This is critical for production ML. Our approach:
Versioning strategy:
- Every model gets a unique version (semantic versioning)
- Models are immutable once deployed
- Metadata tracks training data, code version, hyperparameters
Deployment process:
- Model trained and registered
- Deployed to staging with synthetic traffic
- Shadow mode in production (predictions logged, not served)
- Canary deployment (5% traffic)
- Gradual rollout (25%, 50%, 100%)
- Old version kept for quick rollback
Rollback strategy:
- Instant rollback to previous version via traffic routing
- Keep last 3 versions deployed for rapid rollback
- Automated rollback on error rate spikes
This process takes 2-3 days for a full rollout, but it prevents disasters.
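As a rough illustration of the canary and rollback mechanics described above, here is a small Python sketch of weighted traffic routing with an automated rollback trigger. The class name, weights, and thresholds are illustrative, not the team's actual implementation.

```python
# Sketch of weighted routing between a stable and a canary model version,
# with automated rollback on an error-rate spike. Thresholds are illustrative.
import random

class VersionRouter:
    def __init__(self, stable: str, canary: str, canary_weight: float = 0.05):
        self.stable = stable              # e.g. "2.3.0", serving most traffic
        self.canary = canary              # e.g. "2.4.0", being rolled out
        self.canary_weight = canary_weight
        self.errors = {stable: 0, canary: 0}
        self.requests = {stable: 0, canary: 0}

    def pick_version(self) -> str:
        return self.canary if random.random() < self.canary_weight else self.stable

    def record(self, version: str, ok: bool) -> None:
        self.requests[version] += 1
        if not ok:
            self.errors[version] += 1
        self._maybe_rollback()

    def _maybe_rollback(self, threshold: float = 0.02, min_requests: int = 500) -> None:
        # Automated rollback: if the canary's error rate exceeds the threshold,
        # route all traffic back to the stable version.
        n = self.requests[self.canary]
        if n >= min_requests and self.errors[self.canary] / n > threshold:
            self.canary_weight = 0.0

    def promote(self, weight: float) -> None:
        # Gradual rollout: 0.05 -> 0.25 -> 0.50 -> 1.0
        self.canary_weight = weight
```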
Inference Optimization
What techniques do you use to optimize inference performance?
Latency is our biggest constraint. We’ve used every trick:
Model optimization:
- Quantization: INT8 inference instead of FP32 (4x speedup)
- Pruning: Remove unnecessary model weights
- Distillation: Smaller student models trained from large teachers
- Architecture search: Find efficient architectures for our constraints
Serving optimization:
- Batching: Process multiple predictions together (a sketch follows below)
- Caching: Cache predictions for identical inputs
- Request coalescing: Deduplicate concurrent identical requests
- Feature precomputation: Compute expensive features offline
Infrastructure optimization:
- GPU sharing: Multiple models per GPU
- Model compilation: TensorRT, ONNX Runtime optimizations
- Smart routing: Route requests to nearest/fastest instance
Trade-offs: Every optimization has costs. Quantization reduces accuracy slightly. Batching increases latency for individual requests. Caching can serve stale predictions.
You have to measure the impact and decide what’s acceptable for your use case.
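To show what the batching idea looks like in code, here is a minimal dynamic micro-batcher sketch: requests queue up and are flushed either when the batch fills or when a short timeout expires. `predict_batch`, the batch size, and the timeout are stand-ins, not the team's actual serving code.

```python
# Minimal dynamic micro-batching sketch: flush when the batch is full or a
# short wait expires, whichever comes first. All parameters are illustrative.
import asyncio

class MicroBatcher:
    def __init__(self, predict_batch, max_batch: int = 32, max_wait_ms: float = 5.0):
        self.predict_batch = predict_batch   # stand-in for a batched inference call
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features):
        # Each caller awaits its own future; the batch loop fulfills it.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut

    async def run(self):
        while True:
            batch = [await self.queue.get()]  # block until at least one request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [features for features, _ in batch]
            outputs = self.predict_batch(inputs)  # one batched inference call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
```

A single `run()` task drains the queue while many concurrent `predict()` callers await their individual results, which is how batching raises throughput at the cost of a few milliseconds of added per-request latency.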
Can you give a concrete example?
Sure. Our recommendation model initially took 300ms per prediction. That’s way too slow for a user-facing feature.
What we did:
Step 1: Model distillation
- Trained smaller model (1/10th the size)
- 300ms → 120ms
- Accuracy: 94% → 91% (acceptable trade-off)
Step 2: Quantization
- INT8 inference
- 120ms → 45ms
- Negligible accuracy impact
Step 3: Feature caching
- Precompute user features (they change slowly)
- Only compute item features on demand
- 45ms → 25ms
Step 4: Request batching
- Process 32 predictions at once
- Throughput increased 10x
- p99 latency: 50ms (acceptable)
Result:
- 300ms → 25ms average, 50ms p99
- Throughput increased 40x
- Cost per prediction dropped 80%
- Accuracy: 94% → 91% (still above our threshold)
This took 3 months of engineering, but it made the feature viable.
Cost Management
AI inference is expensive. How do you manage costs?
Cost is a huge concern. At scale, inference costs can exceed training costs by 100x.
Our strategies:
Right-sizing infrastructure:
- Autoscaling based on traffic patterns
- Different instance types for different models
- Spot instances for non-critical workloads
- Reserved capacity for baseline traffic
Model efficiency:
- Smaller models where accuracy permits
- Quantization and optimization
- Sharing embeddings across models
- Knowledge distillation
Caching aggressively:
- Cache predictions for identical inputs
- TTL based on how quickly data changes
- 60% cache hit rate saves significant compute
Smart routing:
- Cheap models for easy cases
- Expensive models only when necessary
- Cascade approach (fast model → slow model if confidence low; see the sketch after this list)
Cost monitoring:
- Cost per prediction tracked per model
- Alerts on cost anomalies
- Regular cost optimization reviews
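The cascade approach is simple enough to sketch in a few lines. The model objects, confidence threshold, and return values here are placeholders for illustration:

```python
# Cascade sketch: try the cheap model first, fall back to the expensive model
# only when confidence is low. Threshold and model objects are placeholders.
def cascade_predict(features, cheap_model, expensive_model,
                    confidence_threshold: float = 0.85):
    label, confidence = cheap_model.predict(features)   # fast, low-cost path
    if confidence >= confidence_threshold:
        return label, confidence, "cheap"
    # Low confidence: pay for the expensive model on this minority of requests.
    label, confidence = expensive_model.predict(features)
    return label, confidence, "expensive"
```

In practice the threshold would be tuned so the expensive model only handles the fraction of traffic where its extra accuracy actually pays for its cost.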
What does your cost structure look like?
Rough breakdown for our inference costs:
- Compute (GPUs/CPUs): 60%
- Data transfer: 15%
- Storage (models, features, logs): 10%
- Monitoring and logging: 10%
- Other (API gateway, load balancers): 5%
GPU compute dominates. Any reduction there has an outsized impact.
We also track cost per prediction per model:
- Simple models: $0.0001 per prediction
- Medium models: $0.001 per prediction
- Large models: $0.01 per prediction
This informs which models we use for which use cases.
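For readers who want to reproduce numbers like these, the back-of-the-envelope arithmetic is straightforward. The rates and utilization figure below are illustrative, not Alex's actual pricing:

```python
# Back-of-the-envelope cost-per-prediction arithmetic (illustrative numbers):
# an instance's hourly price divided by the predictions it serves per hour
# at a sustainable utilization level.
def cost_per_prediction(instance_cost_per_hour: float,
                        predictions_per_second: float,
                        utilization: float = 0.6) -> float:
    predictions_per_hour = predictions_per_second * 3600 * utilization
    return instance_cost_per_hour / predictions_per_hour

# e.g. a $3/hour GPU instance serving 500 predictions/sec at 60% utilization:
# 3 / (500 * 3600 * 0.6) ≈ $0.0000028 per prediction, before data transfer,
# storage, and monitoring overhead are added on top.
```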
Data and Feature Management
How do you handle feature engineering at scale?
Feature engineering is often the bottleneck. Our approach:
Feature store:
- Central repository for features
- Real-time and batch features
- Versioning and lineage tracking
- Point-in-time correct retrieval (see the sketch after this list)
Feature computation:
- Batch features: Spark jobs, updated daily/hourly
- Real-time features: Stream processing (Flink)
- On-demand features: Computed at request time
- Feature caching for expensive computations
Feature reuse:
- Features shared across models
- Reduces duplication and drift
- Improves consistency
Monitoring:
- Feature distribution monitoring
- Drift detection
- Null/missing value tracking
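Point-in-time correct retrieval is the part of the feature store that most often trips people up, so here is a minimal sketch of the idea, assuming feature values arrive as (timestamp, value) pairs:

```python
# Point-in-time lookup sketch: for a training example at event_time, only
# feature values computed at or before that time are eligible, which
# prevents label leakage from the future.
from bisect import bisect_right

def point_in_time_lookup(feature_history: list, event_time: float):
    """feature_history is a list of (computed_at, value) sorted by computed_at."""
    timestamps = [t for t, _ in feature_history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None                        # no value existed yet at event_time
    return feature_history[idx - 1][1]     # latest value not newer than the event

# e.g. history [(100, 0.2), (200, 0.5)] with event_time 150 returns 0.2,
# even though 0.5 is the "current" value at serving time.
```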
What about data quality issues?
Data quality is the silent killer of ML systems. Models are only as good as their data.
Common issues we see:
Missing data:
- Upstream service outage → missing features
- Model needs to handle gracefully
- We use fallback values and confidence degradation
Distribution shift:
- User behavior changes
- Product changes alter data characteristics
- Regular retraining mitigates but doesn’t eliminate
Data errors:
- Bugs in upstream systems
- Schema changes breaking pipelines
- Type mismatches
Adversarial inputs:
- Users gaming the system
- Bots and spam
- Requires ongoing adversarial training
Our approach:
- Comprehensive data validation
- Monitoring for distribution drift
- Automated data quality checks
- Regular model retraining
- Human review of predictions for critical use cases
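A data validation layer does not need to be elaborate to catch most of the issues listed above. Here is a small sketch of per-feature checks; the schema format, feature names, and thresholds are made up for illustration:

```python
# Sketch of per-feature validation run before features reach a model.
# Schema format and bounds are illustrative.
def validate_features(row: dict, schema: dict) -> list:
    """schema maps feature name -> (expected_type, (min, max) or None)."""
    problems = []
    for name, (expected_type, bounds) in schema.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing")  # missing data -> fallback value path
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}, "
                            f"got {type(value).__name__}")
            continue
        if bounds is not None:
            lo, hi = bounds
            if not (lo <= value <= hi):
                problems.append(f"{name}: {value} outside [{lo}, {hi}]")
    return problems

# Example: validate_features({"age_days": -3}, {"age_days": (int, (0, 36500))})
# returns ["age_days: -3 outside [0, 36500]"].
```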
Monitoring and Debugging
How do you monitor ML systems in production?
Monitoring ML systems is different from monitoring traditional software:
Traditional metrics:
- Latency (p50, p95, p99)
- Error rate
- Request rate
- Resource utilization
ML-specific metrics:
- Prediction distribution
- Feature distribution
- Model confidence scores
- Accuracy on labeled samples
- Data drift scores
Business metrics:
- User engagement with predictions
- Conversion rates
- Revenue impact
- User satisfaction
We alert on anomalies in all three categories.
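One widely used drift score is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against what the model sees in production. This is a generic formulation, not necessarily the exact metric Alex's team computes:

```python
# Population Stability Index between a baseline (training-time) distribution
# and the live serving distribution, both expressed as per-bucket fractions.
import math

def psi(baseline_fractions: list, live_fractions: list, eps: float = 1e-6) -> float:
    """Inputs are per-bucket fractions over the same buckets, each summing to ~1."""
    score = 0.0
    for expected, actual in zip(baseline_fractions, live_fractions):
        e, a = max(expected, eps), max(actual, eps)   # avoid log(0)
        score += (a - e) * math.log(a / e)
    return score

# A common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift,
# > 0.25 significant shift worth investigating.
```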
How do you debug ML systems when things go wrong?
Debugging ML is harder than debugging traditional code:
No stack traces: Model silently produces bad predictions
Non-deterministic: Same inputs can produce different outputs (dropout, sampling)
Emergent behavior: Issues arise from data patterns, not code bugs
Our debugging approach:
Step 1: Identify the issue
- Monitoring alerts or user reports
- Gather examples of bad predictions
- Quantify the impact
Step 2: Isolate the cause
- Is it the model, features, or infrastructure?
- Check feature distributions
- Review prediction patterns
- Look for data issues
Step 3: Reproduce
- Recreate the issue in staging
- Verify hypotheses
- Test potential fixes
Step 4: Fix
- Might be model retrain
- Might be feature fix
- Might be data quality issue
- Might be infrastructure problem
Step 5: Prevent recurrence
- Add monitoring for this failure mode
- Improve data validation
- Document the incident
Can you give an example of a challenging debugging session?
We had a content moderation model that suddenly started flagging innocent content at 3x the normal rate.
Investigation:
First hypothesis: Model bug
- Checked: Model version hadn’t changed
- Not the model
Second hypothesis: Data drift
- Checked: Feature distributions looked normal
- Not obvious drift
Third hypothesis: Infrastructure issue
- Checked: Latency, error rate normal
- Not infrastructure
Actual cause: Upstream service changed emoji encoding. Our feature extraction broke for posts with certain emojis. Model saw garbled text, triggered moderation.
Fix: Updated feature extraction to handle new encoding. Added validation to catch encoding issues.
Prevention:
- Monitoring for feature extraction errors
- Validation of upstream data formats
- Integration tests with upstream services
This took 6 hours to debug. The lesson: ML bugs can hide in unexpected places.
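The encoding validation added after this incident could be as simple as flagging payloads whose decoding produced Unicode replacement characters before they reach feature extraction. This is a sketch of the idea, not the actual fix:

```python
# Sketch of an encoding sanity check: flag text whose decoding produced
# replacement characters instead of silently feeding garbled text to the model.
REPLACEMENT_CHAR = "\ufffd"

def looks_garbled(text: str, max_replacement_ratio: float = 0.01) -> bool:
    if not text:
        return False
    return text.count(REPLACEMENT_CHAR) / len(text) > max_replacement_ratio

def safe_decode(raw: bytes) -> str:
    decoded = raw.decode("utf-8", errors="replace")
    if looks_garbled(decoded):
        # Surface an alert/metric rather than passing garbled text downstream.
        raise ValueError("suspicious encoding in upstream payload")
    return decoded
```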
Training and Retraining
How do you approach model retraining?
Models degrade over time as data distributions shift. Regular retraining is essential.
Our strategy:
Frequency:
- Critical models: Weekly
- Important models: Monthly
- Stable models: Quarterly
Process:
- Gather new training data
- Train new model version
- Evaluate against holdout set
- Compare to production model (see the gate sketch after this list)
- Deploy via standard rollout process
Automated vs. manual:
- Non-critical models: Fully automated
- Critical models: Human approval required
- All models: Automated monitoring for issues
Challenges:
- Training data labeling
- Label quality and consistency
- Concept drift in labels
- Bias in new data
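The "compare to production model" step is usually a small gate function. Here is a sketch, with hypothetical metric names and tolerances:

```python
# Promotion gate sketch: a candidate must improve accuracy without regressing
# latency beyond a tolerance before it enters the rollout pipeline.
def should_promote(candidate_metrics: dict, production_metrics: dict,
                   min_gain: float = 0.0,
                   max_latency_regression_ms: float = 5.0) -> bool:
    accuracy_gain = candidate_metrics["accuracy"] - production_metrics["accuracy"]
    latency_regression = (candidate_metrics["p99_latency_ms"]
                          - production_metrics["p99_latency_ms"])
    return accuracy_gain >= min_gain and latency_regression <= max_latency_regression_ms

# e.g. a candidate at 91.4% accuracy / 48ms p99 vs. production at 91.1% / 47ms
# passes; a candidate that gains 0.3% accuracy but adds 20ms of p99 latency does not.
```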
How do you handle training infrastructure?
Training is compute-intensive. Our approach:
Infrastructure:
- Kubernetes-based training platform
- GPU clusters (V100s and A100s)
- Spot instances for cost savings
- Distributed training for large models
Orchestration:
- Kubeflow for pipeline management
- MLflow for experiment tracking
- DVC for data versioning
- Weights & Biases for monitoring
Cost optimization:
- Spot instances (70% cost savings)
- Training during off-peak hours
- Efficient hyperparameter search
- Early stopping for poor models
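Early stopping is the simplest of these savings to show in code: stop a run once validation loss has failed to improve for a set number of epochs. The patience and delta values below are illustrative:

```python
# Minimal early-stopping sketch for cutting off training runs that stop
# improving, one of the cheaper ways to save GPU hours.
class EarlyStopper:
    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, validation_loss: float) -> bool:
        if validation_loss < self.best - self.min_delta:
            self.best = validation_loss    # meaningful improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```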
Lessons Learned
What’s your biggest lesson from two years of production ML?
Start simple. Don’t use deep learning when logistic regression works. Don’t use GPUs when CPUs are fine. Don’t build custom infrastructure when existing tools work.
Every layer of complexity is an ongoing maintenance burden. Every dependency is a potential failure point. The simplest solution that meets requirements is usually best.
We’ve replaced complex deep learning models with simpler models multiple times. The complex model was about 2% more accurate but 10x more expensive and 5x harder to maintain. Not worth it.
What would you do differently if starting over?
Invest in infrastructure earlier. We initially focused on model accuracy, treating infrastructure as an afterthought. This created technical debt that took months to pay down.
If starting over, I’d:
- Build proper feature store from day one
- Invest in monitoring and observability early
- Establish data quality checks immediately
- Create standard deployment pipelines upfront
- Set up cost tracking from the beginning
The cost of retrofitting these is much higher than building them correctly initially.
What advice do you have for companies starting ML in production?
Don’t start with ML. Seriously. Solve the problem without ML first. Many problems don’t need ML - rules, heuristics, or simple statistics often work fine.
If you need ML:
- Start with the simplest model that could work
- Measure business impact, not just accuracy
- Invest in data quality and pipelines
- Build monitoring and observability
- Plan for retraining and versioning
- Budget for ongoing costs
Common mistakes to avoid:
- Using latest research models in production
- Neglecting data quality
- Underestimating inference costs
- Skipping proper monitoring
- Not planning for model degradation
The Future of ML Infrastructure
Where is ML infrastructure headed?
Specialization: General-purpose infrastructure is giving way to specialized systems for different ML workloads. Image models need different infrastructure than LLMs.
Edge deployment: More inference moving to edge devices. Privacy, latency, and cost drive this. Infrastructure must support edge deployment.
Automated MLOps: Manual processes become automated. AutoML extends to AutoMLOps - automated training, deployment, monitoring, and retraining.
Cost optimization: As ML becomes ubiquitous, cost pressure increases. Infrastructure that optimizes cost while maintaining quality wins.
Observability: Understanding model behavior becomes more sophisticated. Explainability and interpretability tools become standard.
What are you most excited about?
I’m excited about infrastructure that makes ML accessible to more engineers. Right now, production ML requires specialized expertise. Better abstractions will democratize it.
Imagine deploying a model as easily as deploying a REST API. Monitoring, scaling, versioning, and cost optimization all handled automatically. Engineers focus on business logic, not infrastructure.
We’re not there yet, but we’re getting closer. That’s the future I’m working toward.
Final Thoughts
Any closing advice?
Production ML is more engineering than ML. Focus on fundamentals: data quality, monitoring, cost management, and operational excellence. The fancy model matters less than you think.
Be humble. Production will humble you quickly. Models that work great offline fail in production. The best accuracy on a benchmark doesn’t mean the best business outcome.
And invest in your infrastructure. It’s the foundation everything else is built on. Cutting corners here creates technical debt that compounds over time.
Thank you to Alex Rodriguez for sharing insights from running ML infrastructure at scale. His experience reflects the reality of production ML systems beyond research papers and demos.