Machine Learning Model Deployment Patterns Study
Machine learning models deliver value only when deployed in production. Yet deployment patterns vary dramatically based on latency requirements, scale, cost constraints, and operational considerations.
We studied ML deployment patterns across 50 companies (from startups to enterprises) and 200+ production models. This research identifies common deployment patterns, their trade-offs, and guidance for choosing the right approach.
Research Methodology
Data Collection
Interviews:
- 50 companies across industries
- 75 ML engineers and infrastructure leads
- 30-60 minute structured interviews
- Questions about architecture, challenges, costs
System Analysis:
- Architectural diagrams
- Performance metrics
- Cost data
- Operational metrics
Use Cases Covered:
- Recommendation systems
- Content moderation
- Fraud detection
- Image and video processing
- Natural language processing
- Search and ranking
- Predictive analytics
Key Questions
Our research examined:
- What deployment patterns are most common?
- What factors drive pattern selection?
- What are typical performance characteristics?
- What operational challenges exist?
- What are cost structures?
- How do patterns evolve as companies scale?
Pattern 1: Batch Inference
Description
Models run on scheduled intervals, processing batches of data and storing predictions for later retrieval.
Architecture:
Scheduled Job → Load Data → Batch Inference → Store Predictions → Application Reads
When Used
Ideal for:
- Recommendations precomputed nightly
- Analytics and reporting
- Risk scoring (daily credit checks)
- Email campaign targeting
- Non-time-sensitive predictions
Requirements:
- Predictions can be stale (hours to days)
- Predictable workload patterns
- Cost sensitivity
Implementation Patterns
Simple batch:
- Cron job or orchestration tool (Airflow)
- Load data from database or data lake
- Run inference on entire dataset
- Store results in database
- Application queries cached predictions
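As a concrete illustration, here is a minimal sketch of the simple batch flow, assuming a scikit-learn-style model serialized with joblib and parquet files as the data source and sink; the paths and column names are illustrative, not drawn from the study.

```python
# batch_score.py -- run nightly via cron or an Airflow task
import joblib
import pandas as pd

MODEL_PATH = "models/recommender.joblib"        # illustrative paths
INPUT_PATH = "data/users_features.parquet"
OUTPUT_PATH = "data/predictions_daily.parquet"

def run_batch_job():
    # Load the trained model once per job.
    model = joblib.load(MODEL_PATH)

    # Pull the full feature set for this run from the data lake / database.
    features = pd.read_parquet(INPUT_PATH)

    # Score the entire dataset in one pass (chunk it if it exceeds memory).
    features["score"] = model.predict(features.drop(columns=["user_id"]))

    # Persist predictions; the application later reads this table/file.
    features[["user_id", "score"]].to_parquet(OUTPUT_PATH, index=False)

if __name__ == "__main__":
    run_batch_job()
```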
Distributed batch:
- Spark or distributed processing framework
- Partition data across workers
- Parallel inference
- Aggregate and store results
- Scales to billions of predictions
Performance Characteristics
From our study:
- Latency: N/A (predictions precomputed)
- Throughput: 1K-1M predictions per batch job
- Cost: $0.0001-$0.001 per prediction (amortized)
- Freshness: Hours to days old
Operational Considerations
Pros:
- Lowest cost per prediction
- Simple to operate
- Predictable resource usage
- Can use spot instances
Cons:
- Stale predictions
- All-or-nothing (if job fails, no predictions)
- Storage requirements for all predictions
- Cold start problem (new users have no predictions)
Real-World Example
E-commerce recommendation system:
- 10M users, 100K products
- Nightly batch job generates recommendations
- Top 50 recommendations per user stored
- 500M predictions generated nightly
- Spark job processes in 2 hours
- Cost: $150 per run ($4,500/month)
Pattern 2: Real-Time Serving
Description
Models deployed as services responding to synchronous requests with predictions computed on-demand.
Architecture:
Application → Load Balancer → Model Server → Response
When Used
Ideal for:
- User-facing features requiring fresh data
- Fraud detection (at transaction time)
- Content moderation (as content is posted)
- Dynamic pricing
- Real-time personalization
Requirements:
- Low latency (10-100ms)
- Predictions depend on recent data
- Moderate traffic volume
Implementation Patterns
REST API:
- HTTP endpoint
- JSON request/response
- Horizontal scaling with load balancer
- Standard monitoring and observability
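For reference, a minimal REST endpoint might look like the sketch below. FastAPI is assumed here as one common choice; the model artifact and request schema are illustrative, not taken from any surveyed system.

```python
# serve.py -- minimal REST model server sketch
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/fraud.joblib")   # illustrative model artifact

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    score: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Compute a single prediction on demand from the request payload.
    score = float(model.predict([req.features])[0])
    return PredictResponse(score=score)

# Run with: uvicorn serve:app --host 0.0.0.0 --port 8080
```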
gRPC:
- Binary protocol
- Lower latency than REST
- Type-safe interfaces
- Efficient serialization
Performance Characteristics
From our study:
- Latency: 10-100ms (p99)
- Throughput: 100-10K RPS per instance
- Cost: $0.001-$0.01 per prediction
- Freshness: Real-time (uses latest features)
Operational Considerations
Pros:
- Fresh predictions using current data
- Handles new users immediately
- Flexible (can change model logic anytime)
- Natural scaling path
Cons:
- Higher cost per prediction
- Infrastructure complexity
- Strict latency requirements to meet under load
- Scaling challenges at extreme traffic
Real-World Example
Fraud detection system:
- 5K transactions per second
- Predictions required within 50ms
- Kubernetes deployment with 20 model servers
- GPU acceleration for complex models
- Automatic scaling based on traffic
- Cost: $15K/month infrastructure
Pattern 3: Edge Deployment
Description
Models deployed to edge locations (CDN, mobile devices, IoT) for minimal latency and offline capability.
Architecture:
Edge Location → Local Model → Inference → Response
When Used
Ideal for:
- Mobile apps (offline capability)
- IoT devices (network constraints)
- Latency-critical applications (<10ms)
- Privacy-sensitive use cases
- Bandwidth-limited scenarios
Requirements:
- Ultra-low latency
- Offline functionality
- Privacy constraints
- Cost savings on cloud inference
Implementation Patterns
Mobile deployment:
- TensorFlow Lite, Core ML
- Model bundled in app
- On-device inference
- Periodic model updates
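The same TensorFlow Lite artifact that ships in the app can be exercised from Python before release, which is a common way to validate the converted model. A minimal sketch, assuming an image classifier converted to .tflite (paths are illustrative):

```python
# Validate a bundled TFLite model before shipping it in the app.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="models/classifier.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the shape and dtype the converted model expects.
image = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], image)
interpreter.invoke()
probabilities = interpreter.get_tensor(output_details[0]["index"])
print(probabilities)
```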
IoT deployment:
- Edge TPU, Intel Movidius
- Lightweight models
- Local inference
- Occasional cloud sync
CDN edge compute:
- Cloudflare Workers, AWS Lambda@Edge
- Models at edge locations
- Geographic distribution
- Low-latency serving
Performance Characteristics
From our study:
- Latency: <10ms (local inference)
- Throughput: Device/edge location dependent
- Cost: Minimal (inference on user device)
- Freshness: Delayed (model updates periodic)
Operational Considerations
Pros:
- Lowest latency possible
- Offline capability
- Privacy (data never leaves device)
- Minimal cloud costs
Cons:
- Model size constraints
- Device capability limitations
- Model update complexity
- Version management challenges
- Difficult monitoring and debugging
Real-World Example
Image classification mobile app:
- 1M active users
- On-device inference (Core ML)
- 25MB model bundled with app
- Predictions <5ms
- Model updates quarterly via app update
- Cloud cost: $500/month (model storage only)
Pattern 4: Hybrid Batch + Real-Time
Description
Combines batch and real-time: precompute most predictions, compute personalized predictions on-demand.
Architecture:
Batch: Scheduled Job → Precompute → Cache
Real-time: Request → Check Cache → Compute if needed → Response
When Used
Ideal for:
- Recommendations (popular items batch, personalized real-time)
- Search ranking (base scores batch, personalization real-time)
- Risk scoring (base scores batch, real-time overrides)
Requirements:
- Partially cacheable predictions
- Balance between cost and freshness
- Moderate scale
Implementation Patterns
Tiered approach:
- Batch job computes base predictions
- Real-time service adds personalization
- Cache stores batch results
- Real-time queries cache, computes delta
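A minimal sketch of that read path follows, using an in-memory dict as a stand-in for the cache (in production this would typically be Redis or a similar store); `compute_fn` is an illustrative hook for the real-time model call.

```python
# Tiered read path: batch-precomputed cache first, on-demand compute on a miss.
from typing import Callable

batch_cache: dict[str, list[int]] = {}   # populated by the nightly batch job

def get_recommendations(user_id: str,
                        compute_fn: Callable[[str], list[int]]) -> list[int]:
    # Fast path: serve the batch-precomputed result if present.
    cached = batch_cache.get(user_id)
    if cached is not None:
        return cached

    # Slow path: compute on demand (e.g. for new users or cache misses)
    # and write back so the next request hits the fast path.
    fresh = compute_fn(user_id)
    batch_cache[user_id] = fresh
    return fresh
```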
Feature separation:
- Batch computes expensive features
- Real-time adds fresh features
- Model runs on combined features
- Balance cost and latency
Performance Characteristics
From our study:
- Latency: 20-50ms (cache) or 100-200ms (compute)
- Cache hit rate: 60-90%
- Cost: $0.0005-$0.005 per prediction (blended)
- Freshness: Mixed (hours for batch, real-time for personalization)
Operational Considerations
Pros:
- Cost effective (batch) + responsive (real-time)
- Better freshness than pure batch
- Cheaper than pure real-time
- Flexible trade-offs
Cons:
- More complex architecture
- Cache invalidation challenges
- Monitoring complexity
- Requires good caching strategy
Real-World Example
Video streaming recommendations:
- 50M users
- Popular content: batch recommendations nightly
- Personalization: real-time based on current session
- 85% cache hit rate
- Latency: 30ms (cached) or 150ms (computed)
- Cost: $0.002 per prediction (blended)
- Infrastructure: $8K/month
Pattern 5: Stream Processing
Description
Models process data streams in real-time, producing predictions continuously as data arrives.
Architecture:
Data Stream → Stream Processor → Model Inference → Output Stream
When Used
Ideal for:
- Anomaly detection on metrics
- Real-time monitoring and alerting
- Click stream analysis
- IoT sensor data processing
- Log analysis
Requirements:
- Continuous data streams
- Real-time processing needed
- High throughput
- Tolerance for occasional latency spikes
Implementation Patterns
Stream processing frameworks:
- Kafka Streams, Flink, Spark Streaming
- Consume from message queue
- Stateful or stateless processing
- Output to downstream systems
Micro-batching:
- Accumulate events briefly
- Process in small batches
- Balance latency and throughput
- Efficient resource usage
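A minimal sketch of the micro-batching loop, using a standard-library queue as a stand-in for the message broker; `score` and `sink` are illustrative hooks for the model and the output stream.

```python
# Micro-batching loop: accumulate events briefly, then score once per batch.
import queue
import time
from typing import Callable, Sequence

def consume(events: queue.Queue,
            score: Callable[[Sequence], Sequence[float]],
            sink: Callable[[Sequence, Sequence[float]], None],
            window: float = 0.2,
            max_batch: int = 512) -> None:
    while True:
        batch = []
        deadline = time.monotonic() + window
        # Collect events until the window closes or the batch fills up.
        while time.monotonic() < deadline and len(batch) < max_batch:
            try:
                batch.append(events.get(timeout=max(0.0, deadline - time.monotonic())))
            except queue.Empty:
                break
        if batch:
            # One vectorized inference call per micro-batch, then forward results.
            sink(batch, score(batch))
```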
Performance Characteristics
From our study:
- Latency: 100ms-1s (including queuing)
- Throughput: 10K-1M events per second
- Cost: $0.0001-$0.001 per prediction
- Freshness: Near real-time
Operational Considerations
Pros:
- Handles high-throughput streams
- Scales horizontally
- Decouples producers and consumers
- Good for event-driven architectures
Cons:
- Operational complexity
- Requires stream processing infrastructure
- Harder to debug
- Backpressure management needed
Real-World Example
Security monitoring system:
- 100K events per second
- Anomaly detection on logs
- Flink processing pipeline
- 200ms end-to-end latency
- Scales to millions of events/day
- Cost: $12K/month
Pattern 6: Model Cascade
Description
Multiple models in sequence, with simpler models filtering before expensive models run.
Architecture:
Request → Simple Model → Filter → Complex Model → Response
When Used
Ideal for:
- Most cases handled by simple model
- Expensive model only for hard cases
- Cost optimization critical
- Two-stage ranking systems
Requirements:
- Hierarchical problem structure
- Clear filtering criteria
- Significant cost difference between models
Implementation Patterns
Binary filter:
- Fast model classifies as simple/complex
- Simple cases return immediately
- Complex cases escalate to large model
Confidence threshold:
- Fast model with confidence score
- High confidence → return prediction
- Low confidence → run expensive model
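A minimal sketch of the confidence-threshold variant, assuming both models expose a scikit-learn-style interface; the 0.9 threshold is illustrative and would be tuned per use case.

```python
# Confidence-threshold cascade: cheap model answers when confident,
# otherwise the request escalates to the expensive model.
import numpy as np

def cascade_predict(features, fast_model, slow_model, threshold: float = 0.9):
    probs = fast_model.predict_proba([features])[0]
    confidence = float(np.max(probs))

    # Fast path: return the cheap model's label when confidence is high enough.
    if confidence >= threshold:
        return int(np.argmax(probs)), "fast"

    # Slow path: fall back to the expensive model for ambiguous cases.
    return int(slow_model.predict([features])[0]), "slow"
```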
Progressive refinement:
- Cheap model gives rough answer
- Refine with expensive model if needed
- Callers can opt for a fast result or a more accurate one
Performance Characteristics
From our study:
- Latency: 10-50ms (fast path) or 100-500ms (slow path)
- Fast path hit rate: 70-90%
- Cost reduction: 60-80% vs. expensive model alone
- Accuracy: Maintained or improved
Operational Considerations
Pros:
- Significant cost savings
- Most requests fast
- Maintains accuracy for complex cases
- Optimal resource usage
Cons:
- More complex pipeline
- Multiple models to maintain
- Threshold tuning requires experimentation
- Monitoring complexity
Real-World Example
Content moderation:
- Simple model (rules + small ML) handles 80% of content
- Complex model (large neural net) for ambiguous content
- 20ms latency for simple path
- 200ms latency for complex path
- Cost: $0.0005 per prediction (blended, down from $0.003)
- Accuracy maintained at 98%
Decision Framework
Choosing a Deployment Pattern
Question 1: What latency do you need?
- <10ms → Edge deployment
- 10-100ms → Real-time serving or hybrid
- 100ms-1s → Stream processing or real-time
- >1s or not time-sensitive → Batch
Question 2: What’s your traffic pattern?
- Bursty → Batch or hybrid
- Steady → Real-time serving
- Stream/event-based → Stream processing
- User-initiated → Real-time serving or edge
Question 3: What’s your scale?
- <100 RPS → Real-time serving
- 100-10K RPS → Real-time or hybrid
- >10K RPS → Hybrid, batch, or cascade
- Extreme scale → Batch or edge
Question 4: What’s your cost sensitivity?
- Very high → Batch or edge
- Moderate → Hybrid or cascade
- Low → Real-time serving
Question 5: How fresh do predictions need to be?
- Real-time data required → Real-time or stream
- Recent data acceptable → Hybrid
- Stale data okay → Batch
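The questions above can be compressed into a rough heuristic. The sketch below is illustrative only, with coarse inputs standing in for the fuller discussion each question deserves.

```python
# Rough restatement of the decision framework; treat the output as a
# starting point for discussion, not a prescription.
def suggest_pattern(latency_ms: float, cost_sensitive: bool, needs_fresh_data: bool) -> str:
    if latency_ms < 10:
        return "edge deployment"
    if not needs_fresh_data:
        return "batch inference"
    if cost_sensitive:
        return "hybrid or cascade"
    if latency_ms <= 100:
        return "real-time serving"
    return "stream processing or real-time serving"
```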
Evolution of Deployment Patterns
Companies typically evolve through stages:
Stage 1: Early (0-100K users)
Pattern: Simple real-time serving
- Single model server
- Basic scaling
- Minimal optimization
Why: Simplicity matters more than efficiency. Scale doesn’t justify complexity.
Stage 2: Growth (100K-1M users)
Pattern: Hybrid batch + real-time
- Batch for expensive computations
- Real-time for personalization
- Caching layer added
Why: Cost becomes concern. Hybrid balances cost and latency.
Stage 3: Scale (1M-10M users)
Pattern: Multiple patterns for different use cases
- Critical path: Real-time
- Expensive features: Batch
- Edge cases: Cascade
Why: Optimized patterns per use case. One-size-fits-all no longer works.
Stage 4: Maturity (10M+ users)
Pattern: Sophisticated, optimized architecture
- Edge deployment for ultra-low latency
- Multi-region serving
- Advanced caching strategies
- Automated cost optimization
Why: Scale demands optimization. Infrastructure investment justified.
Cost Analysis by Pattern
Based on serving 1 billion predictions per month:
Batch Inference
- Infrastructure: $2,000/month
- Storage: $500/month
- Total: $2,500/month
- Cost per 1,000 predictions: $0.0025
Real-Time Serving
- Infrastructure: $25,000/month
- Total: $25,000/month
- Cost per 1,000 predictions: $0.025
Edge Deployment
- Infrastructure: $500/month (model distribution)
- Device inference: $0 (user devices)
- Total: $500/month
- Cost per 1,000 predictions: $0.0005
Hybrid Batch + Real-Time
- Batch infrastructure: $2,000/month
- Real-time infrastructure: $8,000/month
- Storage: $500/month
- Total: $10,500/month
- Cost per 1,000 predictions: $0.0105
Stream Processing
- Infrastructure: $15,000/month
- Message queue: $3,000/month
- Total: $18,000/month
- Cost per 1,000 predictions: $0.018
Model Cascade
- Simple model: $3,000/month
- Complex model: $8,000/month
- Total: $11,000/month
- Cost per 1,000 predictions: $0.011
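The per-1,000-prediction figures follow directly from the monthly totals and the 1 billion-prediction baseline, for example:

```python
# Monthly total divided by the number of 1K-prediction units in a
# 1-billion-prediction month gives the blended cost per 1,000 predictions.
MONTHLY_PREDICTIONS = 1_000_000_000

def cost_per_1k(monthly_total_usd: float) -> float:
    return monthly_total_usd / (MONTHLY_PREDICTIONS / 1_000)

print(cost_per_1k(2_500))    # batch inference   -> 0.0025
print(cost_per_1k(25_000))   # real-time serving -> 0.025
```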
Emerging Trends
Serverless ML
Function-as-a-service for model inference:
- AWS Lambda, Google Cloud Functions
- Auto-scaling to zero
- Pay-per-invocation
Pros:
- Zero ops
- Perfect scaling
Cons:
- Cold start latency
- Cost at scale
Adoption: Growing for low-traffic models
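A minimal sketch of what this looks like in practice, assuming AWS Lambda behind an API gateway with the model packaged in a layer; the path and payload shape are illustrative.

```python
# Serverless inference handler sketch. The model loads at module import so
# warm invocations reuse it; cold starts pay the load cost noted above.
import json
import joblib

model = joblib.load("/opt/model/fraud.joblib")   # illustrative layer path

def handler(event, context):
    features = json.loads(event["body"])["features"]
    score = float(model.predict([features])[0])
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```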
Multi-Model Serving
Single infrastructure serves multiple models:
- Shared resources
- Dynamic model loading
- Cost efficiency
Adoption: 35% of companies use this
Automated Model Updates
Continuous training and deployment:
- Models retrain automatically
- A/B test new versions
- Gradual rollout
Adoption: 25% have full automation
Key Findings
Most Common Pattern
Hybrid batch + real-time (38% of companies)
- Best balance of cost and responsiveness
- Flexibility to optimize per use case
- Natural scaling path
Biggest Challenge
Cost management (mentioned by 72% of respondents)
- Inference costs can exceed training costs by 100x
- Optimization requires constant attention
- Trade-offs between performance and cost
Biggest Regret
Starting with real-time before needed (48% of respondents)
- Over-engineered early
- Batch would have sufficed initially
- Technical debt from premature complexity
Recommendations
For Startups
Start simple: Real-time serving or batch, depending on latency needs. Don’t over-engineer.
Optimize later: Wait until scale justifies optimization effort.
For Growing Companies
Adopt hybrid patterns: Balance cost and freshness.
Invest in infrastructure: Platform for model deployment pays dividends.
For Enterprise
Pattern per use case: Different patterns for different requirements.
Automation: Automate deployment, monitoring, retraining.
Cost optimization: Dedicated effort for cost management.
Conclusion
No single deployment pattern is universally best. The right choice depends on latency requirements, scale, cost constraints, and operational capabilities.
Key takeaways:
- Start simple, optimize as scale demands
- Hybrid patterns balance trade-offs effectively
- Cost management requires ongoing attention
- Patterns evolve as companies grow
- Measure and iterate
Choose the pattern that fits your requirements today, and be prepared to evolve as your needs change.
Research methodology and detailed case studies available at github.com/acme/ml-deployment-patterns