Machine Learning Model Deployment Patterns Study

Ryan Dahlberg
December 8, 2025 · 12 min read

Machine learning models deliver value only when deployed in production. Yet deployment patterns vary dramatically based on latency requirements, scale, cost constraints, and operational considerations.

We studied ML deployment patterns across 50 companies (from startups to enterprises) and 200+ production models. This research identifies common deployment patterns and their trade-offs, and offers guidance for choosing the right approach.

Research Methodology

Data Collection

Interviews:

  • 50 companies across industries
  • 75 ML engineers and infrastructure leads
  • 30-60 minute structured interviews
  • Questions about architecture, challenges, costs

System Analysis:

  • Architectural diagrams
  • Performance metrics
  • Cost data
  • Operational metrics

Use Cases Covered:

  • Recommendation systems
  • Content moderation
  • Fraud detection
  • Image and video processing
  • Natural language processing
  • Search and ranking
  • Predictive analytics

Key Questions

Our research examined:

  • What deployment patterns are most common?
  • What factors drive pattern selection?
  • What are typical performance characteristics?
  • What operational challenges exist?
  • What are cost structures?
  • How do patterns evolve as companies scale?

Pattern 1: Batch Inference

Description

Models run on scheduled intervals, processing batches of data and storing predictions for later retrieval.

Architecture:

Scheduled Job → Load Data → Batch Inference → Store Predictions → Application Reads

When Used

Ideal for:

  • Recommendations precomputed nightly
  • Analytics and reporting
  • Risk scoring (daily credit checks)
  • Email campaign targeting
  • Non-time-sensitive predictions

Requirements:

  • Predictions can be stale (hours to days)
  • Predictable workload patterns
  • Cost sensitivity

Implementation Patterns

Simple batch:

  • Cron job or orchestration tool (Airflow)
  • Load data from database or data lake
  • Run inference on entire dataset
  • Store results in database
  • Application queries cached predictions
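
Below is a minimal sketch of this simple batch flow in Python, assuming a scikit-learn-style model serialized with joblib and a SQL warehouse reachable via SQLAlchemy; the table, column, and connection names are illustrative, not drawn from the study.

```python
# Simple nightly batch job: load features, score everything, store predictions.
# Assumes a scikit-learn-style model and a SQLAlchemy-compatible warehouse;
# all names below are illustrative.
import joblib
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse/analytics")  # hypothetical DSN
model = joblib.load("model.joblib")

# Load the full feature set and score it in one pass.
features = pd.read_sql("SELECT user_id, f1, f2, f3 FROM user_features", engine)
features["prediction"] = model.predict(features[["f1", "f2", "f3"]])

# Persist results so the application can read precomputed predictions later.
features[["user_id", "prediction"]].to_sql(
    "predictions", engine, if_exists="replace", index=False
)
```

A scheduler such as cron or an Airflow DAG would simply invoke a script like this on the desired cadence.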

Distributed batch:

  • Spark or distributed processing framework
  • Partition data across workers
  • Parallel inference
  • Aggregate and store results
  • Scales to billions of predictions
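
A sketch of the distributed variant with PySpark, assuming Spark 3.x and a picklable scikit-learn-style model; the feature columns and storage paths are illustrative.

```python
# Distributed batch inference: broadcast the model, score partitions in parallel.
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("batch-inference").getOrCreate()

# Broadcast once so each executor deserializes the model a single time.
model_bc = spark.sparkContext.broadcast(joblib.load("model.joblib"))

@pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Each call receives one partition's worth of feature columns.
    X = pd.concat([f1, f2, f3], axis=1).to_numpy()
    return pd.Series(model_bc.value.predict(X))

df = spark.read.parquet("s3://bucket/features/")  # illustrative path, columns f1-f3
scored = df.withColumn("prediction", predict_udf(df["f1"], df["f2"], df["f3"]))
scored.write.mode("overwrite").parquet("s3://bucket/predictions/")
```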

Performance Characteristics

From our study:

  • Latency: N/A (predictions precomputed)
  • Throughput: 1K-1M predictions per batch job
  • Cost: $0.0001-$0.001 per prediction (amortized)
  • Freshness: Hours to days old

Operational Considerations

Pros:

  • Lowest cost per prediction
  • Simple to operate
  • Predictable resource usage
  • Can use spot instances

Cons:

  • Stale predictions
  • All-or-nothing (if job fails, no predictions)
  • Storage requirements for all predictions
  • Cold start problem (new users have no predictions)

Real-World Example

E-commerce recommendation system:

  • 10M users, 100K products
  • Nightly batch job generates recommendations
  • Top 50 recommendations per user stored
  • 500M predictions generated nightly
  • Spark job processes in 2 hours
  • Cost: $150 per run ($4,500/month)

Pattern 2: Real-Time Serving

Description

Models deployed as services responding to synchronous requests with predictions computed on-demand.

Architecture:

Application → Load Balancer → Model Server → Response

When Used

Ideal for:

  • User-facing features requiring fresh data
  • Fraud detection (at transaction time)
  • Content moderation (as content is posted)
  • Dynamic pricing
  • Real-time personalization

Requirements:

  • Low latency (10-100ms)
  • Predictions depend on recent data
  • Moderate traffic volume

Implementation Patterns

REST API:

  • HTTP endpoint
  • JSON request/response
  • Horizontal scaling with load balancer
  • Standard monitoring and observability
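
A minimal sketch of a REST serving endpoint using FastAPI, assuming a scikit-learn-style model loaded at startup; the route and field names are illustrative.

```python
# Real-time REST serving: the model loads once, predictions are computed per request.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at process startup

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Score on demand using whatever features the caller just sent.
    score = float(model.predict([req.features])[0])
    return {"prediction": score}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```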

gRPC:

  • Binary protocol
  • Lower latency than REST
  • Type-safe interfaces
  • Efficient serialization

Performance Characteristics

From our study:

  • Latency: 10-100ms (p99)
  • Throughput: 100-10K RPS per instance
  • Cost: $0.001-$0.01 per prediction
  • Freshness: Real-time (uses latest features)

Operational Considerations

Pros:

  • Fresh predictions using current data
  • Handles new users immediately
  • Flexible (can change model logic anytime)
  • Natural scaling path

Cons:

  • Higher cost per prediction
  • Infrastructure complexity
  • Latency requirements
  • Scaling challenges at extreme traffic

Real-World Example

Fraud detection system:

  • 5K transactions per second
  • Predictions required within 50ms
  • Kubernetes deployment with 20 model servers
  • GPU acceleration for complex models
  • Automatic scaling based on traffic
  • Cost: $15K/month infrastructure

Pattern 3: Edge Deployment

Description

Models deployed to edge locations (CDN, mobile devices, IoT) for minimal latency and offline capability.

Architecture:

Edge Location → Local Model → Inference → Response

When Used

Ideal for:

  • Mobile apps (offline capability)
  • IoT devices (network constraints)
  • Latency-critical applications (<10ms)
  • Privacy-sensitive use cases
  • Bandwidth-limited scenarios

Requirements:

  • Ultra-low latency
  • Offline functionality
  • Privacy constraints
  • Cost savings on cloud inference

Implementation Patterns

Mobile deployment:

  • TensorFlow Lite, Core ML
  • Model bundled in app
  • On-device inference
  • Periodic model updates
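
A sketch of the export side of mobile deployment using TensorFlow Lite, assuming an existing Keras model; the on-device runtime would load the resulting artifact, and the desktop interpreter below is only a sanity check.

```python
# Convert a Keras model to TensorFlow Lite for on-device inference.
import numpy as np
import tensorflow as tf

keras_model = tf.keras.models.load_model("model.keras")  # illustrative path

# Default optimizations (including quantization) shrink the bundle shipped in the app.
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("model.tflite", "wb") as f:
    f.write(converter.convert())

# Sanity-check the converted model the same way a device runtime would run it.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))
```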

IoT deployment:

  • Edge TPU, Intel Movidius
  • Lightweight models
  • Local inference
  • Occasional cloud sync

CDN edge compute:

  • Cloudflare Workers, AWS Lambda@Edge
  • Models at edge locations
  • Geographic distribution
  • Low-latency serving

Performance Characteristics

From our study:

  • Latency: <10ms (local inference)
  • Throughput: Device/edge location dependent
  • Cost: Minimal (inference on user device)
  • Freshness: Delayed (model updates periodic)

Operational Considerations

Pros:

  • Lowest latency possible
  • Offline capability
  • Privacy (data never leaves device)
  • Minimal cloud costs

Cons:

  • Model size constraints
  • Device capability limitations
  • Model update complexity
  • Version management challenges
  • Difficult monitoring and debugging

Real-World Example

Image classification mobile app:

  • 1M active users
  • On-device inference (Core ML)
  • 25MB model bundled with app
  • Predictions <5ms
  • Model updates quarterly via app update
  • Cloud cost: $500/month (model storage only)

Pattern 4: Hybrid Batch + Real-Time

Description

Combines batch and real-time: precompute most predictions, compute personalized predictions on-demand.

Architecture:

Batch: Scheduled Job → Precompute → Cache
Real-time: Request → Check Cache → Compute if needed → Response

When Used

Ideal for:

  • Recommendations (popular items batch, personalized real-time)
  • Search ranking (base scores batch, personalization real-time)
  • Risk scoring (base scores batch, real-time overrides)

Requirements:

  • Partially cacheable predictions
  • Balance between cost and freshness
  • Moderate scale

Implementation Patterns

Tiered approach:

  • Batch job computes base predictions
  • Real-time service adds personalization
  • Cache stores batch results
  • Real-time queries cache, computes delta
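
A minimal sketch of the tiered lookup in Python, assuming batch results are written to Redis and a scikit-learn-style model handles misses; the key scheme and TTL are illustrative.

```python
# Hybrid lookup: serve the batch-precomputed prediction from cache, compute on a miss.
import json
import joblib
import redis

cache = redis.Redis(host="localhost", port=6379)
model = joblib.load("model.joblib")

def get_prediction(user_id: str, fresh_features: list[float]) -> float:
    cached = cache.get(f"pred:{user_id}")
    if cached is not None:
        return json.loads(cached)  # batch result, possibly hours old
    # Cache miss (e.g. a new user): compute on demand and write back with a TTL.
    score = float(model.predict([fresh_features])[0])
    cache.set(f"pred:{user_id}", json.dumps(score), ex=3600)
    return score
```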

Feature separation:

  • Batch computes expensive features
  • Real-time adds fresh features
  • Model runs on combined features
  • Balance cost and latency

Performance Characteristics

From our study:

  • Latency: 20-50ms (cache) or 100-200ms (compute)
  • Cache hit rate: 60-90%
  • Cost: $0.0005-$0.005 per prediction (blended)
  • Freshness: Mixed (hours for batch, real-time for personalization)

Operational Considerations

Pros:

  • Cost effective (batch) + responsive (real-time)
  • Better freshness than pure batch
  • Cheaper than pure real-time
  • Flexible trade-offs

Cons:

  • More complex architecture
  • Cache invalidation challenges
  • Monitoring complexity
  • Requires good caching strategy

Real-World Example

Video streaming recommendations:

  • 50M users
  • Popular content: batch recommendations nightly
  • Personalization: real-time based on current session
  • 85% cache hit rate
  • Latency: 30ms (cached) or 150ms (computed)
  • Cost: $0.002 per prediction (blended)
  • Infrastructure: $8K/month

Pattern 5: Stream Processing

Description

Models process data streams in real-time, producing predictions continuously as data arrives.

Architecture:

Data Stream → Stream Processor → Model Inference → Output Stream

When Used

Ideal for:

  • Anomaly detection on metrics
  • Real-time monitoring and alerting
  • Click stream analysis
  • IoT sensor data processing
  • Log analysis

Requirements:

  • Continuous data streams
  • Real-time processing needed
  • High throughput
  • Tolerance for occasional latency spikes

Implementation Patterns

Stream processing frameworks:

  • Kafka Streams, Flink, Spark Streaming
  • Consume from message queue
  • Stateful or stateless processing
  • Output to downstream systems

Micro-batching:

  • Accumulate events briefly
  • Process in small batches
  • Balance latency and throughput
  • Efficient resource usage
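
A sketch of micro-batched stream inference with kafka-python, assuming JSON events that carry an id and a feature vector; the topic names, poll window, and model are illustrative assumptions.

```python
# Micro-batched stream inference: accumulate events briefly, score them as a batch.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("model.joblib")
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         value_deserializer=json.loads)
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

while True:
    # poll() returns whatever arrived within the window, up to max_records,
    # which is the latency/throughput trade-off described above.
    batch = consumer.poll(timeout_ms=200, max_records=500)
    events = [record.value for records in batch.values() for record in records]
    if not events:
        continue
    predictions = model.predict([e["features"] for e in events])
    for event, pred in zip(events, predictions):
        producer.send("predictions", {"id": event["id"], "score": float(pred)})
```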

Performance Characteristics

From our study:

  • Latency: 100ms-1s (including queuing)
  • Throughput: 10K-1M events per second
  • Cost: $0.0001-$0.001 per prediction
  • Freshness: Near real-time

Operational Considerations

Pros:

  • Handles high-throughput streams
  • Scales horizontally
  • Decouples producers and consumers
  • Good for event-driven architectures

Cons:

  • Operational complexity
  • Requires stream processing infrastructure
  • Harder to debug
  • Backpressure management needed

Real-World Example

Security monitoring system:

  • 100K events per second
  • Anomaly detection on logs
  • Flink processing pipeline
  • 200ms end-to-end latency
  • Scales to millions of events/day
  • Cost: $12K/month

Pattern 6: Model Cascade

Description

Multiple models in sequence, with simpler models filtering before expensive models run.

Architecture:

Request → Simple Model → Filter → Complex Model → Response

When Used

Ideal for:

  • Most cases handled by simple model
  • Expensive model only for hard cases
  • Cost optimization critical
  • Two-stage ranking systems

Requirements:

  • Hierarchical problem structure
  • Clear filtering criteria
  • Significant cost difference between models

Implementation Patterns

Binary filter:

  • Fast model classifies as simple/complex
  • Simple cases return immediately
  • Complex cases escalate to large model

Confidence threshold:

  • Fast model with confidence score
  • High confidence → return prediction
  • Low confidence → run expensive model
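
A minimal sketch of the confidence-threshold cascade, assuming two scikit-learn-style classifiers; the 0.9 threshold is illustrative and would normally be tuned against held-out data.

```python
# Cascade: a cheap model answers when confident, a large model handles the rest.
import numpy as np
import joblib

fast_model = joblib.load("fast_model.joblib")
slow_model = joblib.load("slow_model.joblib")
CONFIDENCE_THRESHOLD = 0.9  # illustrative; tune offline

def classify(features: list[float]) -> int:
    proba = fast_model.predict_proba([features])[0]
    if np.max(proba) >= CONFIDENCE_THRESHOLD:
        # Fast path: the cheap model is confident enough to answer alone.
        return int(np.argmax(proba))
    # Slow path: escalate ambiguous cases to the expensive model.
    return int(slow_model.predict([features])[0])
```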

Progressive refinement:

  • Cheap model gives rough answer
  • Refine with expensive model if needed
  • User can get fast or accurate result

Performance Characteristics

From our study:

  • Latency: 10-50ms (fast path) or 100-500ms (slow path)
  • Fast path hit rate: 70-90%
  • Cost reduction: 60-80% vs. expensive model alone
  • Accuracy: Maintained or improved

Operational Considerations

Pros:

  • Significant cost savings
  • Most requests fast
  • Maintains accuracy for complex cases
  • Optimal resource usage

Cons:

  • More complex pipeline
  • Multiple models to maintain
  • Tuning threshold requires experimentation
  • Monitoring complexity

Real-World Example

Content moderation:

  • Simple model (rules + small ML) handles 80% of content
  • Complex model (large neural net) for ambiguous content
  • 20ms latency for simple path
  • 200ms latency for complex path
  • Cost: $0.0005 per prediction (blended, down from $0.003)
  • Accuracy maintained at 98%

Decision Framework

Choosing a Deployment Pattern

Question 1: What latency do you need?

  • <10ms → Edge deployment
  • 10-100ms → Real-time serving or hybrid
  • 100ms-1s → Stream processing or real-time
  • >1s or not time-sensitive → Batch

Question 2: What’s your traffic pattern?

  • Bursty → Batch or hybrid
  • Steady → Real-time serving
  • Stream/event-based → Stream processing
  • User-initiated → Real-time serving or edge

Question 3: What’s your scale?

  • <100 RPS → Real-time serving
  • 100-10K RPS → Real-time or hybrid
  • >10K RPS → Hybrid, batch, or cascade
  • Extreme scale → Batch or edge

Question 4: What’s your cost sensitivity?

  • Very high → Batch or edge
  • Moderate → Hybrid or cascade
  • Low → Real-time serving

Question 5: How fresh do predictions need to be?

  • Real-time data required → Real-time or stream
  • Recent data acceptable → Hybrid
  • Stale data okay → Batch
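
The questions above can be compressed into a rough first-pass heuristic, sketched below; the branches mirror the ranges in this framework and are a starting point, not a substitute for weighing the trade-offs.

```python
# First-pass pattern suggestion based on the framework's questions; illustrative only.
def suggest_pattern(latency_ms: float, needs_fresh_data: bool,
                    cost_sensitive: bool, event_stream: bool) -> str:
    if latency_ms < 10:
        return "edge"
    if event_stream:
        return "stream"
    if not needs_fresh_data:
        return "batch" if cost_sensitive else "hybrid"
    return "hybrid" if cost_sensitive else "real-time"

# Example: fraud scoring that needs fresh data within 50ms and is not cost-bound.
print(suggest_pattern(50, needs_fresh_data=True,
                      cost_sensitive=False, event_stream=False))  # real-time
```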

Evolution of Deployment Patterns

Companies typically evolve through stages:

Stage 1: Early (0-100K users)

Pattern: Simple real-time serving

  • Single model server
  • Basic scaling
  • Minimal optimization

Why: Simplicity matters more than efficiency. Scale doesn’t justify complexity.

Stage 2: Growth (100K-1M users)

Pattern: Hybrid batch + real-time

  • Batch for expensive computations
  • Real-time for personalization
  • Caching layer added

Why: Cost becomes a concern. Hybrid balances cost and latency.

Stage 3: Scale (1M-10M users)

Pattern: Multiple patterns for different use cases

  • Critical path: Real-time
  • Expensive features: Batch
  • Edge cases: Cascade

Why: Optimized patterns per use case. One-size-fits-all no longer works.

Stage 4: Maturity (10M+ users)

Pattern: Sophisticated, optimized architecture

  • Edge deployment for ultra-low latency
  • Multi-region serving
  • Advanced caching strategies
  • Automated cost optimization

Why: Scale demands optimization. Infrastructure investment justified.

Cost Analysis by Pattern

Based on serving 1 billion predictions per month:

Batch Inference

  • Infrastructure: $2,000/month
  • Storage: $500/month
  • Total: $2,500/month
  • Cost per 1,000 predictions: $0.0025

Real-Time Serving

  • Infrastructure: $25,000/month
  • Total: $25,000/month
  • Cost per 1,000 predictions: $0.025

Edge Deployment

  • Infrastructure: $500/month (model distribution)
  • Device inference: $0 (user devices)
  • Total: $500/month
  • Cost per 1,000 predictions: $0.0005

Hybrid Batch + Real-Time

  • Batch infrastructure: $2,000/month
  • Real-time infrastructure: $8,000/month
  • Storage: $500/month
  • Total: $10,500/month
  • Cost per 1,000 predictions: $0.0105

Stream Processing

  • Infrastructure: $15,000/month
  • Message queue: $3,000/month
  • Total: $18,000/month
  • Cost per 1,000 predictions: $0.018

Model Cascade

  • Simple model: $3,000/month
  • Complex model: $8,000/month
  • Total: $11,000/month
  • Cost per 1,000 predictions: $0.011

Emerging Trends

Serverless ML

Function-as-a-service for model inference:

  • AWS Lambda, Google Cloud Functions
  • Auto-scaling to zero
  • Pay-per-invocation

Pros: Zero ops, automatic scaling
Cons: Cold-start latency, cost at scale

Adoption: Growing for low-traffic models
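
A minimal sketch of a serverless handler in the AWS Lambda style, assuming an API Gateway-shaped event and a model artifact packaged with the function; all names are illustrative. Loading the model at module scope keeps warm invocations fast, but the first (cold) invocation still pays the load time.

```python
# Serverless inference handler (AWS Lambda style); model loads once per container.
import json
import joblib

model = joblib.load("model.joblib")  # reused across warm invocations

def handler(event, context):
    features = json.loads(event["body"])["features"]
    score = float(model.predict([features])[0])
    return {"statusCode": 200, "body": json.dumps({"prediction": score})}
```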

Multi-Model Serving

Single infrastructure serves multiple models:

  • Shared resources
  • Dynamic model loading
  • Cost efficiency

Adoption: 35% of companies use this
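
A bare-bones sketch of dynamic model loading behind a single serving process; the registry path and lazy-loading scheme are illustrative rather than any specific serving framework's API.

```python
# Multi-model serving: models load lazily on first request and share one process.
import joblib

_loaded: dict[str, object] = {}

def predict(model_name: str, features: list[float]) -> float:
    if model_name not in _loaded:
        # Only models that actually receive traffic occupy memory.
        _loaded[model_name] = joblib.load(f"models/{model_name}.joblib")
    return float(_loaded[model_name].predict([features])[0])
```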

Automated Model Updates

Continuous training and deployment:

  • Models retrain automatically
  • A/B test new versions
  • Gradual rollout

Adoption: 25% have full automation

Key Findings

Most Common Pattern

Hybrid batch + real-time (38% of companies)

  • Best balance of cost and responsiveness
  • Flexibility to optimize per use case
  • Natural scaling path

Biggest Challenge

Cost management (mentioned by 72% of respondents)

  • Inference costs can exceed training costs by 100x
  • Optimization requires constant attention
  • Trade-offs between performance and cost

Biggest Regret

Starting with real-time serving before it was needed (48% of respondents)

  • Over-engineered early
  • Batch would have sufficed initially
  • Technical debt from premature complexity

Recommendations

For Startups

Start simple: Real-time serving or batch, depending on latency needs. Don’t over-engineer.

Optimize later: Wait until scale justifies optimization effort.

For Growing Companies

Adopt hybrid patterns: Balance cost and freshness.

Invest in infrastructure: Platform for model deployment pays dividends.

For Enterprise

Pattern per use case: Different patterns for different requirements.

Automation: Automate deployment, monitoring, retraining.

Cost optimization: Dedicated effort for cost management.

Conclusion

No single deployment pattern is universally best. The right choice depends on latency requirements, scale, cost constraints, and operational capabilities.

Key takeaways:

  1. Start simple, optimize as scale demands
  2. Hybrid patterns balance trade-offs effectively
  3. Cost management requires ongoing attention
  4. Patterns evolve as companies grow
  5. Measure and iterate

Choose the pattern that fits your requirements today, and be prepared to evolve as your needs change.


Research methodology and detailed case studies available at github.com/acme/ml-deployment-patterns

#Research #MachineLearning #MLOps #Deployment #Architecture #ProductionSystems