Challenges in building infrastructure for LLM inference
1. Fast scaling
Running LLM inference in production is a very different game from training models. Unlike training, which is batch-based and predictable, inference is driven by real-time user demand. That demand is often bursty, hard to predict, and unforgiving of latency or downtime.
This means the system needs to scale up quickly during traffic spikes and scale down to zero when idle to save costs. This kind of elasticity is fundamental to efficiency.
However, many organizations treat inference like training: they pre-allocate fixed GPU capacity through long-term commitments. This often leads to:
Over-provisioning: Wasted GPU capacity, high idle costs.
Under-provisioning: Dropped requests, latency spikes, and poor user experience.
Inflexible budgets: Rigid spending that doesn't adapt to real usage patterns.
Why serverless isn’t a silver bullet
The scaling problem may seem familiar; it's one that serverless computing solved years ago. Platforms like AWS Lambda made it easy to scale with demand, but serverless doesn't map well to AI workloads. Here's why:
No GPU support: Most serverless platforms don’t support GPUs. This isn't merely a technical oversight; it's rooted in architectural and practical considerations.
GPUs can’t be sliced easily: GPUs, while powerful and highly parallelizable, are not as flexible as CPUs when it comes to handling multiple inference tasks on different models simultaneously.
High cost of idle GPUs: They're the high-performance sports cars of the computing world, exceptional for specific tasks but costly to maintain, especially if not utilized continuously.
The cold start problem
Inference workloads need infrastructure that can scale quickly, manage costs, and stay performant. A fundamental challenge in scaling is the cold start.
In the context of deploying LLMs in containers, a cold start occurs when a Kubernetes node has never previously run a given deployment. As a result, the container image is not cached locally, and all image layers must be pulled and initialized from scratch.
This issue presents itself in three different stages:
Cloud provisioning: This step involves the time it takes for the cloud provider to allocate a new instance and attach it to the Kubernetes cluster. Depending on the instance type and availability, this can take anywhere from 30 seconds to several minutes, or even hours for high-demand GPUs like Nvidia A100 and H100.
Container image pulling: LLM images are significantly larger and more complex than typical Python job images, due to numerous dependencies and custom libraries. Despite claims of multi-gigabit bandwidth by cloud providers, actual image download speeds are often much slower. As a result, pulling images can take three to five minutes.
Model loading: The time required to load the model depends heavily on its size. LLMs introduce significant delays due to their billions of parameters. Key bottlenecks include:
Slow downloads from model hubs: Platforms like Hugging Face are not optimized for high-throughput, multi-part downloads, making the retrieval of large model files time-consuming.
Sequential data flow: Model files are transferred through multiple hops: remote storage → local disk → memory → GPU. There is minimal or no parallelization between these steps, and each hop adds latency, particularly for large files that are difficult to cache or stream (see the sketch after this list).
Lack of on-demand streaming: Model files must be fully downloaded and written to disk before inference can begin. This introduces additional I/O operations and delays startup.
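To make the sequential flow concrete, here is a minimal timing sketch for a single checkpoint file, assuming a PyTorch checkpoint hosted on Hugging Face; the repo ID and file name are placeholders, and a production loader would typically use sharded safetensors files instead.

```python
# Minimal sketch: timing each hop of the remote storage -> disk -> memory -> GPU path.
# The repo ID and file name below are placeholders; adjust for your own model.
import time

import torch
from huggingface_hub import hf_hub_download

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return result

# Hop 1: remote storage -> local disk (no streaming; the full file must land on disk first)
path = timed("download", lambda: hf_hub_download("my-org/my-llm", "pytorch_model.bin"))

# Hop 2: local disk -> CPU memory
state_dict = timed("load to CPU", lambda: torch.load(path, map_location="cpu"))

# Hop 3: CPU memory -> GPU memory, one tensor at a time
timed("copy to GPU", lambda: {k: v.to("cuda") for k, v in state_dict.items()})
```

On large models, the download hop typically dominates, which is why streaming weights directly into GPU memory, or caching them close to the node, makes such a difference.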
Scaling metrics
Scaling infrastructure for LLM inference requires more than simply reacting to system load. Choosing the right metrics is critical for achieving responsive, efficient, and cost-effective scaling.
CPU utilization. It’s simple and comes with clear thresholds, but it doesn’t reflect real load for Python-based workloads. The Global Interpreter Lock (GIL) limits CPU parallelism, especially on multi-core machines, making this metric misleading for scaling decisions.
GPU utilization. A more relevant metric in theory, but inaccurate in practice. Tools like NVML report GPUs as “utilized” if any kernel runs during a sample window, even briefly. This doesn’t account for batching or actual throughput, leading to premature scale-up or false confidence in capacity.
QPS (queries per second). Widely used in traditional web services, but less useful for LLM inference. Generative requests vary greatly in size and compute cost, depending on input length and the number of tokens generated. As a result, QPS lacks consistency and is hard to tune for auto-scaling.
Concurrency. This metric, which represents the number of active requests either queued or being processed, is an ideal measure for reflecting system load. Concurrency is easy to configure based on batch size and provides a direct correlation with actual system demands, allowing for precise scaling. However, for concurrency to work, you need support from a service framework to automatically instrument concurrency as a metric and serve it as a scaling signal for the deployment platform.
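As an illustration of what that instrumentation can look like, here is a minimal sketch that tracks in-flight requests as a Prometheus gauge in a FastAPI app; an autoscaler would then consume this metric as its scaling signal. The endpoint name is hypothetical.

```python
# Minimal sketch: exposing request concurrency as a scaling signal.
# The /generate endpoint is a placeholder for your actual inference handler.
from fastapi import FastAPI, Request
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # scraped by Prometheus / the autoscaler

# Number of requests currently queued or being processed by this replica
in_flight = Gauge("inference_in_flight_requests", "Active inference requests")

@app.middleware("http")
async def track_concurrency(request: Request, call_next):
    in_flight.inc()
    try:
        return await call_next(request)
    finally:
        in_flight.dec()

@app.post("/generate")
async def generate(payload: dict):
    # ... run the model here ...
    return {"text": "..."}
```

An autoscaler can then be configured to keep per-replica concurrency near the model server’s batch size, adding replicas when the gauge climbs above that target and removing them, down to zero, when it drains.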
2. Build and maintenance cost
Building self-hosted LLM inference infrastructure isn’t just a technical task; it’s a costly, time-consuming commitment.
Complexity
LLM inference requires much more than standard cloud-native stacks can provide. Building the right setup involves:
Provisioning high-performance GPUs (often scarce and regionally limited)
Managing CUDA version compatibility and driver dependencies
Configuring autoscaling, concurrency control, and scale-to-zero behavior
Applying advanced inference optimization techniques such as prefix caching and prefill-decode disaggregation
Setting up observability tools for GPU monitoring, request tracing, and failure detection
Handling model-specific behaviors like streaming, caching, and routing
None of these steps is trivial. Most teams try to force-fit these needs onto general-purpose infrastructure, which only results in reduced performance and longer lead times.
Even if a team pulls it off, every week spent setting up infrastructure is a week not spent improving models or delivering product value. For high-performing AI teams, this opportunity cost is just as real as the infrastructure bill.
Limited flexibility for ML tools and frameworks
Many AI stacks pin model runtimes, such as PyTorch, vLLM, or Transformers, to fixed versions. The primary reason is to cache container images and ensure compatibility with infrastructure-related components. While this simplifies deployment in clusters, the rigidity creates real limitations:
You can’t easily test or deploy newer models or framework versions.
You inherit more tech debt as your stack diverges from community or vendor updates.
LLM deployment speed slows down, putting your team at a competitive disadvantage.
Scaling LLMs should mean exploring faster, better models, without being stuck waiting for infra to catch up.
Support for complex AI systems
An LLM alone doesn’t deliver value. It has to be part of an integrated system, often including:
Pre-processing to clean or transform user inputs
Post-processing to format model outputs for front-end use
Inference code that wraps the model in logic, pipelines, or control flow
Business logic to handle validation, rules, and internal data calls
Data fetchers to connect with databases or feature stores
Multi-model composition for retrieval-augmented generation or ensemble pipelines
Custom APIs to expose the service in the right shape for downstream teams
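As a rough illustration, here is a minimal sketch of one such composed service, with pre-processing, a retrieval step, the model call, business rules, and post-processing in a single request path. Every endpoint and helper name below is hypothetical.

```python
# Minimal sketch of a composed inference service; every helper below is a placeholder
# for real pre-processing, retrieval, inference, and business-logic code.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

def preprocess(text: str) -> str:
    return text.strip()                      # clean / transform user input

def fetch_context(query: str) -> list[str]:
    return ["..."]                           # e.g. vector store or feature store lookup

def call_llm(prompt: str) -> str:
    return "..."                             # e.g. request to a vLLM or OpenAI-compatible backend

def postprocess(raw: str) -> dict:
    return {"answer": raw, "sources": []}    # shape the output for the front end

@app.post("/ask")
def ask(req: AskRequest):
    query = preprocess(req.question)
    if not query:                            # business rule: reject empty input
        return {"error": "empty question"}
    context = fetch_context(query)
    prompt = "\n".join(context) + "\n" + query
    return postprocess(call_llm(prompt))
```

The point is not the specific framework but that all of these steps live in one request path alongside the model call.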
Here’s the catch: most LLM deployment tools aren’t built for this kind of extensibility. They’re designed to load weights and expose a basic API. Anything more complex requires glue code, workarounds, or splitting logic across multiple services.
That leads to:
More engineering effort just to deliver usable features
Poor developer experience for teams trying to consume these AI services
Blocked innovation when tools don’t support use-case-specific customization
The hidden cost: talent
LLM infrastructure requires deep specialization. Companies need engineers who understand GPUs, Kubernetes, ML frameworks, and distributed systems — all in one role. These professionals are rare and expensive, with salaries often 30–50% higher than traditional DevOps engineers.
Even for teams that have the right people, hiring and training to maintain in-house capabilities is a major investment. In one survey, over 60% of public sector IT professionals cited AI talent shortages as the biggest barrier to adoption. It’s no different in the private sector.
3. LLM observability
LLM observability is the practice of monitoring and understanding the behavior of LLM inference systems in production. It combines metrics, logs, and events across infrastructure, application, and model layers to provide end-to-end visibility. The goal is to detect issues early, explain why they occur, and ensure reliable, efficient, and high-quality model responses.
Without proper observability, diagnosing latency issues, scaling problems, or GPU underutilization becomes guesswork. Worse, unnoticed issues can degrade performance or break your service without warning.
What to measure
A production-ready observability stack for LLM inference spans multiple layers. Here's an example breakdown:
Container & Deployment
Pod status: Detects failed, stuck, or restarting Pods before they affect availability.
Number of replicas: Verifies autoscaling behavior and helps troubleshoot scaling delays or limits.
App Performance
Requests per second (RPS): Measures incoming traffic and system load.
Request latency: Helps identify response delays and bottlenecks.
In-progress requests: Indicates concurrency pressure and reveals whether the app is keeping up with demand.
Error rate: Tracks failed or invalid responses; useful for SLA monitoring.
Queue wait time: Reveals delays caused by waiting for an available replica.
Cluster Resources
Resource quotas & limits: Tracks usage boundaries; helps tune requests/limits and avoid over- or under-provisioning.
LLM-Specific Metrics
Tokens per second: Reflects model throughput and performance efficiency.
Time to first token: Affects user-perceived latency; critical for streaming or chat-like experiences.
Total generation time: Measures end-to-end performance for full completions.
GPU Metrics
GPU utilization: Shows how busy your GPUs are; low values may signal underuse or poor batching.
GPU memory usage: Helps with capacity planning and avoiding OOM errors.
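For the GPU layer, here is a minimal sketch of how these numbers can be read with NVIDIA’s NVML bindings (the pynvml package); in practice you would more likely deploy a ready-made exporter such as NVIDIA DCGM, so treat this as an illustration of what the metrics mean.

```python
# Minimal sketch: reading GPU utilization and memory usage via NVML.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # sampled over a short window
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(
            f"GPU {i}: util={util.gpu}% "
            f"mem={mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB"
        )
finally:
    pynvml.nvmlShutdown()
```

Note the caveat from the scaling discussion: util.gpu only says that a kernel was active during the sample window, so pair it with throughput metrics such as tokens per second before drawing conclusions.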
Metrics tell you what is happening, but events and logs tell you why.
Events: Useful for tracking cluster activity like Pod restarts, scaling events, or scheduling delays.
Log aggregation: Centralized logs let you search across containers and time windows. This is vital for debugging request failures, identifying crashes, and tracing performance issues across services.