Where is LLM inference run?

When deploying LLMs into production, choosing the right hardware is crucial. Different hardware types offer different trade-offs between performance and cost-efficiency. The four primary options are CPUs, GPUs, TPUs, and LPUs. Understanding their strengths and weaknesses helps you optimize your inference workloads effectively.

CPUs

Central Processing Units (CPUs) are general-purpose processors used in all computers and servers. CPUs are widely available and suitable for running small models or serving infrequent requests. However, they lack the parallel processing power to run LLMs efficiently. For production-grade LLM inference, especially with larger models or high request volumes, CPUs often fall short in both latency and throughput.
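For a sense of what CPU-only serving looks like, here is a minimal sketch using the Hugging Face transformers pipeline. The model name is only an illustrative small model, and `device=-1` pins inference to the CPU.

```python
# Minimal sketch of CPU-only inference with Hugging Face transformers.
# The model name is an example of a small model (assumption); swap in
# whatever fits your latency and memory budget.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example small model (assumption)
    device=-1,                            # -1 = run on CPU
)

output = generator(
    "Summarize why CPUs struggle with large LLMs:",
    max_new_tokens=64,
)
print(output[0]["generated_text"])
```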

GPUs

Graphics Processing Units (GPUs) were originally designed for graphics rendering and digital visualization tasks. Because they excel at highly parallel operations, they also turned out to be a great fit for ML and AI workloads. Today, GPUs are the default choice for both training and inference of generative AI models such as LLMs.

The architecture of GPUs is optimized for matrix multiplication and tensor operations, which are core components of transformer-based models. Modern inference frameworks and runtimes (e.g., vLLM, SGLang, LMDeploy, TensorRT-LLM, and Hugging Face TGI) are designed to take full advantage of GPU acceleration.
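As a concrete example, the following minimal sketch runs offline batched inference with vLLM on a GPU. The model name is illustrative, and any Hugging Face-compatible checkpoint that fits in GPU memory would work.

```python
# Minimal sketch of offline batched inference with vLLM on a GPU.
# The model name is illustrative (assumption), not a requirement.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain the difference between GPU and CPU inference in one sentence."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```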

TPUs

Tensor Processing Units (TPUs) are custom-built by Google to accelerate AI workloads like training and inference. Compared with GPUs, TPUs are designed from the ground up for tensor operations — the fundamental math behind neural networks. This specialization can make TPUs faster and more efficient than GPUs for many AI compute tasks, including LLM inference.

TPUs are behind some of the most advanced AI applications today: agents, recommendation systems and personalization, image, video and audio synthesis, and more. Google uses TPUs in Search, Photos, and Maps, and to power its Gemini and DeepMind models.
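To make the tensor-operation point concrete, here is a minimal JAX sketch of the scaled dot-product pattern at the heart of transformers. On a TPU VM, `jax.jit` lowers it to TPU code via XLA; the same script falls back to CPU or GPU elsewhere.

```python
# Minimal sketch of the dense tensor math TPUs accelerate, written with JAX.
# On a TPU host, jax.devices() reports TPU cores and jit-compiled functions
# are compiled for the TPU via XLA.
import jax
import jax.numpy as jnp

print(jax.devices())  # e.g. a list of TPU devices on a TPU VM

@jax.jit
def attention_scores(q, k):
    # Scaled dot-product scores: the core matmul pattern in transformers.
    return jnp.einsum("bqd,bkd->bqk", q, k) / jnp.sqrt(q.shape[-1])

key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (8, 128, 64))
k = jax.random.normal(key, (8, 128, 64))
print(attention_scores(q, k).shape)  # (8, 128, 128)
```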

LPUs

Language Processing Units (LPUs) are purpose-built processors designed by Groq to accelerate large language model (LLM) inference with very high speed and efficiency. Unlike GPUs, which were originally designed for graphics rendering and later adapted for AI workloads, LPUs are engineered from the ground up exclusively for the sequential token generation and matrix operations that characterize language model inference.

Architecture and Design Principles

The LPU's performance stems from four core architectural innovations:

1. Deterministic Execution LPUs eliminate sources of non-determinism found in traditional processors (such as cache misses, branch prediction, and dynamic scheduling). Every instruction executes with predictable timing down to the clock cycle, enabling the compiler to statically schedule all operations with perfect precision. This determinism eliminates bottlenecks and ensures consistent, predictable latency.

2. Software-First Philosophy While GPUs require complex, model-specific kernel programming to achieve optimal performance, LPUs use a model-independent compiler that takes complete control of hardware utilization. The hardware was designed after the compiler architecture was finalized—putting software truly first. This approach dramatically simplifies deployment and allows developers to maximize performance without hand-tuning kernels for each model variant.

3. On-Chip SRAM Memory LPUs integrate massive amounts of high-bandwidth SRAM (Static Random Access Memory) directly on the chip, delivering memory bandwidth exceeding 80 TB/s—roughly 10x faster than GPU off-chip HBM (High Bandwidth Memory) at ~8 TB/s. This eliminates the latency and energy overhead of shuttling data between separate memory chips, directly accelerating the memory-bound operations that dominate LLM inference.

4. Programmable Assembly Line Architecture The LPU features a unique "conveyor belt" design where data and instructions flow continuously through specialized functional units arranged in vertical slices (matrix operations, vector operations, memory access, etc.). This streaming architecture enables instruction execution within a chip and seamlessly between multiple chips with no synchronization overhead or resource contention—functioning like a perfectly orchestrated assembly line where every component knows exactly when data will arrive.

Performance Advantages

The architectural innovations translate into substantial real-world benefits:

  • Ultra-Low Latency: Token generation speeds measured in milliseconds rather than seconds, enabling real-time conversational AI experiences

  • Energy Efficiency: Up to 10x more energy-efficient than GPUs for LLM inference workloads, significantly reducing operational costs

  • Predictable Performance: Deterministic execution guarantees worst-case performance bounds, critical for production SLA requirements

  • Massive Scalability: Software-scheduled networking enables thousands of LPU chips to operate as a single coordinated system with minimal inter-chip communication overhead

Limitations and Considerations

While LPUs excel at LLM inference, they have specific constraints:

  • Specialized Workloads: Optimized specifically for transformer-based language models and sequential token generation; not suitable for general-purpose computing or graphics tasks

  • Limited Availability: Currently available primarily through Groq's cloud platform (GroqCloud), with limited access to on-premises hardware compared to widely available GPUs

  • Ecosystem Maturity: Smaller software ecosystem and community compared to mature GPU frameworks like CUDA, though major frameworks (PyTorch, TensorFlow) are supported

  • Static Compilation: The compiler must know the computation graph ahead of time, making LPUs less flexible for dynamic or highly variable workloads

Use Cases

LPUs are ideal for production scenarios where inference latency directly impacts user experience:

  • Real-time chatbots and conversational AI requiring instant responses

  • High-throughput API services handling thousands of concurrent requests

  • Voice assistants and interactive applications where sub-second latency is critical

  • Production environments prioritizing cost-per-token efficiency at scale
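For these use cases, LPU-backed inference is typically consumed as an API. Below is a minimal sketch using the Groq Python SDK; the model id is only an example, so check GroqCloud's model list for what is currently available.

```python
# Minimal sketch of calling an LPU-backed endpoint via the Groq Python SDK
# (pip install groq). The model id below is an example (assumption) and
# changes over time.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # example model id (assumption)
    messages=[
        {"role": "user", "content": "In one sentence, what makes LPU inference fast?"}
    ],
)
print(response.choices[0].message.content)
```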

In summary, while GPUs remain versatile workhorses for both training and inference, and TPUs offer Google-ecosystem optimization, LPUs represent purpose-built inference acceleration: they trade general-purpose flexibility for speed and efficiency in language model deployment. For organizations prioritizing inference performance and operating at scale, LPUs offer advantages that general-purpose hardware struggles to match.


Here is a side-by-side comparison:

|  | CPU | GPU | TPU | LPU |
| --- | --- | --- | --- | --- |
| Design purpose | General-purpose computing | Parallel computing for graphics and deep learning | Optimized for dense tensor operations | Optimized for LLM inference with sequential token generation |
| Core strength | Flexibility, handles many types of tasks | Large-scale parallelism, great for training and inference | Extreme efficiency for tensor operations | Ultra-low latency, deterministic execution for language models |
| Parallelism | Low | High | Very high | Extremely high (streaming architecture) |
| Best for | Branching logic, small workloads, classical apps | Training and serving LLMs, image models, video tasks | Large-scale training and high-throughput inference | Real-time LLM inference, conversational AI, production API services |
| Memory type | DRAM | GDDR / HBM | HBM | On-chip SRAM |
| Memory bandwidth | Low | High | Very high | Extremely high (80+ TB/s) |
| Latency | Low latency per core | Higher latency but amortized by parallelism | Low latency on matrix ops | Ultra-low latency for token generation |
| Power efficiency | Moderate | Moderate to high | Very high for ML workloads | Very high for LLM inference (up to 10x more efficient than GPU) |
| Software ecosystem | Mature, universal | CUDA, ROCm, PyTorch, TensorFlow | XLA, JAX, TensorFlow | Emerging; supports PyTorch, TensorFlow; Groq compiler |
| Cost | Low | Medium to high | High, available mainly through cloud | Medium to high, primarily cloud-based (GroqCloud) |
| Scalability | Limited for deep learning | Scales well across multi-GPU setups; LLMs face cold start problems | Strong scaling in TPU pods | Excellent scaling via software-scheduled networking across thousands of chips |
| Example use cases | Data preprocessing, backend services, running small local models | LLM training and inference | Large-batch training, production inference at Google | Real-time chatbots, low-latency API services, voice assistants, Groq inference platform |

Choosing the right hardware for your LLM inference

Selecting the appropriate hardware requires you to understand your model size, inference volume, latency requirements, cost constraints, and available infrastructure. GPUs remain the most popular choice due to their versatility and broad support; TPUs offer compelling advantages for large-scale workloads in the Google ecosystem; LPUs target ultra-low-latency LLM inference; and CPUs still have a place for lightweight, budget-conscious workloads.
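As a rough illustration of how these factors might be weighed, the sketch below encodes the trade-offs above as a simple heuristic. The thresholds and signature are assumptions for the sake of example, not recommendations.

```python
# Rough, illustrative heuristic for mapping requirements to hardware.
# All thresholds are assumptions for the sake of example.
def pick_hardware(model_params_b: float, requests_per_s: float,
                  latency_slo_ms: float, in_google_cloud: bool) -> str:
    if model_params_b < 1 and requests_per_s < 1:
        return "CPU"   # small model, infrequent traffic
    if latency_slo_ms < 200:
        return "LPU"   # ultra-low-latency, real-time responses
    if in_google_cloud and requests_per_s > 100:
        return "TPU"   # large-scale serving inside the Google ecosystem
    return "GPU"       # versatile default for most workloads

# Example: an 8B-parameter model, moderate traffic, relaxed latency budget.
print(pick_hardware(8, requests_per_s=50, latency_slo_ms=500, in_google_cloud=False))  # GPU
```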

Choosing the right deployment pattern

The deployment pattern shapes everything from latency and scalability to privacy and cost. Each pattern suits different operational needs for enterprises.

  • Cloud: The cloud is the most popular environment for LLM inference today. It offers on-demand access to high-performance GPUs and TPUs, along with a rich ecosystem of managed services, autoscaling, and monitoring tools.

  • Multi-cloud and cross-region: This flexible deployment strategy distributes LLM workloads across multiple cloud providers or geographic regions. It helps reduce latency for global users, improves GPU availability, optimizes compute costs, mitigates vendor lock-in, and supports compliance with data residency requirements.

  • Bring Your Own Cloud (BYOC): BYOC deployments let you run vendor software, such as an LLM inference platform, directly inside your own cloud account. This model combines managed orchestration with full data, network, and cost control. It's ideal for enterprises that need compliance, cost-efficiency, and scalability without full self-hosting.

  • On-Prem: An on-premises deployment means running LLM inference on your own infrastructure, typically within a private data center. It offers full control over data, performance, and compliance, but requires more operational overhead.

  • Edge: In edge deployments, the model runs directly on user devices or local edge nodes, closer to where data is generated. This reduces network latency and increases data privacy, especially for time-sensitive or offline use cases. Edge inference usually uses smaller, optimized models due to limited compute resources.
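As an illustration of the edge pattern, here is a minimal sketch using llama-cpp-python, which runs small quantized GGUF models on-device without a GPU. The model path and thread count are placeholders to adapt to the target device.

```python
# Minimal sketch of edge-style inference with llama-cpp-python
# (pip install llama-cpp-python). The model path is a placeholder (assumption).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/small-model-q4_k_m.gguf",  # placeholder quantized model
    n_ctx=2048,    # context window
    n_threads=4,   # tune to the device's CPU cores
)

result = llm("Q: Why run inference at the edge? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```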
