[Volume 32. Groq: LPU Architecture and the Redefinition of Inference Optimization Infrastructure]
- Dec 25, 2025
- 13 min read
Executive Summary
Groq is a hardware startup that developed the Language Processing Unit (LPU), a new inference-dedicated chip architecture. Founded in 2016 by Jonathan Ross, a former member of Google's TPU development team, the company adopted a deterministic execution model to address fundamental limitations of GPU-based inference. As of September 2025, the company had raised $750 million at a $6.9 billion valuation and provides API access to LPU infrastructure through its cloud service, GroqCloud.
On December 24, 2025, Nvidia announced a non-exclusive licensing agreement with Groq. According to CNBC reports, the deal is valued at $20 billion, as confirmed by Alex Davis, CEO of Disruptive, Groq's largest investor. Groq stated in an official blog post that it had "entered into a non-exclusive licensing agreement with Nvidia for Groq's inference technology," without disclosing the price. As part of the agreement, Groq founder and CEO Jonathan Ross, Groq President Sunny Madra, and other senior leaders will join Nvidia to help advance and scale the licensed technology.
Nvidia CEO Jensen Huang stated in an email to employees: "We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture, extending the platform to serve an even broader range of AI inference and real-time workloads." Huang added, "While we are adding talented employees to our ranks and licensing Groq's IP, we are not acquiring Groq as a company."
Groq will continue to operate as an independent company with Simon Edwards stepping into the role of CEO. GroqCloud will continue to operate without interruption.
1. Groq's Technical Differentiation: LPU Architecture
1.1 Fundamental Differences Between GPU and LPU
GPU (Graphics Processing Unit)
Thousands of small cores for parallel processing
Non-deterministic execution: Thread scheduling determined dynamically at runtime
Memory hierarchy: L1/L2 cache, HBM (High Bandwidth Memory)
General-purpose architecture handling both training and inference
LPU (Language Processing Unit)
Deterministic execution model: All execution paths determined at compile time
Temporal architecture: Computing reorganized in the time dimension
Reduced external memory access: Maximizes use of on-chip SRAM
Inference-specific optimization: Specialized for Transformer models' sequential token generation
Groq's LPU is built on the Tensor Streaming Processor (TSP), which, unlike traditional CPU/GPU architectures, focuses on deterministic and predictable execution to improve performance on language-model inference tasks.
Traditional processors such as CPUs and GPUs have many sources of non-determinism built into their design, such as memory hierarchy, interrupts, context switching, and dynamic instruction scheduling. Groq's LPU is designed to eliminate all sources of non-determinism, enabling the compiler to statically schedule the execution of each instruction along with the flow of data through the network of chips.
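To make the contrast concrete, the toy Python sketch below illustrates the difference between dynamic and static scheduling. It is purely conceptual (the operation names, latencies, and scheduler are invented for illustration and are not Groq's compiler or instruction set): in the dynamic model, total latency is only known after execution and varies run to run, while in the static model the compiler fixes the issue cycle of every operation, and therefore the exact completion time, before the program runs.

```python
# Conceptual illustration only (not Groq's compiler or ISA): dynamic vs.
# static scheduling of a small sequence of operations.
import random

OPS = ["load_weights", "matmul", "add_bias", "activation", "store"]

def dynamic_execution(ops):
    """GPU-like model: per-op latency varies at runtime (cache misses,
    contention, dynamic scheduling), so total latency is only known
    after the fact and differs from run to run."""
    cycle = 0
    for _ in ops:
        cycle += random.choice([1, 2, 4])  # nondeterministic latency
    return cycle

def static_schedule(ops):
    """LPU-like model: the 'compiler' assigns a fixed issue cycle to each
    op from known, fixed latencies, so the exact completion cycle is
    known before execution begins."""
    fixed_latency = {"load_weights": 1, "matmul": 4, "add_bias": 1,
                     "activation": 1, "store": 1}
    schedule, cycle = [], 0
    for op in ops:
        schedule.append((cycle, op))
        cycle += fixed_latency[op]
    return schedule, cycle

if __name__ == "__main__":
    print("dynamic run 1:", dynamic_execution(OPS), "cycles")
    print("dynamic run 2:", dynamic_execution(OPS), "cycles")  # likely differs
    schedule, total = static_schedule(OPS)
    print("static schedule:", schedule)
    print("static completion (known at compile time):", total, "cycles")
```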
1.2 Memory Architecture Innovation
Groq on-chip SRAM provides memory bandwidth upwards of 80 terabytes/second, while GPU off-chip HBM clocks in at about eight terabytes/second. That difference alone gives LPUs up to a 10X speed advantage, on top of the boost LPUs get from not having to go back and forth to a separate memory chip to retrieve data.
The LPU integrates hundreds of MB of SRAM as primary weight storage (not cache), cutting latency and feeding compute units at full speed. This enables efficient tensor parallelism across chips, a practical advantage for fast, scalable inference.
Hardware Specifications
GroqChip: 188 TeraFLOPS, 230MB of on-die memory
GroqNode: 1.5 PetaFlops, 1.76GB of on-die memory, 8x GroqChip
GroqRack: 12 PetaFlops, 14GB global SRAM, 8 GroqNode servers (64 cards) plus one redundant node
Computing Performance
750 Tera Operations Per Second (TOPS) at INT8
188 TeraFLOPS at FP16
320×320 fused dot product matrix multiplication and 5,120 Vector ALUs
Memory Performance
230MB of on-chip SRAM per chip
80TB/s bandwidth
Minimized external memory access reduces power consumption
Groq's current chips are built on a 14 nanometer process; the company expects the LPU architecture's performance advantages to grow further as it moves to a 4 nanometer process.
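A back-of-the-envelope way to see why these bandwidth figures matter: during autoregressive decoding at batch size 1, each generated token must stream the model's weights through the compute units, so memory bandwidth caps tokens per second. The sketch below applies that roofline-style bound using the 80TB/s and 8TB/s figures quoted above; the model size, weight precision, and the simplification of treating per-chip bandwidth as the effective system rate are illustrative assumptions, not Groq benchmarks.

```python
# Rough bandwidth-bound ceiling on decode throughput for a single sequence:
# every generated token reads all model weights once, so
#   max tokens/s ~= effective memory bandwidth / bytes of weights.
# Bandwidth figures are the ones quoted in this section; model size and
# precision are illustrative assumptions, and KV-cache traffic is ignored.

def max_tokens_per_sec(bandwidth_bytes_per_s, n_params, bytes_per_param):
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes

N_PARAMS = 70e9        # e.g. a 70B-parameter model
BYTES_PER_PARAM = 1.0  # assume 8-bit weights for illustration

sram_bw = 80e12  # ~80 TB/s on-chip SRAM
hbm_bw = 8e12    # ~8 TB/s off-chip HBM

print(f"SRAM-bound ceiling: {max_tokens_per_sec(sram_bw, N_PARAMS, BYTES_PER_PARAM):.0f} tokens/s")
print(f"HBM-bound ceiling:  {max_tokens_per_sec(hbm_bw, N_PARAMS, BYTES_PER_PARAM):.0f} tokens/s")
# The ~10x bandwidth ratio translates directly into a ~10x higher ceiling.
```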
1.3 Software-First Design Principle
The Groq LPU architecture started with the principle of software-first. The objective was to make the software developer's job of maximizing hardware utilization easier and put as much control as possible in the developer's hands.
GPUs are versatile and powerful, but they are also complex, putting extra burden on the software. It must account for variability in how a workload executes, within and across multiple chips, making scheduling runtime execution and maximizing hardware utilization much more challenging.
The Groq LPU was designed from the outset for linear algebra calculations—the primary requirement for AI inference. By limiting the focus to linear algebra compute and simplifying the multi-chip computation paradigm, Groq took a different approach to AI inference and chip design.
Software-first is not just a design principle; it is how Groq built its first-generation GroqChip processor: chip design did not begin until the compiler's architecture was in place.
1.4 Energy Efficiency
On-chip SRAM (230MB per chip, 80TB/s bandwidth) minimizes data movement, reducing power consumption to 1-3 joules per token, up to 10x more efficient than GPUs.
According to Groq, Nvidia GPUs require approximately 10 to 30 joules (J) to generate each token, whereas Groq only needs 1 to 3 joules.
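Taken at face value, those per-token figures translate into a sizeable difference in electricity per generated token. The short sketch below converts joules per token into kWh per billion tokens; the electricity price is an illustrative assumption, not a figure from Groq or Nvidia.

```python
# Convert the quoted joules-per-token figures into energy (and a rough
# electricity cost) per billion generated tokens. 1 kWh = 3.6e6 J.
JOULES_PER_KWH = 3.6e6
PRICE_PER_KWH_USD = 0.10  # assumed electricity price, for illustration only

def kwh_per_billion_tokens(joules_per_token):
    return joules_per_token * 1e9 / JOULES_PER_KWH

for label, joules in [("LPU, 1 J/token", 1), ("LPU, 3 J/token", 3),
                      ("GPU, 10 J/token", 10), ("GPU, 30 J/token", 30)]:
    kwh = kwh_per_billion_tokens(joules)
    print(f"{label}: {kwh:,.0f} kWh per 1B tokens (~${kwh * PRICE_PER_KWH_USD:,.0f})")
```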
1.5 Architectural Constraints
Groq LPU is not without its flaws, facing challenges in cost and versatility. The extensive clusters required to run large models incur high procurement and maintenance costs, while dedicated chips struggle to flexibly adapt to the rapidly evolving AI algorithms.
Since each Groq card has only 230MB of on-chip memory, running the Llama-2 70B model would require between 305 and 572 Groq cards, whereas eight H100 cards would suffice.
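The chip count follows from simple capacity arithmetic: with roughly 230MB of SRAM per card, the minimum number of cards is the model's weight footprint divided by per-chip capacity. The sketch below reproduces that lower bound; the 8-bit case lands at the low end of the quoted 305-572 range, while higher precision, activations, KV cache, and layout overheads push real deployments toward the high end.

```python
import math

# Lower bound on the number of LPU cards needed just to hold a model's
# weights entirely in on-chip SRAM (~230 MB per card, per this section).
SRAM_BYTES_PER_CARD = 230e6

def min_cards(n_params, bytes_per_param):
    return math.ceil(n_params * bytes_per_param / SRAM_BYTES_PER_CARD)

print("Llama-2 70B @ 8-bit weights :", min_cards(70e9, 1), "cards")  # ~305
print("Llama-2 70B @ 16-bit weights:", min_cards(70e9, 2), "cards")  # ~609
# The quoted 305-572 card range sits between these bounds, depending on
# precision, sharding, and how much SRAM is reserved for activations.
```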
Groq is one of a number of upstarts that do not use external high-bandwidth memory chips, freeing them from the memory crunch affecting the global chip industry. The approach, which uses a form of on-chip memory called SRAM, helps speed up interactions with chatbots and other AI models but also limits the size of the model that can be served.
2. AI Inference Chip Market Competitor Comparison
2.1 Overview of Major Competitors
The AI inference market has been centered around Nvidia GPUs, but competition has intensified since the mid-2020s with the emergence of specialized chip manufacturers adopting inference-dedicated architectures. Each vendor has adopted different technical approaches and market strategies.
Category | Groq LPU | Cerebras WSE-3 | SambaNova SN40L | Google TPU v5e |
--- | --- | --- | --- | --- |
Company Founded | 2016 | 2016 | 2017 | 2015 (internal use)<br>2018 (external) |
Founder | Jonathan Ross<br>(ex-Google TPU team) | Andrew Feldman<br>(ex-SeaMicro CEO) | Kunle Olukotun<br>(Stanford Professor) | |
Architecture | Deterministic<br>Temporal Computing | Wafer-Scale<br>Integration | Reconfigurable<br>Dataflow Unit (RDU) | Tensor Processing<br>Unit (ASIC) |
Chip Size | Standard chip size<br>(14nm process) | 46,225 mm²<br>(entire wafer) | Standard chip size<br>(2 logic dies) | Standard chip size |
Process Technology | Samsung 14nm | TSMC 5nm | TSMC 5nm | Google custom |
Transistors | Undisclosed | 4 trillion | Undisclosed | Undisclosed |
Core Count | Undisclosed | 900,000<br>AI-optimized cores | Undisclosed | 1 TensorCore/chip<br>(v5e) |
On-Chip Memory | 230MB SRAM/chip | 44GB SRAM | 520MiB SRAM | 16GB HBM/chip (v5e) |
External Memory | None<br>(SRAM only) | None<br>(SRAM only) | 64GB HBM<br>+ up to 1.5TB DDR | HBM2e/HBM3 |
Memory Bandwidth | 80TB/s | 21PB/s | >1TB/s<br>(DDR→HBM) | 600GB/s (v5p) |
Compute Performance | 188 TFLOPs (FP16)<br>750 TOPS (INT8) | 125 PFLOPs/chip | 688 TFLOPs (FP16) | 197 TFLOPs (BF16) |
Power Consumption | 275W/card | Undisclosed<br>(liquid cooling required) | 600W (estimated) | 200-250W/chip |
Energy per Token | 1-3 joules | Undisclosed | Undisclosed | Undisclosed |
System Configuration | GroqRack:<br>12 PFLOPs<br>14GB global SRAM | CS-3:<br>up to 2,048 systems<br>256 ExaFLOPs | SN40L Node:<br>16 chips/rack<br>10.2 PFLOPs | TPU v5e Pod:<br>up to 256 chips<br>100 PetaOps (INT8) |
Cooling Method | Air-cooled | Liquid-cooled | Air-cooled | Air-cooled |
Cluster Scalability | Multi-chip network | Wafer-Scale Cluster:<br>up to 192 CS-2 | 8-socket node<br>peer-to-peer network | Multislice:<br>tens of thousands of chips |
Primary Optimization | Sequential<br>token generation | Large-scale model<br>training & inference | CoE (Composition<br>of Experts) | Transformer<br>model optimization |
Supported Model<br>Size | Llama 70B:<br>305-572 chips required | 24 trillion parameters<br>(1 chip possible) | 5 trillion parameters<br>(single node) | 200B parameters<br>optimized (v5e) |
2.2 Performance Comparison: Llama 3.1 70B Inference
According to SambaNova's comparative analysis, all three vendors demonstrate significantly better performance than GPU-based solutions for Llama 3.1 70B model inference:
Provider | Throughput (tokens/sec) | Chips/Racks | Precision |
--- | --- | --- | --- |
SambaNova | ~461 (70B)<br>~132 (405B) | 16 chips (1 rack) | Full FP16 |
Groq | ~300-480 | 305-572 chips (9 racks) | Mixed FP16/FP8 |
Cerebras | ~Similar performance | Multiple wafers | SRAM-based |
SambaNova's 70B inference configuration uses just 16 SN40L chips, combining tensor parallelism across chips with pipeline parallelism within each chip. Each SN40L chip consists of two logic dies, HBM, and direct-attached DDR DRAM; the 16 chips are interconnected with a peer-to-peer network and together offer a compute roofline of 10.2 BF16 PFLOPS.
Despite having 10X more dies, a 49X higher compute roofline, and holding all the weights in SRAM, Cerebras achieves performance similar to SambaNova on Llama 3.1 70B, with SambaNova slightly ahead. Meanwhile, Groq needs 9X the rack space and 36X the chips, yet still runs 46% slower than SambaNova on 70B.
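A quick way to put these vendor-published figures on a common footing is to normalize throughput by chip count, as in the calculation below. The midpoints of the quoted ranges are used as inputs, and the result deliberately ignores differences in chip area, power, and cost, so treat it as a framing device rather than a benchmark.

```python
# Rough per-chip normalization of the Llama 3.1 70B throughput figures
# quoted above (midpoints are used where a range was given).
configs = {
    "SambaNova SN40L": {"tokens_per_s": 461, "chips": 16},
    "Groq LPU":        {"tokens_per_s": 390, "chips": 440},  # midpoints of 300-480 and 305-572
}

for name, cfg in configs.items():
    print(f"{name}: {cfg['tokens_per_s'] / cfg['chips']:.2f} tokens/s per chip")
# Per-chip numbers ignore chip size, power, and cost, so they are only a
# coarse lens on two very different system designs.
```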
2.3 Google TPU's Differentiated Approach
Google states that Cloud TPU v5e provides up to 2.3X better price-performance than the previous-generation TPU v4, making it the company's most cost-efficient TPU to date.
By contrast, Cloud TPU v5p is Google's most powerful TPU thus far. Each TPU v5p pod composes together 8,960 chips over the highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology. Compared to TPU v4, TPU v5p features more than 2X greater FLOPS and 3X more high-bandwidth memory (HBM).
Designed for performance, flexibility, and scale, TPU v5p can train large LLM models 2.8X faster than the previous-generation TPU v4. Moreover, with second-generation SparseCores, TPU v5p can train embedding-dense models 1.9X faster than TPU v4.
TPU v5e Key Specifications:
Each v5e chip contains one TensorCore. Each TensorCore has four matrix-multiply units (MXUs), a vector unit, and a scalar unit
Each TPU v5e chip provides up to 393 trillion int8 operations per second, allowing complex models to make fast predictions. A TPU v5e pod delivers up to 100 quadrillion int8 operations per second, or 100 petaOps of compute power (see the arithmetic check after this list)
TPU v5e delivers up to 2x higher training performance per dollar and up to 2.5x inference performance per dollar for LLMs and gen AI models compared to Cloud TPU v4
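The pod-level number is simply the per-chip figure multiplied across a 256-chip pod, as the one-line check below shows.

```python
# Sanity check: 256 chips x ~393 trillion int8 ops/s per chip
per_chip_int8_ops = 393e12
pod_chips = 256
print(f"{per_chip_int8_ops * pod_chips / 1e15:.1f} petaOps per v5e pod")  # ~100 petaOps, matching the quoted figure
```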
2.4 Cerebras' Unique Wafer-Scale Approach
Purpose built for training the industry's largest AI models, the 5nm-based, 4 trillion transistor WSE-3 powers the Cerebras CS-3 AI supercomputer, delivering 125 petaflops of peak AI performance through 900,000 AI optimized compute cores.
With a huge memory system of up to 1.2 petabytes, the CS-3 is designed to train next generation frontier models 10x larger than GPT-4 and Gemini.
In August 2024, Cerebras unveiled its AI inference service, claiming to be the fastest in the world and, in many cases, ten to twenty times faster than systems built using the dominant technology, Nvidia's H100 "Hopper" graphics processing unit, or GPU.
Square-shaped and 21.5 centimeters on a side, it uses nearly an entire 300-millimeter silicon wafer to make one chip; chipmaking equipment is typically limited to producing silicon dies of no more than about 800 square millimeters.
As many as 2,048 CS-3 systems can be combined, a configuration Cerebras says could train the popular Llama 70B LLM from scratch in just one day.
2.5 Pros and Cons of Each Architecture
Groq LPU
Advantages:
Low-latency inference (low TTFT)
High energy efficiency (1-3 joules/token)
Predictable performance through deterministic execution
Easy deployment with air-cooled systems
Disadvantages:
Limited on-chip memory (230MB) requires multiple chips for large models
Training not possible (inference only)
Lack of versatility
Cerebras WSE-3
Advantages:
Industry's largest single chip
Massive on-chip memory (44GB SRAM)
Supports 24 trillion parameter models
Supports both training and inference
Disadvantages:
Liquid cooling system required
High cost
Complex yield management due to wafer-scale
Large data center footprint required
SambaNova SN40L
Advantages:
Three-tier memory system (SRAM + HBM + DDR)
Flexible memory hierarchy
Maintains full FP16 precision
CoE architecture optimization
Air-cooled system
Disadvantages:
Proprietary platform (SambaNova Suite required)
Limited ecosystem
On-premises deployment complexity
Google TPU v5e
Advantages:
Excellent price-performance ratio
Google Cloud integration
Transformer model optimization
Massive scalability (Multislice)
PyTorch, JAX, TensorFlow support
Disadvantages:
Google Cloud dependency
No on-premises deployment
Different development environment from CUDA ecosystem
3. GroqCloud: Cloud Infrastructure Strategy
3.1 Service Structure
GroqCloud is an API-based inference service launched in public preview in early 2024.
API Interface
Provides OpenAI API-compatible endpoints (see the client sketch after this list)
Supported models: Llama 3 (8B, 70B), Mixtral 8x7B, Gemma 7B, and other open-source models
JSON mode, function calling, streaming support
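A minimal sketch of what the OpenAI-compatible interface looks like from client code, using the openai Python package pointed at Groq's endpoint. The base URL and model identifier below follow Groq's public documentation from the 2024 preview period, but treat both (and the GROQ_API_KEY environment variable) as assumptions to verify against the current GroqCloud docs.

```python
# Minimal sketch of calling GroqCloud through its OpenAI-compatible API.
# Base URL and model id follow Groq's public docs circa the 2024 preview;
# verify both against current documentation before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],         # GroqCloud API key
    base_url="https://api.groq.com/openai/v1",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="llama3-70b-8192",  # example model id from the preview era
    messages=[{"role": "user", "content": "In one sentence, what is an LPU?"}],
    stream=False,
)
print(response.choices[0].message.content)
```

Because the endpoint mirrors the OpenAI API shape, existing OpenAI-based client code typically only needs the base URL, API key, and model name swapped.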
Performance Characteristics
Traditional accelerators achieve speed through aggressive quantization, forcing models into INT8 or lower-precision numerics that introduce cumulative errors throughout the computation pipeline and degrade output quality. Groq's TruePoint numerics change this equation: TruePoint reduces precision only where doing so does not reduce accuracy.
TruePoint format stores 100 bits of intermediate accumulation - sufficient range and precision to guarantee lossless accumulation regardless of input bit width.
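The value of wide accumulators is easy to demonstrate numerically: summing many low-precision values in a low-precision accumulator loses small contributions once the running total grows, while accumulating the same inputs in a wider format does not. The numpy sketch below illustrates that general principle only; it is not Groq's TruePoint implementation, and the 100-bit accumulator has no direct numpy equivalent.

```python
import numpy as np

# Why accumulator width matters: once an FP16 running sum grows large,
# individual FP16 addends fall below its rounding step and are lost
# ("swamping"). A wide accumulator keeps every contribution.
rng = np.random.default_rng(0)
values = rng.uniform(0.0, 1.0, size=100_000).astype(np.float16)

acc16 = np.float16(0.0)
for v in values:
    acc16 = np.float16(acc16 + v)   # narrow accumulator: stalls around ~2048

acc64 = np.float64(0.0)
for v in values:
    acc64 += np.float64(v)          # wide accumulator: ~50,000 as expected

print("FP16 accumulator:", float(acc16))
print("FP64 accumulator:", float(acc64))
# Illustrates the general principle behind wide accumulation, not TruePoint itself.
```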
3.2 Customer Cases and Real-World Usage
Perplexity AI
March 2024 announcement: Perplexity AI uses Groq as its inference infrastructure
Developer Community
Groq powers AI apps for more than 2 million developers (up from about 356,000 last year)
Revenue Target
2024 target: $500 million revenue
Sells chip access through the GroqCloud platform
4. Nvidia-Groq Transaction: Strategic Value Analysis
4.1 Transaction Structure and Background
Investment and Valuation History
Groq raised $750 million at a valuation of about $6.9 billion in September 2025. Investors in the round included BlackRock, Neuberger Berman, Samsung, Cisco, Altimeter and 1789 Capital.
Alex Davis's firm Disruptive has invested more than half a billion dollars in Groq since the company was founded in 2016.
Transaction Circumstances
Groq was not pursuing a sale when it was approached by Nvidia.
The deal represents by far Nvidia's largest ever. The chipmaker's biggest acquisition to date came in 2019, when it agreed to buy Israeli chip designer Mellanox for close to $7 billion. At the end of October 2025, Nvidia had $60.6 billion in cash and short-term investments, up from $13.3 billion in early 2023.
Transaction Scope
Nvidia is getting all of Groq's assets, though Groq's nascent cloud business, GroqCloud, is not part of the transaction.
4.2 Similar Transaction Pattern
The deal follows a familiar pattern in recent years where the world's biggest technology firms pay large sums in deals with promising startups to take their technology and talent but stop short of formally acquiring the target.
Similar Cases:
Microsoft: Brought in its top AI executive through a roughly $650 million deal with the startup Inflection AI that was billed as a licensing fee
Meta: Spent $15 billion to hire Scale AI's CEO without acquiring the entire firm
Amazon: Hired away founders from Adept AI
Nvidia: Did a similar deal in September 2025, shelling out over $900 million to hire Enfabrica CEO Rochan Sankar and other employees, and to license the company's technology
4.3 Antitrust Considerations
Bernstein analyst Stacy Rasgon's analysis: "Antitrust would seem to be the primary risk here, though structuring the deal as a non-exclusive license may keep the fiction of competition alive (even as Groq's leadership and, we would presume, technical talent move over to Nvidia)."
Rasgon added that "Nvidia CEO Jensen Huang's relationship with the Trump administration appears among the strongest of the key US tech companies."
4.4 Strategic Assets Nvidia Acquires
1. Strengthening Competitiveness in the Inference Market
Groq specializes in what is known as inference, where artificial intelligence models that have already been trained respond to requests from users. While Nvidia dominates the market for training AI models, it faces much more competition in inference, where traditional rivals such as Advanced Micro Devices, as well as startups such as Groq and Cerebras Systems, have aimed to challenge it.
Groq's primary rival in this approach is Cerebras Systems, which Reuters reported in early December 2025 plans to go public as soon as next year. Both Groq and Cerebras have signed large deals in the Middle East.
Nvidia's Huang spent much of his biggest keynote speech of 2025 arguing that Nvidia would be able to maintain its lead as AI markets shift from training to inference.
2. Acquiring TPU Expertise
Groq was founded in 2016 by a group of former engineers, including Jonathan Ross, the company's CEO. Ross was one of the creators of Google's tensor processing unit, or TPU, the search giant's custom chip that's being used by some companies as an alternative to Nvidia's graphics processing units.
In its initial filing with the SEC, announcing a $10.3 million fundraising in late 2016, Groq listed as principals Ross and Douglas Wightman, an entrepreneur and former engineer at the Google X "moonshot factory."
3. Low-Latency Inference Technology
Groq links together LPU-equipped servers into inference clusters using an internally developed interconnect called RealScale.
Processors use a clock to control the frequency at which their circuits carry out calculations; the clock is usually implemented with a tiny quartz crystal. Crystal drift can cause the clock to unexpectedly slow its frequency, which introduces inefficiencies into AI inference workflows. Groq says that RealScale can automatically adjust processor clocks to mitigate the issue.
4. Acquiring Technical Talent
As part of this agreement, Jonathan Ross, Groq's Founder, Sunny Madra, Groq's President, and other members of the Groq team will join Nvidia to help advance and scale the licensed technology.
4.5 Nvidia's Integration Plan and Expected Synergies
Nvidia CEO Jensen Huang stated in an email to employees: "We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture, extending the platform to serve an even broader range of AI inference and real-time workloads."
Expected Synergy Effects:
1. Workload Distribution Optimization
Training: Leverage GPU strengths (massive parallel processing, high memory capacity)
Inference: Leverage LPU strengths (low latency, high throughput, low energy consumption)
Provide customers with optimized hardware choices for each use case
2. Real-Time Inference Enhancement
Chatbots and conversational AI: Immediate response generation
Voice assistants: Support natural conversation flow
Autonomous driving: Enable real-time decision making
Financial trading: Meet millisecond-level inference requirements
3. Energy Efficiency Improvement
1-3 joules per token vs GPU's 10-30 joules
Reduced data center operating costs
Contributes to sustainability goals
Lower power consumption and cooling costs at scale deployment
4. Product Portfolio Completion
Addresses Nvidia's existing gap: overwhelming advantage in training, intensifying competition in inference
Provides complete AI pipeline: end-to-end solution from training to deployment
Responds to competitors: Addresses intensifying competition from AMD accelerators and inference-focused chips such as AWS Inferentia and Google TPU
5. Market Dominance Expansion
Proactive response at the point when AI market shifts from training-centric to inference-centric
Neutralizes inference specialist competitors like Groq and Cerebras
Enables comprehensive solutions for cloud providers (AWS, Azure, GCP)
6. Technology Fusion Possibilities
Groq's deterministic execution + Nvidia's CUDA ecosystem
Groq's compiler technology + Nvidia's TensorRT
Hybrid architecture: Train complex models on GPU, deploy on LPU
4.6 Market Context
Competitive Landscape
AI chipmaker Cerebras Systems had planned to go public in 2024 but withdrew its IPO filing in October 2025 after announcing that it had raised over $1 billion in a fundraising round. In a filing with the SEC, Cerebras said it does not intend to conduct the proposed offering "at this time," but didn't provide a reason.
Nvidia's Aggressive Investment Activity
In September 2025, Nvidia said it intended to invest up to $100 billion in OpenAI, with the startup committed to deploying at least 10 gigawatts of Nvidia products. That same month, Nvidia said it would invest $5 billion in Intel as part of a partnership.
5. Conclusion: Significance of the Transaction
This deal represents Nvidia's largest ever and values the licensed technology at roughly three times Groq's September 2025 valuation of $6.9 billion.
What Nvidia Acquired Through This Deal:
Strengthened dominance in the AI inference market - extending its lead in training into the increasingly contested inference market
Acquisition of core talent with Google TPU development experience - Jonathan Ross and core engineering team
Integration of low-latency inference technology into its product lineup - Expansion of AI factory architecture
Ability to provide comprehensive solutions in both training and inference - Established position as end-to-end AI infrastructure provider
Energy-efficient inference technology - Reduced data center operating costs and improved sustainability
Unique Structure of the Deal:
Groq will continue to operate as an independent company and GroqCloud will continue to operate without interruption, making this a strategic partnership that combines technology licensing and talent acquisition rather than a complete acquisition. This structure allows Nvidia to secure core technology and talent while limiting its antitrust exposure.
Industry Implications:
At a critical juncture when the AI market is shifting from training-centric to inference-centric, Nvidia has secured a preemptive advantage in next-generation AI infrastructure competition through this deal. Groq's LPU technology simultaneously meets the core requirements of real-time AI applications—low latency and high efficiency—and is expected to accelerate AI adoption across various domains including autonomous driving, robotics, real-time translation, and financial trading.


