[Volume 32. Groq: LPU Architecture and the Redefinition of Inference Optimization Infrastructure]
- Dec 25, 2025
- 13 min read
Executive Summary
Groq is a hardware startup that developed the Language Processing Unit (LPU), a new inference-dedicated chip architecture. Founded in 2016 by Jonathan Ross, a former member of Google's TPU development team, the company adopted a deterministic execution model to address fundamental limitations of GPU-based inference. As of September 2025, the company had raised $750 million at a $6.9 billion valuation and provides API access to LPU infrastructure through its cloud service, GroqCloud.
On December 24, 2025, Nvidia announced a non-exclusive licensing agreement with Groq. According to CNBC reports, the deal is valued at $20 billion, as confirmed by Alex Davis, CEO of Disruptive, Groq's largest investor. Groq stated in an official blog post that it had "entered into a non-exclusive licensing agreement with Nvidia for Groq's inference technology," without disclosing the price. As part of the agreement, Groq founder and CEO Jonathan Ross, Groq President Sunny Madra, and other senior leaders will join Nvidia to help advance and scale the licensed technology.
Nvidia CEO Jensen Huang stated in an email to employees: "We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture, extending the platform to serve an even broader range of AI inference and real-time workloads." Huang added, "While we are adding talented employees to our ranks and licensing Groq's IP, we are not acquiring Groq as a company."
Groq will continue to operate as an independent company with Simon Edwards stepping into the role of CEO. GroqCloud will continue to operate without interruption.
1. Groq's Technical Differentiation: LPU Architecture
1.1 Fundamental Differences Between GPU and LPU
GPU (Graphics Processing Unit)
Thousands of small cores for parallel processing
Non-deterministic execution: Thread scheduling determined dynamically at runtime
Memory hierarchy: L1/L2 cache, HBM (High Bandwidth Memory)
General-purpose architecture handling both training and inference
LPU (Language Processing Unit)
Deterministic execution model: All execution paths determined at compile time
Temporal architecture: Computing reorganized in the time dimension
Reduced external memory access: Maximizes use of on-chip SRAM
Inference-specific optimization: Specialized for Transformer models' sequential token generation
Groq's LPU is built on the Tensor Streaming Processor (TSP), which, unlike traditional CPU/GPU architectures, focuses on deterministic and predictable execution to improve performance on language-model inference tasks.
Traditional processors such as CPUs and GPUs have many sources of non-determinism built into their design, such as memory hierarchy, interrupts, context switching, and dynamic instruction scheduling. Groq's LPU is designed to eliminate all sources of non-determinism, enabling the compiler to statically schedule the execution of each instruction along with the flow of data through the network of chips.
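To make the contrast concrete, the toy Python sketch below illustrates the difference between dynamic and static scheduling. It is purely conceptual (the operation names, latencies, and scheduler are invented for illustration and are not Groq's compiler or instruction set): in the dynamic model, total latency is only known after execution and varies run to run, while in the static model the compiler fixes the issue cycle of every operation, and therefore the exact completion time, before the program runs.

```python
# Conceptual illustration only (not Groq's compiler or ISA): dynamic vs.
# static scheduling of a small sequence of operations.
import random

OPS = ["load_weights", "matmul", "add_bias", "activation", "store"]

def dynamic_execution(ops):
    """GPU-like model: per-op latency varies at runtime (cache misses,
    contention, dynamic scheduling), so total latency is only known
    after the fact and differs from run to run."""
    cycle = 0
    for _ in ops:
        cycle += random.choice([1, 2, 4])  # nondeterministic latency
    return cycle

def static_schedule(ops):
    """LPU-like model: the 'compiler' assigns a fixed issue cycle to each
    op from known, fixed latencies, so the exact completion cycle is
    known before execution begins."""
    fixed_latency = {"load_weights": 1, "matmul": 4, "add_bias": 1,
                     "activation": 1, "store": 1}
    schedule, cycle = [], 0
    for op in ops:
        schedule.append((cycle, op))
        cycle += fixed_latency[op]
    return schedule, cycle

if __name__ == "__main__":
    print("dynamic run 1:", dynamic_execution(OPS), "cycles")
    print("dynamic run 2:", dynamic_execution(OPS), "cycles")  # likely differs
    schedule, total = static_schedule(OPS)
    print("static schedule:", schedule)
    print("static completion (known at compile time):", total, "cycles")
```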
1.2 Memory Architecture Innovation
Groq on-chip SRAM provides memory bandwidth upwards of 80 terabytes/second, while GPU off-chip HBM clocks in at about eight terabytes/second. That difference alone gives LPUs up to a 10X speed advantage, on top of the boost LPUs get from not having to go back and forth to a separate memory chip to retrieve data.
The LPU integrates hundreds of MB of SRAM as primary weight storage (not cache), cutting latency and feeding compute units at full speed. This enables efficient tensor parallelism across chips, a practical advantage for fast, scalable inference.
Hardware Specifications
GroqChip: 188 TeraFLOPS, 230MB of on-die memory
GroqNode: 1.5 PetaFlops, 1.76GB of on-die memory, 8x GroqChip
GroqRack: 12 PetaFlops, 14GB global SRAM, 8 GroqNode servers (64 cards) plus one redundant node
Computing Performance
750 Tera Operations Per Second (TOPS) at INT8
188 TeraFLOPS at FP16
320×320 fused dot product matrix multiplication and 5,120 Vector ALUs
Memory Performance
230MB of on-chip SRAM per chip
80TB/s bandwidth
Minimized external memory access reduces power consumption
Groq's current chips are built on a 14 nanometer process; the company expects the LPU architecture's performance advantages to grow further as it moves to a 4 nanometer process.
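A back-of-the-envelope way to see why these bandwidth figures matter: during autoregressive decoding at batch size 1, each generated token must stream the model's weights through the compute units, so memory bandwidth caps tokens per second. The sketch below applies that roofline-style bound using the 80TB/s and 8TB/s figures quoted above; the model size, weight precision, and the simplification of treating per-chip bandwidth as the effective system rate are illustrative assumptions, not Groq benchmarks.

```python
# Rough bandwidth-bound ceiling on decode throughput for a single sequence:
# every generated token reads all model weights once, so
#   max tokens/s ~= effective memory bandwidth / bytes of weights.
# Bandwidth figures are the ones quoted in this section; model size and
# precision are illustrative assumptions, and KV-cache traffic is ignored.

def max_tokens_per_sec(bandwidth_bytes_per_s, n_params, bytes_per_param):
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes

N_PARAMS = 70e9        # e.g. a 70B-parameter model
BYTES_PER_PARAM = 1.0  # assume 8-bit weights for illustration

sram_bw = 80e12  # ~80 TB/s on-chip SRAM
hbm_bw = 8e12    # ~8 TB/s off-chip HBM

print(f"SRAM-bound ceiling: {max_tokens_per_sec(sram_bw, N_PARAMS, BYTES_PER_PARAM):.0f} tokens/s")
print(f"HBM-bound ceiling:  {max_tokens_per_sec(hbm_bw, N_PARAMS, BYTES_PER_PARAM):.0f} tokens/s")
# The ~10x bandwidth ratio translates directly into a ~10x higher ceiling.
```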
1.3 Software-First Design Principle
The Groq LPU architecture started with the principle of software-first. The objective was to make the software developer's job of maximizing hardware utilization easier and put as much control as possible in the developer's hands.
GPUs are versatile and powerful, but they are also complex, putting extra burden on the software. It must account for variability in how a workload executes, within and across multiple chips, making scheduling runtime execution and maximizing hardware utilization much more challenging.
The Groq LPU was designed from the outset for linear algebra calculations—the primary requirement for AI inference. By limiting the focus to linear algebra compute and simplifying the multi-chip computation paradigm, Groq took a different approach to AI inference and chip design.
Software-first is not just a design principle; it is how Groq built its first-generation GroqChip processor: chip design did not begin until the compiler's architecture was in place.
1.4 Energy Efficiency
On-chip SRAM (230MB per chip, 80TB/s bandwidth) minimizes data movement, reducing power consumption to 1-3 joules per token, up to 10x more efficient than GPUs.
According to Groq, Nvidia GPUs require approximately 10 to 30 joules (J) to generate each token, whereas Groq only needs 1 to 3 joules.
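Taken at face value, those per-token figures translate into a sizeable difference in electricity per generated token. The short sketch below converts joules per token into kWh per billion tokens; the electricity price is an illustrative assumption, not a figure from Groq or Nvidia.

```python
# Convert the quoted joules-per-token figures into energy (and a rough
# electricity cost) per billion generated tokens. 1 kWh = 3.6e6 J.
JOULES_PER_KWH = 3.6e6
PRICE_PER_KWH_USD = 0.10  # assumed electricity price, for illustration only

def kwh_per_billion_tokens(joules_per_token):
    return joules_per_token * 1e9 / JOULES_PER_KWH

for label, joules in [("LPU, 1 J/token", 1), ("LPU, 3 J/token", 3),
                      ("GPU, 10 J/token", 10), ("GPU, 30 J/token", 30)]:
    kwh = kwh_per_billion_tokens(joules)
    print(f"{label}: {kwh:,.0f} kWh per 1B tokens (~${kwh * PRICE_PER_KWH_USD:,.0f})")
```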
1.5 Architectural Constraints
Groq LPU is not without its flaws, facing challenges in cost and versatility. The extensive clusters required to run large models incur high procurement and maintenance costs, while dedicated chips struggle to flexibly adapt to the rapidly evolving AI algorithms.
Since each Groq card has only 230MB of on-chip memory, running the Llama-2 70B model would require between 305 and 572 Groq cards, whereas eight H100 cards would suffice.
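The chip count follows from simple capacity arithmetic: with roughly 230MB of SRAM per card, the minimum number of cards is the model's weight footprint divided by per-chip capacity. The sketch below reproduces that lower bound; the 8-bit case lands at the low end of the quoted 305-572 range, while higher precision, activations, KV cache, and layout overheads push real deployments toward the high end.

```python
import math

# Lower bound on the number of LPU cards needed just to hold a model's
# weights entirely in on-chip SRAM (~230 MB per card, per this section).
SRAM_BYTES_PER_CARD = 230e6

def min_cards(n_params, bytes_per_param):
    return math.ceil(n_params * bytes_per_param / SRAM_BYTES_PER_CARD)

print("Llama-2 70B @ 8-bit weights :", min_cards(70e9, 1), "cards")  # ~305
print("Llama-2 70B @ 16-bit weights:", min_cards(70e9, 2), "cards")  # ~609
# The quoted 305-572 card range sits between these bounds, depending on
# precision, sharding, and how much SRAM is reserved for activations.
```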
Groq is one of a number of upstarts that do not use external high-bandwidth memory chips, freeing them from the memory crunch affecting the global chip industry. The approach, which uses a form of on-chip memory called SRAM, helps speed up interactions with chatbots and other AI models but also limits the size of the model that can be served.
2. AI Inference Chip Market Competitor Comparison
2.1 Overview of Major Competitors
The AI inference market has been centered around Nvidia GPUs, but competition has intensified since the mid-2020s with the emergence of specialized chip manufacturers adopting inference-dedicated architectures. Each vendor has adopted different technical approaches and market strategies.
Category | Groq LPU | Cerebras WSE-3 | SambaNova SN40L | Google TPU v5e |
--- | --- | --- | --- | --- |
Company Founded | 2016 | 2016 | 2017 | 2015 (internal use)<br>2018 (external) |
Founder | Jonathan Ross<br>(ex-Google TPU team) | Andrew Feldman<br>(ex-SeaMicro CEO) | Kunle Olukotun<br>(Stanford Professor) | |
Architecture | Deterministic<br>Temporal Computing | Wafer-Scale<br>Integration | Reconfigurable<br>Dataflow Unit (RDU) | Tensor Processing<br>Unit (ASIC) |
Chip Size | Standard chip size<br>(14nm process) | 46,225 mm²<br>(entire wafer) | Standard chip size<br>(2 logic dies) | Standard chip size |
Process Technology | Samsung 14nm | TSMC 5nm | TSMC 5nm | Google custom |
Transistors | Undisclosed | 4 trillion | Undisclosed | Undisclosed |
Core Count | Undisclosed | 900,000<br>AI-optimized cores | Undisclosed | 1 TensorCore/chip<br>(v5e) |
On-Chip Memory | 230MB SRAM/chip | 44GB SRAM | 520MiB SRAM | 16GB HBM/chip (v5e) |
External Memory | None<br>(SRAM only) | None<br>(SRAM only) | 64GB HBM<br>+ up to 1.5TB DDR | HBM2e/HBM3 |
Memory Bandwidth | 80TB/s | 21PB/s | >1TB/s<br>(DDR→HBM) | 600GB/s (v5p) |
Compute Performance | 188 TFLOPs (FP16)<br>750 TOPS (INT8) | 125 PFLOPs/chip | 688 TFLOPs (FP16) | 197 TFLOPs (BF16) |
Power Consumption | 275W/card | Undisclosed<br>(liquid cooling required) | 600W (estimated) | 200-250W/chip |
Energy per Token | 1-3 joules | Undisclosed | Undisclosed | Undisclosed |
System Configuration | GroqRack:<br>12 PFLOPs<br>14GB global SRAM | CS-3:<br>up to 2,048 systems<br>256 ExaFLOPs | SN40L Node:<br>16 chips/rack<br>10.2 PFLOPs | TPU v5e Pod:<br>up to 256 chips<br>100 PetaOps (INT8) |
Cooling Method | Air-cooled | Liquid-cooled | Air-cooled | Air-cooled |
Cluster Scalability | Multi-chip network | Wafer-Scale Cluster:<br>up to 192 CS-2 | 8-socket node<br>peer-to-peer network | Multislice:<br>tens of thousands of chips |
Primary Optimization | Sequential<br>token generation | Large-scale model<br>training & inference | CoE (Composition<br>of Experts) | Transformer<br>model optimization |
Supported Model<br>Size | Llama 70B:<br>305-572 chips required | 24 trillion parameters<br>(1 chip possible) | 5 trillion parameters<br>(single node) | 200B parameters<br>optimized (v5e) |
2.2 Performance Comparison: Llama 3.1 70B Inference
According to SambaNova's comparative analysis, all three vendors demonstrate significantly better performance than GPU-based solutions for Llama 3.1 70B model inference:
Provider | Throughput (tokens/sec) | Chips/Racks | Precision |
--- | --- | --- | --- |
SambaNova | ~461 (70B)<br>~132 (405B) | 16 chips (1 rack) | Full FP16 |
Groq | ~300-480 | 305-572 chips (9 racks) | Mixed FP16/FP8 |
Cerebras | ~Similar performance | Multiple wafers | SRAM-based |
SambaNova's 70B inference configuration uses just 16 SN40L chips, combining tensor parallelism across chips with pipeline parallelism within each chip. Each SN40L chip consists of two logic dies, HBM, and direct-attached DDR DRAM; the 16 chips are interconnected with a peer-to-peer network and together offer a compute roofline of 10.2 BF16 PFLOPS.
Despite having 10X more dies, a 49X higher compute roofline, and holding all the weights in SRAM, Cerebras achieves performance similar to SambaNova on Llama 3.1 70B, with SambaNova slightly ahead. Meanwhile, Groq needs 9X the rack space and 36X the chips, yet still runs 46% slower than SambaNova on 70B.
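A quick way to put these vendor-published figures on a common footing is to normalize throughput by chip count, as in the calculation below. The midpoints of the quoted ranges are used as inputs, and the result deliberately ignores differences in chip area, power, and cost, so treat it as a framing device rather than a benchmark.

```python
# Rough per-chip normalization of the Llama 3.1 70B throughput figures
# quoted above (midpoints are used where a range was given).
configs = {
    "SambaNova SN40L": {"tokens_per_s": 461, "chips": 16},
    "Groq LPU":        {"tokens_per_s": 390, "chips": 440},  # midpoints of 300-480 and 305-572
}

for name, cfg in configs.items():
    print(f"{name}: {cfg['tokens_per_s'] / cfg['chips']:.2f} tokens/s per chip")
# Per-chip numbers ignore chip size, power, and cost, so they are only a
# coarse lens on two very different system designs.
```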
2.3 Google TPU's Differentiated Approach
Google states that Cloud TPU v5e provides up to 2.3X better price-performance than the previous-generation TPU v4, making it the company's most cost-efficient TPU to date.
By contrast, Cloud TPU v5p is Google's most powerful TPU thus far. Each TPU v5p pod composes together 8,960 chips over the highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology. Compared to TPU v4, TPU v5p features more than 2X greater FLOPS and 3X more high-bandwidth memory (HBM).
Designed for performance, flexibility, and scale, TPU v5p can train large LLM models 2.8X faster than the previous-generation TPU v4. Moreover, with second-generation SparseCores, TPU v5p can train embedding-dense models 1.9X faster than TPU v4.
TPU v5e Key Specifications:
Each v5e chip contains one TensorCore. Each TensorCore has four matrix-multiply units (MXUs), a vector unit, and a scalar unit
Each TPU v5e chip provides up to 393 trillion int8 operations per second, allowing complex models to make fast predictions. A TPU v5e pod delivers up to 100 quadrillion int8 operations per second, or 100 petaOps of compute power (see the arithmetic check after this list)
TPU v5e delivers up to 2x higher training performance per dollar and up to 2.5x inference performance per dollar for LLMs and gen AI models compared to Cloud TPU v4
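The pod-level number is simply the per-chip figure multiplied across a 256-chip pod, as the one-line check below shows.

```python
# Sanity check: 256 chips x ~393 trillion int8 ops/s per chip
per_chip_int8_ops = 393e12
pod_chips = 256
print(f"{per_chip_int8_ops * pod_chips / 1e15:.1f} petaOps per v5e pod")  # ~100 petaOps, matching the quoted figure
```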
2.4 Cerebras' Unique Wafer-Scale Approach
Purpose built for training the industry's largest AI models, the 5nm-based, 4 trillion transistor WSE-3 powers the Cerebras CS-3 AI supercomputer, delivering 125 petaflops of peak AI performance through 900,000 AI optimized compute cores.
With a huge memory system of up to 1.2 petabytes, the CS-3 is designed to train next generation frontier models 10x larger than GPT-4 and Gemini.
In August 2024, Cerebras unveiled its AI inference service, claiming to be the fastest in the world and, in many cases, ten to twenty times faster than systems built using the dominant technology, Nvidia's H100 "Hopper" graphics processing unit, or GPU.
Square-shaped and 21.5 centimeters on a side, it uses nearly an entire 300-millimeter silicon wafer to make one chip; chipmaking equipment is typically limited to producing silicon dies of no more than about 800 square millimeters.
As many as 2,048 CS-3 systems can be combined, a configuration Cerebras says could train the popular Llama 70B LLM from scratch in just one day.
2.5 Pros and Cons of Each Architecture
Groq LPU
Advantages:
Low-latency inference (low TTFT)
High energy efficiency (1-3 joules/token)
Predictable performance through deterministic execution
Easy deployment with air-cooled systems
Disadvantages:
Limited on-chip memory (230MB) requires multiple chips for large models
Training not possible (inference only)
Lack of versatility
Cerebras WSE-3
Advantages:
Industry's largest single chip
Massive on-chip memory (44GB SRAM)
Supports 24 trillion parameter models
Supports both training and inference
Disadvantages:
Liquid cooling system required
High cost
Complex yield management due to wafer-scale
Large data center footprint required
SambaNova SN40L
Advantages:
Three-tier memory system (SRAM + HBM + DDR)
Flexible memory hierarchy
Maintains full FP16 precision
CoE architecture optimization
Air-cooled system
Disadvantages:
Proprietary platform (SambaNova Suite required)
Limited ecosystem
On-premises deployment complexity
Google TPU v5e
Advantages:
Excellent price-performance ratio
Google Cloud integration
Transformer model optimization
Massive scalability (Multislice)
PyTorch, JAX, TensorFlow support
Disadvantages:
Google Cloud dependency
No on-premises deployment
Different development environment from CUDA ecosystem
3. GroqCloud: Cloud Infrastructure Strategy
3.1 Service Structure
GroqCloud is an API-based inference service launched in public preview in early 2024.
API Interface
Provides OpenAI API-compatible endpoints (see the client sketch after this list)
Supported models: Llama 3 (8B, 70B), Mixtral 8x7B, Gemma 7B, and other open-source models
JSON mode, function calling, streaming support
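A minimal sketch of what the OpenAI-compatible interface looks like from client code, using the openai Python package pointed at Groq's endpoint. The base URL and model identifier below follow Groq's public documentation from the 2024 preview period, but treat both (and the GROQ_API_KEY environment variable) as assumptions to verify against the current GroqCloud docs.

```python
# Minimal sketch of calling GroqCloud through its OpenAI-compatible API.
# Base URL and model id follow Groq's public docs circa the 2024 preview;
# verify both against current documentation before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],         # GroqCloud API key
    base_url="https://api.groq.com/openai/v1",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="llama3-70b-8192",  # example model id from the preview era
    messages=[{"role": "user", "content": "In one sentence, what is an LPU?"}],
    stream=False,
)
print(response.choices[0].message.content)
```

Because the endpoint mirrors the OpenAI API shape, existing OpenAI-based client code typically only needs the base URL, API key, and model name swapped.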
Performance Characteristics
Traditional accelerators achieve speed through aggressive quantization, forcing models into INT8 or lower-precision numerics that introduce cumulative errors throughout the computation pipeline and degrade output quality. Groq's TruePoint numerics change this equation: TruePoint reduces precision only where doing so does not reduce accuracy.
TruePoint format stores 100 bits of intermediate accumulation - sufficient range and precision to guarantee lossless accumulation regardless of input bit width.
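The value of wide accumulators is easy to demonstrate numerically: summing many low-precision values in a low-precision accumulator loses small contributions once the running total grows, while accumulating the same inputs in a wider format does not. The numpy sketch below illustrates that general principle only; it is not Groq's TruePoint implementation, and the 100-bit accumulator has no direct numpy equivalent.

```python
import numpy as np

# Why accumulator width matters: once an FP16 running sum grows large,
# individual FP16 addends fall below its rounding step and are lost
# ("swamping"). A wide accumulator keeps every contribution.
rng = np.random.default_rng(0)
values = rng.uniform(0.0, 1.0, size=100_000).astype(np.float16)

acc16 = np.float16(0.0)
for v in values:
    acc16 = np.float16(acc16 + v)   # narrow accumulator: stalls around ~2048

acc64 = np.float64(0.0)
for v in values:
    acc64 += np.float64(v)          # wide accumulator: ~50,000 as expected

print("FP16 accumulator:", float(acc16))
print("FP64 accumulator:", float(acc64))
# Illustrates the general principle behind wide accumulation, not TruePoint itself.
```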
3.2 Customer Cases and Real-World Usage
Perplexity AI
March 2024 announcement: Perplexity AI uses Groq as its inference infrastructure
Developer Community
Groq powers AI apps for more than 2 million developers (up from about 356,000 last year)
Revenue Target
2024 target: $500 million revenue
Sells chip access through the GroqCloud platform
4. Nvidia-Groq Transaction: Strategic Value Analysis
4.1 Transaction Structure and Background
Investment and Valuation History
Groq raised $750 million at a valuation of about $6.9 billion in September 2025. Investors in the round included BlackRock, Neuberger Berman, Samsung, Cisco, Altimeter and 1789 Capital.
Alex Davis's firm Disruptive has invested more than half a billion dollars in Groq since the company was founded in 2016.
Transaction Circumstances
Groq was not pursuing a sale when it was approached by Nvidia.
The deal represents by far Nvidia's largest ever. The chipmaker's biggest acquisition to date came in 2019, when it agreed to buy Israeli chip designer Mellanox for close to $7 billion. At the end of October 2025, Nvidia had $60.6 billion in cash and short-term investments, up from $13.3 billion in early 2023.
Transaction Scope
Nvidia is getting all of Groq's assets, though Groq's nascent cloud business, GroqCloud, is not part of the transaction.
4.2 Similar Transaction Pattern
The deal follows a familiar pattern in recent years where the world's biggest technology firms pay large sums in deals with promising startups to take their technology and talent but stop short of formally acquiring the target.
Similar Cases:
Microsoft: Brought in its top AI executive through a roughly $650 million deal with the startup Inflection AI that was billed as a licensing fee
Meta: Spent $15 billion to hire Scale AI's CEO without acquiring the entire firm
Amazon: Hired away founders from Adept AI
Nvidia: Did a similar deal in September 2025, shelling out over $900 million to hire Enfabrica CEO Rochan Sankar and other employees, and to license the company's technology
4.3 Antitrust Considerations
Bernstein analyst Stacy Rasgon's analysis: "Antitrust would seem to be the primary risk here, though structuring the deal as a non-exclusive license may keep the fiction of competition alive (even as Groq's leadership and, we would presume, technical talent move over to Nvidia)."
Rasgon added that "Nvidia CEO Jensen Huang's relationship with the Trump administration appears among the strongest of the key US tech companies."
4.4 Strategic Assets Nvidia Acquires
1. Strengthening Competitiveness in the Inference Market
Groq specializes in what is known as inference, where artificial intelligence models that have already been trained respond to requests from users. While Nvidia dominates the market for training AI models, it faces much more competition in inference, where traditional rivals such as Advanced Micro Devices, as well as startups such as Groq and Cerebras Systems, have aimed to challenge it.
Groq's primary rival in this approach is Cerebras Systems, which Reuters reported in early December 2025 plans to go public as soon as next year. Both Groq and Cerebras have signed large deals in the Middle East.
Nvidia's Huang spent much of his biggest keynote speech of 2025 arguing that Nvidia would be able to maintain its lead as AI markets shift from training to inference.
2. Acquiring TPU Expertise
Groq was founded in 2016 by a group of former engineers, including Jonathan Ross, the company's CEO. Ross was one of the creators of Google's tensor processing unit, or TPU, the search giant's custom chip that's being used by some companies as an alternative to Nvidia's graphics processing units.
In its initial filing with the SEC, announcing a $10.3 million fundraising in late 2016, Groq listed as principals Ross and Douglas Wightman, an entrepreneur and former engineer at the Google X "moonshot factory."
3. Low-Latency Inference Technology
Groq links together LPU-equipped servers into inference clusters using an internally developed interconnect called RealScale.
Processors use a clock to control the frequency at which their circuits carry out calculations; the clock is usually implemented with a tiny quartz crystal. Crystal drift can cause the clock to unexpectedly slow its frequency, which introduces inefficiencies into AI inference workflows. Groq says that RealScale can automatically adjust processor clocks to mitigate the issue.
4. Acquiring Technical Talent
As part of this agreement, Jonathan Ross, Groq's Founder, Sunny Madra, Groq's President, and other members of the Groq team will join Nvidia to help advance and scale the licensed technology.
4.5 Nvidia's Integration Plan and Expected Synergies
Nvidia CEO Jensen Huang stated in an email to employees: "We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture, extending the platform to serve an even broader range of AI inference and real-time workloads."
Expected Synergy Effects:
1. Workload Distribution Optimization
Training: Leverage GPU strengths (massive parallel processing, high memory capacity)
Inference: Leverage LPU strengths (low latency, high throughput, low energy consumption)
Provide customers with optimized hardware choices for each use case
2. Real-Time Inference Enhancement
Chatbots and conversational AI: Immediate response generation
Voice assistants: Support natural conversation flow
Autonomous driving: Enable real-time decision making
Financial trading: Meet millisecond-level inference requirements
3. Energy Efficiency Improvement
1-3 joules per token vs GPU's 10-30 joules
Reduced data center operating costs
Contributes to sustainability goals
Lower power consumption and cooling costs at scale deployment
4. Product Portfolio Completion
Addresses Nvidia's existing gap: overwhelming advantage in training, intensifying competition in inference
Provides complete AI pipeline: end-to-end solution from training to deployment
Responds to competitors: Addresses intensifying competition from AMD accelerators and inference-focused chips such as AWS Inferentia and Google TPU
5. Market Dominance Expansion
Proactive response at the point when AI market shifts from training-centric to inference-centric
Neutralizes inference specialist competitors like Groq and Cerebras
Enables comprehensive solutions for cloud providers (AWS, Azure, GCP)
6. Technology Fusion Possibilities
Groq's deterministic execution + Nvidia's CUDA ecosystem
Groq's compiler technology + Nvidia's TensorRT
Hybrid architecture: Train complex models on GPU, deploy on LPU
4.6 Market Context
Competitive Landscape
AI chipmaker Cerebras Systems had planned to go public in 2024 but withdrew its IPO filing in October 2025 after announcing that it had raised over $1 billion in a fundraising round. In a filing with the SEC, Cerebras said it does not intend to conduct the proposed offering "at this time," but didn't provide a reason.
Nvidia's Aggressive Investment Activity
In September 2025, Nvidia said it intended to invest up to $100 billion in OpenAI, with the startup committed to deploying at least 10 gigawatts of Nvidia products. That same month, Nvidia said it would invest $5 billion in Intel as part of a partnership.
5. Conclusion: Significance of the Transaction
This deal represents Nvidia's largest ever and values the licensed technology at roughly three times Groq's September 2025 valuation of $6.9 billion.
What Nvidia Acquired Through This Deal:
Strengthened dominance in the AI inference market - extending its lead in training into the increasingly contested inference market
Acquisition of core talent with Google TPU development experience - Jonathan Ross and core engineering team
Integration of low-latency inference technology into its product lineup - Expansion of AI factory architecture
Ability to provide comprehensive solutions in both training and inference - Established position as end-to-end AI infrastructure provider
Energy-efficient inference technology - Reduced data center operating costs and improved sustainability
Unique Structure of the Deal:
Groq will continue to operate as an independent company and GroqCloud will continue to operate without interruption, making this a strategic partnership that combines technology licensing and talent acquisition rather than a complete acquisition. This structure allows Nvidia to secure core technology and talent while limiting its antitrust exposure.
Industry Implications:
At a critical juncture when the AI market is shifting from training-centric to inference-centric, Nvidia has secured a preemptive advantage in next-generation AI infrastructure competition through this deal. Groq's LPU technology simultaneously meets the core requirements of real-time AI applications—low latency and high efficiency—and is expected to accelerate AI adoption across various domains including autonomous driving, robotics, real-time translation, and financial trading.


