
Volume 32. Groq: LPU Architecture and the Redefinition of Inference Optimization Infrastructure

  • Dec 25, 2025

Executive Summary

Groq is a hardware startup that developed the Language Processing Unit (LPU), an inference-dedicated chip architecture. Founded in 2016 by Jonathan Ross, a former member of Google's TPU development team, the company adopted a deterministic execution model to address fundamental limitations of GPU-based inference. In September 2025, the company raised $750 million at a $6.9 billion valuation; it provides API access to LPU infrastructure through its cloud service, GroqCloud.


On December 24, 2025, Nvidia announced a non-exclusive licensing agreement with Groq. According to CNBC reports, the deal is valued at $20 billion, as confirmed by Alex Davis, CEO of Disruptive, Groq's largest investor. Groq stated in an official blog post that it had "entered into a non-exclusive licensing agreement with Nvidia for Groq's inference technology," without disclosing the price. As part of the agreement, Groq founder and CEO Jonathan Ross, Groq President Sunny Madra, and other senior leaders will join Nvidia to help advance and scale the licensed technology.


Nvidia CEO Jensen Huang stated in an email to employees: "We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture, extending the platform to serve an even broader range of AI inference and real-time workloads." Huang added, "While we are adding talented employees to our ranks and licensing Groq's IP, we are not acquiring Groq as a company."

Groq will continue to operate as an independent company with Simon Edwards stepping into the role of CEO. GroqCloud will continue to operate without interruption.


1. Groq's Technical Differentiation: LPU Architecture


1.1 Fundamental Differences Between GPU and LPU


GPU (Graphics Processing Unit)

  • Thousands of small cores for parallel processing

  • Non-deterministic execution: Thread scheduling determined dynamically at runtime

  • Memory hierarchy: L1/L2 cache, HBM (High Bandwidth Memory)

  • General-purpose architecture handling both training and inference


LPU (Language Processing Unit)

  • Deterministic execution model: All execution paths determined at compile time

  • Temporal architecture: Computing reorganized in the time dimension

  • Reduced external memory access: Maximizes use of on-chip SRAM

  • Inference-specific optimization: Specialized for Transformer models' sequential token generation


Groq's LPU is built on the Tensor Streaming Processor (TSP), which focuses on deterministic and predictable execution to enhance performance for language model inference tasks, unlike traditional CPU/GPU architectures.

Traditional processors such as CPUs and GPUs have many sources of non-determinism built into their design, such as memory hierarchy, interrupts, context switching, and dynamic instruction scheduling. Groq's LPU is designed to eliminate all sources of non-determinism, enabling the compiler to statically schedule the execution of each instruction along with the flow of data through the network of chips.
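To make that contrast concrete, here is a minimal, purely illustrative Python sketch of static scheduling: a toy "compiler" assigns every operation a fixed start cycle from known latencies and dependencies, so total latency is identical on every run. The op latencies and the tiny program are invented for illustration and do not reflect Groq's actual compiler or instruction set.

```python
# Toy illustration of compile-time static scheduling (not Groq's compiler).
# Every op is assigned a fixed start cycle before execution, so total latency
# is known exactly ahead of time; there is no runtime arbitration, caching
# effect, or dynamic reordering. All latencies and ops are invented.

OP_LATENCY = {"load": 4, "matmul": 8, "add": 1, "store": 2}  # assumed cycle counts

# A tiny dataflow program in topological order: (op_id, kind, dependencies)
PROGRAM = [
    ("w",   "load",   []),
    ("x",   "load",   []),
    ("y",   "matmul", ["w", "x"]),
    ("z",   "add",    ["y"]),
    ("out", "store",  ["z"]),
]
KIND = {op_id: kind for op_id, kind, _ in PROGRAM}

def static_schedule(program):
    """Assign each op a deterministic start cycle based only on its dependencies."""
    start = {}
    for op_id, _, deps in program:
        # An op issues as soon as all of its producers have finished.
        start[op_id] = max((start[d] + OP_LATENCY[KIND[d]] for d in deps), default=0)
    return start

if __name__ == "__main__":
    sched = static_schedule(PROGRAM)
    for op_id, cycle in sched.items():
        print(f"{op_id:>4} issues at cycle {cycle:2d}")
    finish = max(sched[i] + OP_LATENCY[KIND[i]] for i, _, _ in PROGRAM)
    print(f"total latency: {finish} cycles, identical on every run")
```

On a GPU, by contrast, the equivalent schedule emerges at runtime from warp schedulers, cache behavior, and contention, which is why per-request latency varies from run to run.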


1.2 Memory Architecture Innovation

Groq on-chip SRAM provides memory bandwidth upwards of 80 terabytes/second, while GPU off-chip HBM clocks in at about eight terabytes/second. That difference alone gives LPUs up to a 10X speed advantage, on top of the boost LPUs get from not having to go back and forth to a separate memory chip to retrieve data.

The LPU integrates hundreds of MB of SRAM as primary weight storage (not cache), cutting latency and feeding compute units at full speed. This enables efficient tensor parallelism across chips, a practical advantage for fast, scalable inference.
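A back-of-envelope calculation shows why bandwidth dominates here: in memory-bound autoregressive decoding, each generated token requires streaming roughly the full set of weights once, so tokens per second is bounded by bandwidth divided by weight bytes. The bandwidth figures below come from the paragraph above; the 70B-parameter FP16 model is an assumption for illustration.

```python
# Rough bound: tokens/sec <= memory bandwidth / bytes of weights streamed per
# token (memory-bound decode, single stream). Bandwidths are the figures cited
# above; the 70B-parameter FP16 model is an assumption.

WEIGHT_BYTES = 70e9 * 2            # assumed: 70B parameters at 2 bytes (FP16)
BANDWIDTHS = {
    "on-chip SRAM (LPU)": 80e12,   # ~80 TB/s, from the article
    "off-chip HBM (GPU)":  8e12,   # ~8 TB/s, from the article
}

for name, bw in BANDWIDTHS.items():
    bound = bw / WEIGHT_BYTES      # ignores compute, interconnect, and KV cache
    print(f"{name:>20}: <= {bound:6.1f} tokens/sec per full weight pass")
```

The ratio, not the absolute numbers, is the point: the roughly 10X bandwidth gap translates directly into the order-of-magnitude latency advantage claimed above.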


Hardware Specifications

  • GroqChip: 188 TeraFLOPS, 230MB of on-die memory

  • GroqNode: 1.5 PetaFlops, 1.76GB of on-die memory, 8x GroqChip

  • GroqRack: 12 PetaFlops, 14GB global SRAM, 8x GroqNode server (64+1 cards)
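These figures are roughly self-consistent, as the short check below shows. The assumption that a GroqRack aggregates 8 GroqNodes (64 chips) is inferred from the "8x" descriptions above, and the small gap versus the listed 1.76GB and 14GB totals presumably reflects rounding or reserved capacity.

```python
# Consistency check of the listed configurations. The 8-nodes-per-rack figure
# is an assumption inferred from the list above; small gaps versus the listed
# 1.76GB / 14GB totals presumably reflect rounding or reserved SRAM.
chip_sram_mb, chips_per_node, nodes_per_rack = 230, 8, 8

node_gb = chip_sram_mb * chips_per_node / 1000   # ~1.84 GB (listed: 1.76 GB)
rack_gb = node_gb * nodes_per_rack               # ~14.7 GB (listed: 14 GB)
rack_pflops = 1.5 * nodes_per_rack               # 12 PetaFLOPS, matching the list
print(f"node ~{node_gb:.2f} GB SRAM, rack ~{rack_gb:.1f} GB SRAM, rack {rack_pflops:.0f} PFLOPS")
```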


Computing Performance

  • 750 Tera Operations Per Second (TOPS) at INT8

  • 188 TeraFLOPS at FP16

  • 320×320 fused dot product matrix multiplication and 5,120 Vector ALUs


Memory Performance

  • 230MB of on-chip SRAM per chip

  • 80TB/s bandwidth

  • Minimized external memory access reduces power consumption


Groq's current chips are built on a 14-nanometer process. As Groq moves toward a 4-nanometer process, the company expects the performance advantages of the LPU architecture to increase further.


1.3 Software-First Design Principle


The Groq LPU architecture started with the principle of software-first. The objective was to make the software developer's job of maximizing hardware utilization easier and put as much control as possible in the developer's hands.


GPUs are versatile and powerful, but they are also complex, which shifts a heavy burden onto the software: it must account for variability in how a workload executes, within and across multiple chips, making runtime scheduling and maximizing hardware utilization much more challenging.


The Groq LPU was designed from the outset for linear algebra calculations—the primary requirement for AI inference. By limiting the focus to linear algebra compute and simplifying the multi-chip computation paradigm, Groq took a different approach to AI inference and chip design.

Software-first is not just a design principle; it is how Groq built its first-generation GroqChip processor. Chip design did not begin until the compiler's architecture was in place.


1.4 Energy Efficiency

On-chip SRAM (230MB per chip, 80TB/s bandwidth) minimizes data movement, reducing power consumption to 1-3 joules per token, up to 10x more efficient than GPUs.

According to Groq, Nvidia GPUs require approximately 10 to 30 joules (J) to generate each token, whereas Groq only needs 1 to 3 joules.
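A quick worked example translates these per-token figures into electricity use and cost. The joules-per-token values follow the claims above; the daily token volume and electricity price are illustrative assumptions.

```python
# Converting joules/token into electricity use and cost. The joules-per-token
# values follow the claims above; token volume and price are assumptions.
TOKENS_PER_DAY = 1e9          # assumption: one billion generated tokens per day
USD_PER_KWH = 0.10            # assumption: $0.10 per kWh

def daily(joules_per_token):
    kwh = joules_per_token * TOKENS_PER_DAY / 3.6e6   # 1 kWh = 3.6 million joules
    return kwh, kwh * USD_PER_KWH

for label, j in [("LPU, ~2 J/token", 2), ("GPU, ~20 J/token", 20)]:
    kwh, usd = daily(j)
    print(f"{label:>16}: {kwh:7,.0f} kWh/day  (~${usd:,.0f}/day)")
```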


1.5 Architectural Constraints

Groq LPU is not without its flaws, facing challenges in cost and versatility. The extensive clusters required to run large models incur high procurement and maintenance costs, while dedicated chips struggle to flexibly adapt to the rapidly evolving AI algorithms.

Since each Groq card has only 230MB of memory, running the Llama-2 70B model would require between 305 and 572 Groq cards, whereas only eight H100 cards would suffice.
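A rough calculation shows where a card count like that comes from: holding the weights alone in SRAM at one or two bytes per parameter already requires hundreds of 230MB cards. The precision choices below are assumptions.

```python
# Where a card count like this comes from: weights alone at 1-2 bytes per
# parameter versus 230 MB of SRAM per card. Precision choices are assumptions;
# KV cache, activations, and duplication for parallelism are ignored.
import math

PARAMS = 70e9
SRAM_PER_CARD = 230e6          # bytes

for label, bytes_per_param in [("~1 byte/param (FP8/INT8)", 1), ("2 bytes/param (FP16)", 2)]:
    cards = math.ceil(PARAMS * bytes_per_param / SRAM_PER_CARD)
    print(f"{label:>24}: ~{cards} cards for weights alone")
```

The low end reproduces the 305-card figure; the cited upper bound of 572 presumably reflects a mixed-precision layout rather than pure FP16.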


Groq is one of a number of upstarts that do not use external high-bandwidth memory chips, freeing them from the memory crunch affecting the global chip industry. The approach, which uses a form of on-chip memory called SRAM, helps speed up interactions with chatbots and other AI models but also limits the size of the model that can be served.


2. AI Inference Chip Market Competitor Comparison


2.1 Overview of Major Competitors


The AI inference market has been centered around Nvidia GPUs, but competition has intensified since the mid-2020s with the emergence of specialized chip manufacturers adopting inference-dedicated architectures. Each vendor has adopted different technical approaches and market strategies.

| Category | Groq LPU | Cerebras WSE-3 | SambaNova SN40L | Google TPU v5e |
| --- | --- | --- | --- | --- |
| Company Founded | 2016 | 2016 | 2017 | 2015 (internal use), 2018 (external) |
| Founder | Jonathan Ross (ex-Google TPU team) | Andrew Feldman (ex-SeaMicro CEO) | Kunle Olukotun (Stanford Professor) | Google |
| Architecture | Deterministic temporal computing | Wafer-scale integration | Reconfigurable Dataflow Unit (RDU) | Tensor Processing Unit (ASIC) |
| Chip Size | Standard chip size (14nm process) | 46,225 mm² (entire wafer) | Standard chip size (2 logic dies) | Standard chip size |
| Process Technology | Samsung 14nm | TSMC 5nm | TSMC 5nm | Google custom |
| Transistors | Undisclosed | 4 trillion | Undisclosed | Undisclosed |
| Core Count | Undisclosed | 900,000 AI-optimized cores | Undisclosed | 1 TensorCore/chip (v5e) |
| On-Chip Memory | 230MB SRAM/chip | 44GB SRAM | 520MiB SRAM | 16GB HBM/chip (v5e) |
| External Memory | None (SRAM only) | None (SRAM only) | 64GB HBM + up to 1.5TB DDR | HBM2e/HBM3 |
| Memory Bandwidth | 80TB/s | 21PB/s | >1TB/s (DDR→HBM) | 600GB/s (v5p) |
| Compute Performance | 188 TFLOPs (FP16), 750 TOPS (INT8) | 125 PFLOPs/chip | 688 TFLOPs (FP16) | 197 TFLOPs (BF16) |
| Power Consumption | 275W/card | Undisclosed (liquid cooling required) | 600W (estimated) | 200-250W/chip |
| Energy per Token | 1-3 joules | Undisclosed | Undisclosed | Undisclosed |
| System Configuration | GroqRack: 12 PFLOPs, 14GB global SRAM | CS-3: up to 2,048 systems, 256 ExaFLOPs | SN40L node: 16 chips/rack, 10.2 PFLOPs | TPU v5e pod: up to 256 chips, 100 PetaOps (INT8) |
| Cooling Method | Air-cooled | Liquid-cooled | Air-cooled | Air-cooled |
| Cluster Scalability | Multi-chip network | Wafer-Scale Cluster: up to 192 CS-2 | 8-socket node, peer-to-peer network | Multislice: tens of thousands of chips |
| Primary Optimization | Sequential token generation | Large-scale model training & inference | CoE (Composition of Experts) | Transformer model optimization |
| Supported Model Size | Llama 70B: 305-572 chips required | 24 trillion parameters (1 chip possible) | 5 trillion parameters (single node) | 200B parameters optimized (v5e) |

2.2 Performance Comparison: Llama 3.1 70B Inference


According to SambaNova's comparative analysis, all three vendors demonstrate significantly better performance than GPU-based solutions for Llama 3.1 70B model inference:

| Provider | Throughput (tokens/sec) | Chips / Racks | Precision |
| --- | --- | --- | --- |
| SambaNova | ~461 (70B), ~132 (405B) | 16 chips (1 rack) | Full FP16 |
| Groq | ~300-480 | 305-572 chips (9 racks) | Mixed FP16/FP8 |
| Cerebras | Similar performance | Multiple wafers | SRAM-based |

SambaNova's 70B inference configuration uses just 16 SN40L chips, combining tensor parallelism across chips with pipeline parallelism within each chip. Each SN40L chip consists of two logic dies, HBM, and direct-attached DDR DRAM. The 16 chips are interconnected with a peer-to-peer network and together offer a compute roofline of 10.2 BF16 PFLOPS.


Despite having 10X more dies, a 49X higher compute roofline, and all of the weights held in SRAM, Cerebras achieves performance similar to SambaNova's on Llama 3.1 70B, with SambaNova slightly ahead. Groq, meanwhile, needs 9X the rack space and 36X the chips, yet still runs 46% slower than SambaNova on the 70B model.
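Normalizing the cited single-request throughputs by chip count makes the density argument explicit. This is a coarse comparison that ignores batching, power, and cost, and it relies on SambaNova's published figures.

```python
# Coarse per-chip normalization of the cited single-request throughputs.
# Ignores batching, power, and cost; relies on SambaNova's published figures.
configs = [
    ("SambaNova SN40L", 461,  16),   # ~461 tok/s on 16 chips (1 rack)
    ("Groq LPU (high)", 480, 305),   # cited range: ~300-480 tok/s on 305-572 chips
    ("Groq LPU (low)",  300, 572),
]
for name, tok_per_s, chips in configs:
    print(f"{name:>16}: {tok_per_s / chips:5.2f} tokens/sec per chip")
```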


2.3 Google TPU's Differentiated Approach

Cloud TPU v5e provides up to 2.3X better price performance than the previous-generation TPU v4, making it Google's most cost-efficient TPU to date.


By contrast, Cloud TPU v5p is Google's most powerful TPU thus far. Each TPU v5p pod composes together 8,960 chips over the highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology. Compared to TPU v4, TPU v5p features more than 2X greater FLOPS and 3X more high-bandwidth memory (HBM).


Designed for performance, flexibility, and scale, TPU v5p can train large language models 2.8X faster than the previous-generation TPU v4. Moreover, with second-generation SparseCores, TPU v5p can train embedding-dense models 1.9X faster than TPU v4.


TPU v5e Key Specifications:

  • Each v5e chip contains one TensorCore. Each TensorCore has four matrix-multiply units (MXUs), a vector unit, and a scalar unit

  • Each TPU v5e chip provides up to 393 trillion int8 operations per second, allowing complex models to make fast predictions. A TPU v5e pod delivers up to 100 quadrillion int8 operations per second, or 100 petaOps of compute power

  • TPU v5e delivers up to 2x higher training performance per dollar and up to 2.5x inference performance per dollar for LLMs and gen AI models compared to Cloud TPU v4
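As a quick arithmetic check, the per-chip and pod-level INT8 figures listed above are consistent with a 256-chip pod:

```python
# Quick check: 256 chips x 393 trillion INT8 ops/sec per chip roughly equals
# the "100 petaOps" pod figure quoted above.
pod_ops = 256 * 393e12
print(f"{pod_ops / 1e15:.0f} petaOps per v5e pod")   # ~101 petaOps
```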


2.4 Cerebras' Unique Wafer-Scale Approach


Purpose built for training the industry's largest AI models, the 5nm-based, 4 trillion transistor WSE-3 powers the Cerebras CS-3 AI supercomputer, delivering 125 petaflops of peak AI performance through 900,000 AI optimized compute cores.


With a huge memory system of up to 1.2 petabytes, the CS-3 is designed to train next generation frontier models 10x larger than GPT-4 and Gemini.


In August 2024, Cerebras unveiled its AI inference service, claiming to be the fastest in the world and, in many cases, ten to twenty times faster than systems built using the dominant technology, Nvidia's H100 "Hopper" graphics processing unit, or GPU.


Square-shaped and measuring 21.5 centimeters on a side, the WSE-3 uses nearly an entire 300-millimeter silicon wafer to make one chip. Chipmaking equipment is typically limited to producing silicon dies of no more than about 800 square millimeters.

As many as 2,048 systems can be combined, a configuration Cerebras says could train the popular Llama 70B model from scratch in just one day.
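The wafer-scale figures are self-consistent: a 21.5 cm square works out to the 46,225 mm² die area cited in the comparison table, roughly 58 times the ~800 mm² practical limit for conventional dies.

```python
# Self-consistency of the wafer-scale figures: a 21.5 cm square die matches the
# 46,225 mm^2 area in the comparison table and is ~58x a conventional ~800 mm^2 die.
side_mm = 215
area_mm2 = side_mm ** 2
print(area_mm2, f"~{area_mm2 / 800:.0f}x a reticle-limited die")
```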


2.5 Pros and Cons of Each Architecture


Groq LPU


  • Advantages:

    • Low-latency inference (low TTFT)

    • High energy efficiency (1-3 joules/token)

    • Predictable performance through deterministic execution

    • Easy deployment with air-cooled systems

  • Disadvantages:

    • Limited on-chip memory (230MB) requires multiple chips for large models

    • Training not possible (inference only)

    • Lack of versatility


Cerebras WSE-3


  • Advantages:

    • Industry's largest single chip

    • Massive on-chip memory (44GB SRAM)

    • Supports 24 trillion parameter models

    • Supports both training and inference

  • Disadvantages:

    • Liquid cooling system required

    • High cost

    • Complex yield management due to wafer-scale

    • Large data center footprint required


SambaNova SN40L


  • Advantages:

    • Three-tier memory system (SRAM + HBM + DDR)

    • Flexible memory hierarchy

    • Maintains full FP16 precision

    • CoE architecture optimization

    • Air-cooled system

  • Disadvantages:

    • Proprietary platform (SambaNova Suite required)

    • Limited ecosystem

    • On-premises deployment complexity


Google TPU v5e


  • Advantages:

    • Excellent price-performance ratio

    • Google Cloud integration

    • Transformer model optimization

    • Massive scalability (Multislice)

    • PyTorch, JAX, TensorFlow support

  • Disadvantages:

    • Google Cloud dependency

    • No on-premises deployment

    • Different development environment from CUDA ecosystem


3. GroqCloud: Cloud Infrastructure Strategy


3.1 Service Structure

GroqCloud is an API-based inference service launched in public preview in early 2024.


API Interface

  • Provides OpenAI API-compatible endpoints

  • Supported models: Llama 3 (8B, 70B), Mixtral 8x7B, Gemma 7B, and other open-source models

  • JSON mode, function calling, streaming support
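Because the endpoints are OpenAI-compatible, existing OpenAI client code can usually be pointed at GroqCloud by changing only the base URL and API key. The sketch below assumes the openai Python package; the base URL and model id shown are illustrative and should be checked against Groq's current documentation.

```python
# Minimal sketch of calling an OpenAI-compatible endpoint such as GroqCloud's.
# The base URL and model id are illustrative assumptions; check Groq's current docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama3-70b-8192",                     # example open-model id; subject to change
    messages=[{"role": "user", "content": "Explain the LPU in one sentence."}],
    stream=True,                                 # streaming, as noted above
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```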


Performance Characteristics


Traditional accelerators often achieve speed through aggressive quantization, forcing models into INT8 or lower-precision numerics that introduce cumulative errors throughout the computation pipeline and degrade output quality. Groq's TruePoint numerics take a different approach, reducing precision only in places where doing so does not reduce accuracy.

The TruePoint format stores 100 bits of intermediate accumulation, providing sufficient range and precision to guarantee lossless accumulation regardless of input bit width.
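TruePoint itself is proprietary, but a generic NumPy experiment illustrates why wide accumulators matter: summing many FP16 products in an FP16 accumulator drifts from the reference result, while accumulating the same FP16 inputs in a wide format does not. This is an illustration of the general principle, not Groq's implementation.

```python
# Generic NumPy illustration (not TruePoint): sequential FP16 accumulation of
# FP16 products drifts from the reference, while a wide accumulator does not.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000).astype(np.float16)
b = rng.standard_normal(100_000).astype(np.float16)

acc16 = np.float16(0.0)
for prod in a * b:                 # FP16 accumulator: rounding error compounds
    acc16 = np.float16(acc16 + prod)

reference = float(np.sum(a.astype(np.float64) * b.astype(np.float64)))
print(f"FP16 accumulator : {float(acc16):10.2f}")
print(f"Wide accumulator : {reference:10.2f}")
```

The same inputs produce visibly different sums purely because of accumulator width, which is the failure mode a wide intermediate format is designed to remove.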


3.2 Customer Cases and Real-World Usage


Perplexity AI

  • March 2024 announcement: Perplexity AI uses Groq as its inference infrastructure


Developer Community

  • Groq powers AI apps for more than 2 million developers, up from about 356,000 a year earlier


Revenue Target

  • 2024 target: $500 million revenue

  • Sells chip access through the GroqCloud platform


4. Nvidia-Groq Transaction: Strategic Value Analysis


4.1 Transaction Structure and Background


Investment and Valuation History

Groq raised $750 million at a valuation of about $6.9 billion in September 2025. Investors in the round included BlackRock, Neuberger Berman, Samsung, Cisco, Altimeter and 1789 Capital.

Alex Davis's firm Disruptive has invested more than half a billion dollars in Groq since the company was founded in 2016.


Transaction Circumstances

Groq was not pursuing a sale when it was approached by Nvidia.

The deal is by far Nvidia's largest ever. The chipmaker's biggest outright acquisition to date came in 2019, when it agreed to buy Israeli chip designer Mellanox for close to $7 billion. At the end of October 2025, Nvidia held $60.6 billion in cash and short-term investments, up from $13.3 billion in early 2023.


Transaction Scope

The licensing agreement reportedly covers nearly all of Groq's assets, though the nascent GroqCloud business is not part of the transaction.


4.2 Similar Transaction Pattern

The deal follows a familiar pattern in recent years where the world's biggest technology firms pay large sums in deals with promising startups to take their technology and talent but stop short of formally acquiring the target.


Similar Cases:

  • Microsoft: Brought in its top AI executive, Mustafa Suleyman, through a roughly $650 million deal with Inflection AI that was billed largely as a licensing fee

  • Meta: Spent roughly $15 billion for a minority stake in Scale AI and hired its CEO, Alexandr Wang, without acquiring the entire firm

  • Amazon: Hired away founders from Adept AI

  • Nvidia: Did a similar deal in September 2025, shelling out over $900 million to hire Enfabrica CEO Rochan Sankar and other employees, and to license the company's technology


4.3 Antitrust Considerations


Bernstein analyst Stacy Rasgon's analysis: "Antitrust would seem to be the primary risk here, though structuring the deal as a non-exclusive license may keep the fiction of competition alive (even as Groq's leadership and, we would presume, technical talent move over to Nvidia)."


Rasgon added that "Nvidia CEO Jensen Huang's relationship with the Trump administration appears among the strongest of the key US tech companies."


4.4 Strategic Assets Nvidia Acquires


1. Strengthening Competitiveness in the Inference Market

Groq specializes in what is known as inference, where artificial intelligence models that have already been trained respond to requests from users. While Nvidia dominates the market for training AI models, it faces much more competition in inference, where traditional rivals such as Advanced Micro Devices have aimed to challenge it as well as startups such as Groq and Cerebras Systems.


Groq's closest rival in this approach is Cerebras Systems, which, Reuters reported in early December 2025, plans to go public as soon as next year. Both Groq and Cerebras have signed large deals in the Middle East.


Nvidia's Huang spent much of his biggest keynote speech of 2025 arguing that Nvidia would be able to maintain its lead as AI markets shift from training to inference.


2. Acquiring TPU Expertise

Groq was founded in 2016 by a group of former engineers, including Jonathan Ross, the company's CEO. Ross was one of the creators of Google's tensor processing unit, or TPU, the search giant's custom chip that some companies use as an alternative to Nvidia's graphics processing units.

In its initial filing with the SEC, announcing a $10.3 million fundraising in late 2016, Groq listed as principals Ross and Douglas Wightman, an entrepreneur and former engineer at the Google X "moonshot factory."


3. Low-Latency Inference Technology

Groq links together LPU-equipped servers into inference clusters using an internally developed interconnect called RealScale.


Processors use a clock to control the frequency at which their circuits carry out calculations; the clock is usually implemented with a tiny quartz crystal. Crystal-based clocks are subject to drift, which can cause the frequency to slow unexpectedly and introduce inefficiencies into AI inference workflows. Groq says that RealScale can automatically adjust processor clocks to mitigate the issue.


4. Acquiring Technical Talent

As part of this agreement, Jonathan Ross, Groq's Founder, Sunny Madra, Groq's President, and other members of the Groq team will join Nvidia to help advance and scale the licensed technology.


4.5 Nvidia's Integration Plan and Expected Synergies

Nvidia CEO Jensen Huang stated in an email to employees: "We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture, extending the platform to serve an even broader range of AI inference and real-time workloads."


Expected Synergy Effects:


1. Workload Distribution Optimization

  • Training: Leverage GPU strengths (massive parallel processing, high memory capacity)

  • Inference: Leverage LPU strengths (low latency, high throughput, low energy consumption)

  • Provide customers with optimized hardware choices for each use case


2. Real-Time Inference Enhancement

  • Chatbots and conversational AI: Immediate response generation

  • Voice assistants: Support natural conversation flow

  • Autonomous driving: Enable real-time decision making

  • Financial trading: Meet millisecond-level inference requirements


3. Energy Efficiency Improvement

  • 1-3 joules per token vs GPU's 10-30 joules

  • Reduced data center operating costs

  • Contributes to sustainability goals

  • Lower power consumption and cooling costs at scale deployment


4. Product Portfolio Completion

  • Addresses Nvidia's existing gap: overwhelming advantage in training, intensifying competition in inference

  • Provides complete AI pipeline: end-to-end solution from training to deployment

  • Responds to competitors: Addresses intensifying competition from inference-focused alternatives such as AMD accelerators, AWS Inferentia, and Google TPUs


5. Market Dominance Expansion

  • Proactive response at the point when AI market shifts from training-centric to inference-centric

  • Neutralizes inference specialist competitors like Groq and Cerebras

  • Enables comprehensive solutions for cloud providers (AWS, Azure, GCP)


6. Technology Fusion Possibilities

  • Groq's deterministic execution + Nvidia's CUDA ecosystem

  • Groq's compiler technology + Nvidia's TensorRT

  • Hybrid architecture: Train complex models on GPU, deploy on LPU


4.6 Market Context


Competitive Landscape


AI chipmaker Cerebras Systems had planned to go public in 2024 but withdrew its IPO filing in October 2025 after announcing that it had raised over $1 billion in a fundraising round. In a filing with the SEC, Cerebras said it does not intend to conduct a proposed offering "at this time," but did not provide a reason.


Nvidia's Aggressive Investment Activity

In September 2025, Nvidia said it intended to invest up to $100 billion in OpenAI, with the startup committing to deploy at least 10 gigawatts of Nvidia products. That same month, Nvidia said it would invest $5 billion in Intel as part of a partnership.


5. Conclusion: Significance of the Transaction


This deal is Nvidia's largest ever and marks approximately a 3x premium to Groq's September 2025 valuation of $6.9 billion.


What Nvidia Acquired Through This Deal:

  1. Strengthened dominance in the AI inference market - Extends its dominance of training into the increasingly competitive inference market


  2. Acquisition of core talent with Google TPU development experience - Jonathan Ross and core engineering team


  3. Integration of low-latency inference technology into its product lineup - Expansion of AI factory architecture


  4. Ability to provide comprehensive solutions in both training and inference - Established position as end-to-end AI infrastructure provider


  5. Energy-efficient inference technology - Reduced data center operating costs and improved sustainability


Unique Structure of the Deal:

Groq will continue to operate as an independent company and GroqCloud will continue to operate without interruption, making this a strategic partnership that combines technology licensing with talent acquisition rather than a full acquisition. This structure allows Nvidia to secure core technology and talent while limiting its exposure to antitrust review.


Industry Implications:

At a critical juncture when the AI market is shifting from training-centric to inference-centric, Nvidia has secured a preemptive advantage in next-generation AI infrastructure competition through this deal. Groq's LPU technology simultaneously meets the core requirements of real-time AI applications—low latency and high efficiency—and is expected to accelerate AI adoption across various domains including autonomous driving, robotics, real-time translation, and financial trading.

 
 
 