Analysis. Four events suggest that Nvidia is thinking about building mid-range AI inferencing systems. Bear with me as I describe the four and set them in an AI inferencing market landscape.
The first event, a bit of a blockbuster, was Nvidia spending $20,000,000,000, yes, $20B, to acqui-hire Groq’s senior talent and license its technology; in effect an acquisition that bypasses potential regulatory scrutiny. We first encountered Groq and its AI inference-focussed LPU (Language Processing Unit) two years ago. Founder and CEO Jonathan Ross told us its LPU technology at that point was 10x faster than an Nvidia GPU at 10 percent of the cost, while needing a tenth of the electricity.
Now Nvidia has gotten hold of its technology, which uses hardware-based parallel processing of multiple stages in AI inferencing pipelines with compiled workloads. Why? Groq said the agreement with Nvidia “reflects a shared focus on expanding access to high-performance, low cost inference.”
A Groq LPU system currently costs around $20,000 for an LPU in a double-width, PCIe Gen 4 card form factor, and there are eight-card and rack-scale cluster systems that head up into the $100,000-and-beyond range. There is also GroqCloud, with pay-per-use pricing that we think runs $0.05–$1.00 per million input tokens and $0.08–$3.00 per million output tokens.
All in all, low cost this is not.
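To get a rough sense of how that per-token pricing adds up, here is a minimal back-of-the-envelope sketch in Python. The rates are the mid-points of the assumed GroqCloud ranges above, and the workload figures are purely hypothetical, not measured.

```python
# Rough, illustrative cost model for pay-per-use inference pricing.
# The per-million-token rates are mid-points of the assumed GroqCloud ranges
# quoted above; the workload figures below are hypothetical examples.

INPUT_RATE_PER_M = 0.50   # $ per million input tokens (midpoint of $0.05-$1.00)
OUTPUT_RATE_PER_M = 1.50  # $ per million output tokens (midpoint of $0.08-$3.00)

def monthly_cost(requests_per_day, input_tokens, output_tokens, days=30):
    """Estimate a monthly bill for a steady chat-style workload."""
    total_in = requests_per_day * input_tokens * days
    total_out = requests_per_day * output_tokens * days
    return (total_in / 1e6) * INPUT_RATE_PER_M + (total_out / 1e6) * OUTPUT_RATE_PER_M

# Example: 100,000 requests a day, ~1,000 input and ~500 output tokens each.
print(f"${monthly_cost(100_000, 1_000, 500):,.2f} per month")
```

At that hypothetical load, the mid-range rates work out to a few thousand dollars a month, before any hardware enters the picture.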
Nvidia already has its GPU technology, which has just culminated in the clustered, rack-scale Vera Rubin system; this can do both training and inferencing, but it will cost millions of dollars. It has its Arm-based Grace processor, which complements Rubin GPUs for memory management and communications work.
It also has its DGX Spark supercomputer workstation, a GPU-in-a-box for the desktop offering powerful local AI model prototyping, development, fine-tuning, agent-building, and inferencing for $4,000 or so. This puts it in the specialized, high-end Dell desktop PC price range.
The point here is that there is a desktop AI inferencing need, and x86 CPUs are poor at inferencing work.
The second event was Jensen Huang’s CES 2026 presentation, during which he talked about three kinds of computer mattering to Nvidia in the robotics field: “This basic system requires three computers. One computer, of course, the one that we know that Nvidia builds for training the AI models. Another computer that we know is to inference the computer. Inference the models. Inferencing the model is essentially a robotics computer that runs in the car, runs in a robot or runs in a factory, runs anywhere at the edge, but there has to be another computer that’s designed for simulation. And simulation is at the heart of almost everything Nvidia does.”
Huang does not mention Groq here. But he does say inferencing basically runs everywhere.
The third event was Huang making a big deal out of using BlueField-4-connected NVMe SSD storage as a KV cache context memory extension. The idea is that inferencing jobs whose token counts blow past GPU HBM and CPU DRAM capacity hit a context memory wall; parking context data persistently on the SSDs sidesteps it. We understand persistent context for multi-turn AI agents improves responsiveness, increases AI factory throughput, and supports efficient scaling of long-context, multi-agent inference.
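As a rough illustration of the idea, and not Nvidia’s or BlueField-4’s actual interface, the following Python sketch tiers per-session KV-cache context between a fixed-size in-memory pool and SSD-backed files, spilling the least-recently-used session to disk and reloading it when a multi-turn agent comes back. The class, method names, and spill path are all hypothetical.

```python
import os
import pickle
from collections import OrderedDict

# Hypothetical illustration of KV-cache tiering: keep hot per-session context
# in fast memory, spill cold sessions to SSD-backed files, reload on demand.
# This is a conceptual sketch, not Nvidia's or BlueField-4's actual interface.

class TieredKVCache:
    def __init__(self, max_hot_sessions, spill_dir="/mnt/nvme/kv_cache"):
        self.max_hot_sessions = max_hot_sessions
        self.spill_dir = spill_dir
        self.hot = OrderedDict()  # session_id -> KV tensors held in fast memory
        os.makedirs(spill_dir, exist_ok=True)

    def _spill_path(self, session_id):
        return os.path.join(self.spill_dir, f"{session_id}.kv")

    def put(self, session_id, kv_tensors):
        self.hot[session_id] = kv_tensors
        self.hot.move_to_end(session_id)
        # Evict the least-recently-used session to SSD when fast memory is full.
        while len(self.hot) > self.max_hot_sessions:
            victim, victim_kv = self.hot.popitem(last=False)
            with open(self._spill_path(victim), "wb") as f:
                pickle.dump(victim_kv, f)

    def get(self, session_id):
        if session_id in self.hot:
            self.hot.move_to_end(session_id)
            return self.hot[session_id]
        # Miss in fast memory: reload the persisted context from SSD.
        path = self._spill_path(session_id)
        if os.path.exists(path):
            with open(path, "rb") as f:
                kv = pickle.load(f)
            self.put(session_id, kv)
            return kv
        return None  # no cached context; the model must re-process the prompt
```

The persistence pay-off is the second half of get: when the same agent session returns, its context comes back from the SSD rather than being recomputed from the full prompt.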
Nvidia’s storage partners see this as a big deal; AIC, Cloudian, DDN, Dell Technologies, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA are all on board.
This brings us to the fourth event, which is SSD controller and Pascari SSD company Phison enabling NVMe SSDs to be used, you guessed it, to expand GPU memory and accelerate AI inference, which it says will “significantly increase memory capacity and simplify deployment to unlock large-model AI capabilities on notebook PCs, desktop PCs, and mini-PCs.”
We can see there is an inferencing scale spectrum, from notebook PCs, desktop PCs, and mini-PCs at the bottom, through powerful workstations such as DGX Spark and Groq LPU accelerators, up to high-end Rubin CPX and Vera Rubin clustered rack systems.
What we are seeing, or inferring through a glass darkly, as one might say, is that, assuming LLM and agentic AI takes off, there is going to be a vast low-end to mid-range inferencing system market, and Groq’s LPU could be a great fit. The LPU doesn’t handle any storage, so there would be a need for a possible Inference Appliance combining a Groq LPU, a BlueField-4 processor to handle communications, security, and storage, some SSD capacity, maybe a host x86 processor, a screen, and so on. Envisage Nvidia producing a reference architecture blueprint for this, it being built by existing OEMs such as Dell and Lenovo, with links to external storage, and a cost profile in the $2,000-and-up area. Would it fly?
I think it might.
Update
During an interview at CES 2026, posted on X, Huang discussed Groq and the acquisition, saying: “The thing that I really like about Groq is, although we never saw ourselves competing, they built a very specialised version of a chip that uses integrated SRAM. It can’t scale up to the scale that we could scale. It, of course, can’t be used for training and post-training. But for low latency, high token rate generation, so long as it fits in the SRAM, it’s really quite interesting. And so I really love their low latency focus. Everything about their programming model, their architecture, their system architecture was designed for low latency. Now, the extreme capability for low latency token generation is also one of the reasons why they had a very hard time addressing the mainstream part of AI factories. But in combination with us, they don’t have to address that. They no longer have to address the segment of the market that quite frankly, where the scale is.
“And maybe with us, we can go explore the edges, the fringes of where AI factories could be someday. And so I really like the team, I really like Jonathan, really talented team. I love their conviction. I think their architecture has really unique strengths for the mainstream part of the market, some real challenges. But together, those aren’t their problems anymore. They could really focus on their strengths.”
Let’s highlight some of Huang’s words here: “we can go explore the edges, the fringes of where AI factories could be someday.” To me this says that it is at the edge, with inferencing workloads, that Nvidia sees Groq’s future. That’s Huang’s Groq bet.