How open systems drive AI performance

An open-source philosophy and system-level optimizations prevent software and infrastructure nightmares in GenAI deployments, says CentML CTO Shang Wang

Partner content  “The magic isn’t just in the model, it’s in how you run it,” says Shang Wang, CTO of CentML. When Wang discusses large language model (LLM) performance, the dialog swiftly moves from market hype to technical heat maps, GPU optimization, network bottlenecks, and compiler intricacies. And if discussing compiler glitches and TensorRT error logs sounds dry, wait until Wang turns one of those logs into a punchline.

Open-source systems, including compilers, frameworks, runtimes, and orchestration infrastructure, are central to Wang’s vision, and the logic is straightforward. 

“You can’t imagine all the corner cases yourself. Ninety-nine times out of a hundred, when a TensorRT compilation blows up, it’s closed source, so you’re stuck,” Wang explains. “Open compilers survive because the community finds the weird stuff for you.” This philosophy of openness everywhere and optimization at every step drives CentML’s product lineup.

Hidet, CentML’s open-source ML compiler, feeds directly into CServe, the company’s serving engine built on vLLM, which in turn slots into its all-in-one AI infrastructure offering, the CentML Platform. The platform lets developers select any open model, such as Llama, Mistral, or DeepSeek, point it at any hardware, from NVIDIA H100s and AMD MI300Xs through to TPUs, and let the stack handle performance optimization and deployment.
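
To make the compiler piece concrete, here is a minimal sketch of using Hidet on its own as a torch.compile backend, following the pattern in Hidet’s public documentation. The toy model and shapes are placeholders rather than anything from CentML’s production pipeline, and a CUDA-capable GPU is assumed.

    import torch
    import hidet  # registers the "hidet" backend with TorchDynamo

    # Placeholder model; any PyTorch module can be compiled the same way.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda().eval()

    x = torch.randn(8, 1024, device="cuda")

    # Ask TorchDynamo to lower the graph through Hidet's kernel generator.
    compiled = torch.compile(model, backend="hidet")

    with torch.no_grad():
        y = compiled(x)  # first call triggers compilation; later calls reuse the kernels
    print(y.shape)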

One of Wang’s favorite practical examples of this approach involves optimizing and deploying AWQ-quantized DeepSeek R1 on the CentML Platform. 

“At the GPU-kernel level, through Hexcute, which is a DSL of the Hidet compiler, we built a fully fused GPU kernel for the entire MoE layer, which is a crucial part of DeepSeek R1,” he says.

“This sped up the MoE by 2x to 11x compared to the best alternatives out there implemented through the Triton compiler. Then, at the inference-engine level, we built EAGLE speculative decoding, which leverages a smaller draft model to reduce and help parallelize the work that the big original model has to do. That led to another 1.5-2x overall speedup,” he adds.
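
For readers unfamiliar with the draft-and-verify idea behind speculative decoding, the sketch below shows a plain greedy version of the general technique, with hypothetical draft_next and target_next_all callables standing in for the small and large models. It is a simplification, not CentML’s or EAGLE’s actual implementation; EAGLE in particular drafts from the big model’s hidden features rather than running a separate small LLM over tokens.

    from typing import Callable, List

    def speculative_decode(
        prompt: List[int],
        draft_next: Callable[[List[int]], int],             # small model: greedy next token
        target_next_all: Callable[[List[int]], List[int]],  # big model: greedy next token
                                                            # after every prefix, in one pass
        k: int = 4,
        max_new: int = 64,
    ) -> List[int]:
        out = list(prompt)
        while len(out) - len(prompt) < max_new:
            # 1. Cheaply draft k tokens with the small model.
            ctx, draft = list(out), []
            for _ in range(k):
                t = draft_next(ctx)
                draft.append(t)
                ctx.append(t)
            # 2. Verify the whole block with a single pass of the big model.
            preds = target_next_all(out + draft)
            accepted = 0
            for i, t in enumerate(draft):
                if preds[len(out) - 1 + i] == t:
                    accepted += 1
                else:
                    break
            out.extend(draft[:accepted])
            # 3. The same verification pass yields the next correct token for free,
            #    so every iteration advances by at least one token.
            out.append(preds[len(out) - 1])
        return out

The win comes from step 2: checking k drafted tokens costs roughly one forward pass of the big model, so whenever the cheap draft agrees with the target, several tokens are produced for the price of one.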

Wang then gives an example of how the CentML Platform empowers AI practitioners: “The entire model is now deployable on our platform, while the GPU provisioning, networking, autoscaling, fault tolerance, and all the optimizations I just mentioned are handled automatically for users behind the scenes.”

CentML’s research isn’t just about chasing academic acclaim; it’s laser-focused on solving real-world latency and infrastructure bottlenecks. Its recent Seesaw paper, set to be presented at MLSys 2025, describes an approach that dynamically switches parallelism strategies during inference to reduce network congestion. Running a Llama model distributed across eight NVIDIA L4 GPUs interconnected via standard PCIe, the team hit severe network overload with its initial tensor-parallel strategy during prefill, causing latency to spike dramatically.

The CentML team’s intuitive solution was highly effective: they maintained tensor parallelism for the memory-bandwidth-intensive decode stage but switched to pipeline parallelism during the compute-heavy prefill phase. “The moment we flipped strategies mid-inference, our throughput soared, and latency dropped sharply,” Wang proudly recalls.
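
A schematic of that phase-dependent switch might look like the sketch below, with the two execution paths passed in as hypothetical callables; this is an illustration of the idea, not Seesaw’s actual API, and it glosses over how the re-sharding cost is amortized.

    # Illustrative only: the three callables are hypothetical stand-ins for
    # the engine's pipeline-parallel prefill, the re-sharding step, and the
    # tensor-parallel decode path.
    def generate(prompt_tokens, max_new_tokens, eos_id,
                 prefill_pipeline_parallel,    # compute-bound prefill: each GPU owns a slice
                                               # of layers, only activations cross PCIe
                 reshard_for_tensor_parallel,  # repartition weights/KV cache between phases
                 decode_tensor_parallel):      # bandwidth-bound decode: weights sharded across
                                               # all GPUs, small all-reduces per token
        kv_cache = prefill_pipeline_parallel(prompt_tokens)
        kv_cache = reshard_for_tensor_parallel(kv_cache)

        tokens = []
        for _ in range(max_new_tokens):
            next_token = decode_tensor_parallel(kv_cache, tokens)
            tokens.append(next_token)
            if next_token == eos_id:
                break
        return tokens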

Though first prototyped in a research setting, these cutting-edge techniques will soon transition into CentML’s production-grade CServe inference engine at the heart of the CentML Platform. Wang elaborates: “Our research engineers pursue bold ideas aimed at cracking core problems. Once validated, they’re empowered to integrate these innovations directly into our products, enjoying firsthand the real-world impact. Not every experimental idea makes it to production immediately, but the most promising ones rapidly evolve into tangible performance enhancements.” 

This creates a virtuous feedback loop: user-reported edge cases strengthen the downstream software, inspire further academic research, and generate additional performance improvements. Just as CentML contributed its work on pipeline parallelism and EAGLE speculative decoding back to the vLLM library, these ideas and implementations will be contributed back as well, making them available to everyone through a straightforward pip install.

CentML offers users simple serverless endpoints for initial experimentation and seamless transitions to dedicated deployments, empowering users to own and control their entire stack. Whether spinning up Llama 4 on a preferred cloud provider or migrating to an on-premises infrastructure, the CentML ecosystem ensures stability, flexibility, and consistency without reliance on proprietary connectors.

There’s also a compelling economic and data-privacy argument behind CentML’s approach. Serverless API endpoint providers often tout access to premium GPUs and proprietary kernels, but Wang highlights a contrasting narrative: open models combined with superior yet accessible systems can deliver significantly better performance at dramatically lower cost. To be fair, inference requests carrying potentially sensitive information probably shouldn’t be sent to a serverless API endpoint shared among many users, which is why CentML offers dedicated deployments of these optimized models.

In an internal comparison, CentML engineers tested two identical chatbots: one used a Together.ai Llama 4 Maverick endpoint, the other ran on CentML’s optimized stack. The CentML version delivered higher token throughput and a much lower time to first token. “Same weights, same prompts, but different systems—and one dramatically lower AWS bill,” Wang notes.
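
For anyone who wants to run a similar comparison, a rough way to measure time to first token and streaming throughput against any OpenAI-compatible endpoint is sketched below. The base URL, API key, and model name are placeholders, not CentML’s or Together.ai’s actual values, and streamed chunks are only an approximation of tokens.

    import time
    from openai import OpenAI

    # Placeholder endpoint; point it at whichever OpenAI-compatible service
    # you want to measure.
    client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model="llama-4-maverick",  # placeholder model name
        messages=[{"role": "user", "content": "Explain PCIe vs NVLink in two sentences."}],
        stream=True,
        max_tokens=256,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    end = time.perf_counter()

    print(f"time to first token: {first_token_at - start:.3f}s")
    print(f"decode rate: {chunks / (end - first_token_at):.1f} chunks/s")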

Asked what keeps him awake at night, Wang bypasses industry hype to focus squarely on system bottlenecks: memory and interconnect bandwidth are much harder to scale than raw compute throughput. He continuously pushes AI workloads to use every last bit of the available hardware, and then some. That drive explains CentML’s aggressive innovation strategy, from parallelism switching to ongoing kernel optimization in Hidet, and even the resource-optimizing hardware picker embedded in the Platform.

For developers interested in experiencing CentML’s performance firsthand, Wang suggests trying out Llama 4 endpoints on their platform. Additionally, their Hidet and DeepView open-source projects are available on GitHub, where users can directly contribute by reporting edge cases or performance quirks. Wang and his team enthusiastically welcome these contributions.

In Wang’s words, “AI progress doesn’t hinge on one closed lab. The cat’s out of the bag, and the best optimizations are happening openly, collaboratively, and transparently.”

Sponsored by CentML