The Secret of Blackwell Is Not the Hardware: NVIDIA Reveals Where the True Advantage in AI Lies

As businesses gradually adopt AI, NVIDIA argues that the most important metric is no longer the theoretical power of GPUs alone, but the economic efficiency of inference. The benchmark becomes cost per token: how many tokens a system processes per dollar or euro invested and per watt consumed, while still meeting the latency requirements of the application.
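To make the metric concrete, here is a minimal Python sketch of the arithmetic. The function names, the hourly price, and the throughput and power figures are all hypothetical, chosen only to illustrate the calculation; none of them come from NVIDIA.

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Cost of processing one million tokens on a system billed by the hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency; one watt is one joule per second."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


# Hypothetical figures: same hardware and hourly price, higher throughput
# obtained through software updates alone.
baseline = cost_per_million_tokens(hourly_cost=36.0, tokens_per_second=10_000)
improved = cost_per_million_tokens(hourly_cost=36.0, tokens_per_second=50_000)
print(f"baseline: {baseline:.2f} per million tokens")
print(f"improved: {improved:.2f} per million tokens")
```

With the hourly price fixed, cost per token is inversely proportional to throughput, so a fivefold throughput gain on unchanged hardware cuts the cost per token to one fifth. A real comparison would also have to confirm that latency targets are still met at the higher throughput.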

According to NVIDIA, the result increasingly depends on the software that accompanies the hardware. The company claims that its full inference stack, developed alongside the GPUs, CPUs, NVLink interconnects, and Blackwell systems, has already cut the cost per token of the DeepSeek V4 model by up to five times in about a month, through continuous software updates alone and with no changes to the physical infrastructure.

NVIDIA stresses that modern AI agent applications differ sharply from traditional web services. While the latter handled relatively predictable requests to databases and backend services, AI agents can simultaneously orchestrate language models, external tools, memory, security components, and numerous distributed subprocesses. A single request can thus turn into hundreds of sub-tasks executed on different GPUs, CPUs, DPUs, and storage systems. In this scenario, NVIDIA argues, it is the software that determines how effectively the available resources are used and, consequently, the final cost of inference.
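The fan-out pattern described here can be illustrated with a small Python sketch using the standard asyncio library. The task kinds and the `run_subtask` and `handle_request` helpers are hypothetical stand-ins, not part of any NVIDIA software; a real agent would wait on model servers, tools, and storage rather than on a placeholder.

```python
import asyncio


async def run_subtask(kind: str, index: int) -> str:
    """Stand-in for one unit of work (model call, tool call, memory lookup)."""
    await asyncio.sleep(0)  # yields control; a real task would wait on I/O
    return f"{kind}-{index}"


async def handle_request(subtasks_per_kind: int) -> list[str]:
    """Fan one request out into many concurrent sub-tasks and gather results."""
    kinds = ("llm", "tool", "memory", "safety")
    tasks = [run_subtask(kind, i) for kind in kinds for i in range(subtasks_per_kind)]
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*tasks)


results = asyncio.run(handle_request(subtasks_per_kind=50))
print(len(results))  # 4 kinds x 50 sub-tasks = 200
```

Even in this toy version, how the sub-tasks are scheduled is decided entirely in software, which is the point NVIDIA makes about resource utilization.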

The company divides its software stack into three main levels:

Operational management of production, with orchestration, autoscaling, and memory allocation.
Application acceleration through runtime optimizations, kernel fusion, and overlap between computation and communication.
Direct access to the hardware functionalities of GPUs, network, and memory without requiring low-level management from developers.

The goal is to coordinate all these elements so that the improvements achieved at each level add up instead of remaining isolated.

NVIDIA asserts that several technologies introduced in recent months deliver significant benefits on their own but offer the greatest advantage when used together. Among these are disaggregated serving, Large Expert Parallelism leveraging the NVLink interconnect, NVFP4 numerical precision, and Multi-Token Prediction (MTP), a technique that predicts several tokens at once. According to the benchmarks presented by the company, combining these optimizations can increase throughput by up to twenty times compared to the base configuration, provided that the entire software stack correctly coordinates the runtime, communication libraries, kernels, and resource management.

NVIDIA backs its claims with examples from companies using Blackwell in production. Baseten, for instance, claims to have achieved up to 50% more tokens per second when running DeepSeek V4 Pro, thanks to TensorRT-LLM and further proprietary optimizations. DigitalOcean, working with Hippocratic AI in the healthcare sector, states that it has increased inference throughput by 30% while keeping response time under half a second across approximately ten million patient calls.

An important part of NVIDIA's strategy concerns integration with the open-source ecosystem. Many AI frameworks, including PyTorch, vLLM, and SGLang, are developed with native support for CUDA, allowing new inference techniques to immediately take advantage of NVIDIA's GPU hardware features.

The company cites, for example, DFlash for speculative decoding, which promises throughput increases of up to 15 times on existing hardware, and FastVideo, a technology capable of generating Full HD videos in less than five seconds. According to NVIDIA, integrating these innovations directly into the most widely used frameworks allows research results to move quickly into production deployments.
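For readers unfamiliar with speculative decoding, the following Python sketch shows the standard back-of-the-envelope model for why it raises throughput. It is a generic illustration, not a description of how DFlash works, and it assumes that each drafted token is accepted independently with the same probability; the function name and the sample values are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens produced per verification pass of the large model.

    Assumes each of the `draft_length` drafted tokens is accepted independently
    with probability `acceptance_rate`, and that the large model always adds
    one token of its own, so the result is 1 + a + a^2 + ... + a^draft_length.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_length + 1)
    # Closed form of the geometric series above.
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)


# Hypothetical values: without drafting, each pass yields exactly one token.
for rate in (0.5, 0.8, 0.9):
    print(rate, round(expected_tokens_per_step(rate, draft_length=7), 2))
```

The figure is an upper bound on the speedup, since it ignores the cost of producing the draft tokens. It does show why the gain depends so heavily on how often the drafts are accepted.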

For NVIDIA, therefore, the competitive advantage does not reside exclusively in the Blackwell hardware, but in the continuous evolution of the entire software stack and the open-source ecosystem, which makes it possible to increase the efficiency of so-called "AI factories" even after the infrastructure has been installed.