A Mathematical Theory Changes AWS Data Centers: Up to 33% More Speed
Amazon Web Services has detailed Resilient Network Graphs (RNG), a new network architecture for data centers that replaces traditional hierarchical topologies with a "flat" model based on random graph theory. According to the company, this is the first large-scale implementation of this approach in hyperscale environments and represents a significant shift in how data is routed within cloud infrastructures.
The new architecture has already been put into production: the first deployment occurred in 2024 at an AWS data center in Dublin, followed by expansions in Germany and Spain. Amazon claims that RNG is now the default network for most AWS workloads and will serve as the foundation for future installations.
For decades, large data centers have utilized hierarchical topologies known as fat-trees, where routers and switches are organized in successive layers. While this scheme has proven reliable and easy to manage, it has some structural limitations: traffic tends to concentrate at specific points in the network, creating bottlenecks and leaving parts of the available capacity unused.
The idea of replacing this organization with a more distributed network is not new. For years, academia has studied the application of random graph theory to network infrastructures, hypothesizing that greater freedom in connections between nodes can improve resilience, bandwidth utilization, and the ability to absorb high loads. However, no one had managed to transform these theoretical models into usable infrastructure at the scale demanded by large cloud operators.
The project stemmed from the collaboration between AWS's principal applied scientist Giacomo Bernardi, networking expert Ratul Mahajan, and mathematician Seshadhri Comandur, a professor at the University of California, Santa Cruz, and Amazon Scholar.
The team had to confront three fundamental problems. The first involved the physical realization of the network: connecting millions of optical fibers in random patterns risked turning data centers into an unmanageable tangle of cables. To overcome this obstacle, AWS developed ShuffleBox, a passive optical device that needs no power and internally rearranges connections according to a deterministic scheme. This makes it possible to achieve the desired statistical behavior of random graphs while keeping a structure that is replicable, standardized, and easy to install in any facility.
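The idea of a fixed wiring pattern that still behaves like a random one can be illustrated with a small sketch. AWS has not published ShuffleBox's internal mapping in this article, so everything here is an assumption made for illustration: the function names, the seed, the port counts, and the simplification that router r owns a contiguous block of ports on each side of the box.

```python
import random

def shuffle_map(num_ports, seed=2024):
    """Return a fixed permutation: the fiber on input port i exits on output port perm[i].

    The seed is a hypothetical stand-in for a factory-defined pattern: every box
    built with the same seed has exactly the same internal wiring.
    """
    rng = random.Random(seed)
    perm = list(range(num_ports))
    rng.shuffle(perm)
    return perm

def router_links(num_routers, ports_per_router, seed=2024):
    """Derive router-to-router links from the box wiring.

    Assumption: router r plugs into ports r*ports_per_router .. (r+1)*ports_per_router - 1
    on both the input and the output side of the box.
    """
    perm = shuffle_map(num_routers * ports_per_router, seed)
    links = []
    for in_port, out_port in enumerate(perm):
        src = in_port // ports_per_router
        dst = out_port // ports_per_router
        links.append((src, dst))
    return links

if __name__ == "__main__":
    # Installers cable every router to the box in plain port order;
    # the "randomness" lives entirely inside the box.
    for src, dst in router_links(num_routers=8, ports_per_router=4)[:8]:
        print(f"router {src} -> router {dst}")
```

In this toy version the cabling outside the box stays regular and repeatable, while the resulting router-level graph looks randomly wired. A real design would also have to handle details this sketch ignores, such as avoiding a router being linked to itself.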
The second problem was traffic routing. In a hierarchical network, the path that packets take is relatively predictable; in a network made up of thousands of distributed connections, identifying the best route becomes much more complex. To address this, AWS developed a dedicated protocol called Spraypoint. Unlike traditional approaches that favor the shortest path, Spraypoint distributes traffic across a large number of available routes simultaneously. An initial "spray" phase spreads data to nearby routers, while a subsequent "point" phase directs it to the final destination. According to the researchers, the goal is not always to choose the shortest route but to exploit as many alternative paths as possible, reducing congestion and improving resilience in the event of failures.
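The two-phase idea can be sketched in a few lines. This is not the Spraypoint protocol itself, whose details the article does not give, but a toy model under stated assumptions: the graph is a ring with random extra links (a hypothetical stand-in for the real topology), the "spray" step picks one random neighbor of the source, and the "point" step follows a breadth-first shortest path from that neighbor to the destination. All function names are illustrative.

```python
import random
from collections import deque

def make_graph(n, extra_links, seed=0):
    """Toy topology: a ring (keeps the graph connected) plus random extra links."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        w = (v + 1) % n
        adj[v].add(w)
        adj[w].add(v)
    added = 0
    while added < extra_links:
        a, b = rng.sample(range(n), 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj

def shortest_path(adj, src, dst, avoid=frozenset()):
    """Breadth-first search from src to dst that never enters nodes in `avoid`."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            break
        for w in sorted(adj[v]):
            if w not in parent and w not in avoid:
                parent[w] = v
                queue.append(w)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def spray_then_point(adj, src, dst, rng):
    """Spray: hop to a random neighbor. Point: shortest path from there to dst.

    The source is excluded from the second phase so the packet never loops back;
    the ring in this toy graph keeps the rest connected when one node is removed.
    """
    if src == dst:
        return [src]
    first_hop = rng.choice(sorted(adj[src]))
    return [src] + shortest_path(adj, first_hop, dst, avoid={src})

if __name__ == "__main__":
    adj = make_graph(16, 16, seed=1)
    rng = random.Random(7)
    # Repeated sends between the same pair leave the source over different links.
    first_links = {tuple(spray_then_point(adj, 0, 8, rng)[:2]) for _ in range(200)}
    print("shortest path:", shortest_path(adj, 0, 8))
    print("distinct first links used by spraying:", len(first_links))
```

Shortest-path routing would always push this flow over the same first link, whereas the spray step spreads it over every link the source has, at the cost of some paths being slightly longer. That is the trade-off the researchers describe.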
Before proceeding with the physical realization, AWS invested significant resources in validating the model. Bernardi and Mahajan ran large-scale simulations on Amazon EC2 to test the behavior of the network in hundreds of thousands of different scenarios. In total, the simulations consumed approximately 530 processor-years of compute, the work a single processor would need more than half a millennium to complete. The results showed promising performance, but it was Comandur's mathematical contribution that provided the theoretical proof necessary to predict the network's behavior at any operational scale.
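To put the 530 processor-years in perspective, the figure can be converted into processor-hours and into wall-clock time on a parallel fleet. The fleet sizes below are purely hypothetical, since the article does not say how many EC2 instances were used, and the calculation assumes the scenarios are independent and parallelize perfectly.

```python
HOURS_PER_YEAR = 365.25 * 24  # average calendar year, leap years included

def processor_hours(processor_years):
    """Convert processor-years into processor-hours."""
    return processor_years * HOURS_PER_YEAR

def wall_clock_days(processor_years, parallel_processors):
    """Elapsed days if the work is split evenly across the given processors."""
    return processor_hours(processor_years) / parallel_processors / 24

if __name__ == "__main__":
    print(f"{processor_hours(530):,.0f} processor-hours in total")
    for fleet in (1_000, 10_000, 100_000):  # hypothetical fleet sizes
        print(f"{fleet:>7,} processors: {wall_clock_days(530, fleet):.1f} days")
```

Under those assumptions, a workload that would occupy one processor for centuries fits into days or weeks of elapsed time, which is what makes testing hundreds of thousands of scenarios practical.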
Before introducing ShuffleBox, the team even built a prototype by manually wiring the optical connections in the pattern dictated by the random-graph model. The test confirmed in practice what had been observed in the theoretical models and simulations.
The benefits touted by AWS are significant. In tests under real traffic conditions, RNG delivered a throughput increase of up to 33% compared to traditional hierarchical architectures. The company also claims that the new structure requires 69% less networking equipment along the path between servers and the final destination. This would translate into a reduction in infrastructure costs of up to 45%, with potential savings quantified in billions of dollars across the entire AWS global infrastructure.
On the energy front, Amazon expects a 40% decrease in electricity consumption associated with networking equipment, resulting in a reduction of CO2 emissions at sites where the new architecture will be implemented.
Perhaps the most interesting aspect is that AWS managed to introduce this transformation without completely redesigning the existing infrastructure. The new network continues to use the routers, optical modules, transceivers, and cabling already employed in current data centers. The main innovations are Spraypoint and ShuffleBox themselves.
AWS believes that RNG could represent a significant competitive advantage, transforming a theory that has remained confined to academia for years into an operational technology destined to support the next generation of cloud infrastructures.