Evolution of AI in Data Centers: Integration of Metal as a Service (MAAS) with NVIDIA Smart NICs

Canonical has been building support for NVIDIA smart NICs (network interface cards) into its products for several years. One of these products is Canonical’s metal-as-a-service (MAAS), which can manage and control these smart NICs on bare-metal servers. NVIDIA BlueField smart NICs are high-speed networking cards that provide advanced software-defined infrastructure services for data centers. They combine high-performance specialized network ASICs (application-specific integrated circuits) with powerful general-purpose CPUs and RAM. These cards deliver network function acceleration, hardware offloading, and isolation, which enable new use cases in networking, security, and storage.

Canonical treats Data Processing Units (DPUs) as a priority across its product range, integrating the kernel drivers and managing the software lifecycle on the cards. Our operating systems and infrastructure software take full advantage of the cards’ features. These programmable accelerators provide services that are central to Canonical’s data center networking strategy.

We are excited about the arrival of NVIDIA’s BlueField-3. This latest addition to the BlueField series of DPUs and Super NICs is a key component of NVIDIA’s Spectrum-X, their AI Cloud reference architecture.

This blog post looks at the new challenges that AI workloads pose to conventional data center network technologies. We introduce NVIDIA’s Spectrum-X, which combines BlueField-3 DPUs and Super NICs with the Spectrum-4 line of Ethernet switches to address these challenges, and we outline the software support Canonical offers for this hardware.

Impacts of AI Training on Ethernet

Steps in Distributed Deep Learning Computation

To handle AI workloads, data center operators, enterprises, and cloud providers need a unified and forward-looking approach to workload management and interconnects. Canonical is ready to support new initiatives in this area, particularly as AI and Large Language Models (LLMs) reshape established practices.

Distributed deep learning encompasses three primary computation stages:

Calculating the gradient of the loss function on each GPU within each node of the distributed system
Determining the average of the gradients through inter-GPU communication over the network
Updating the model

The second step, also known as the ‘Allreduce’ distributed algorithm, is especially taxing on the network.
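The three stages can be sketched in a few lines of plain Python. This is a toy illustration only: the "GPUs" are list entries, the model is a single weight fitted to y = 2x, and `allreduce_mean` is a hypothetical in-process stand-in for the collective that a real cluster would run over the network.

```python
# Toy data-parallel training: every worker holds a copy of the same model
# (one weight w for y = w * x) and its own shard of the data.

def local_gradient(w, shard):
    """Stage 1: gradient of the mean squared error on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stage 2: every worker ends up with the same average value.
    On a real cluster this exchange is the network-heavy part; here it is
    simulated in-process."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def training_step(weights, shards, lr=0.01):
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    averaged = allreduce_mean(grads)
    # Stage 3: the identical update on every worker keeps the replicas in sync.
    return [w - lr * g for w, g in zip(weights, averaged)]

# Two workers, equally sized shards, data generated by y = 2x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]
weights = [0.0, 0.0]
for _ in range(200):
    weights = training_step(weights, shards)

print(weights)
```

Because every worker applies the same averaged gradient, the replicas never diverge, and no worker can start its next step until the averaging exchange has finished.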

Challenges of RDMA on Converged Ethernet Networks

RDMA (remote direct memory access) is a mechanism that lets software processes share memory over the network as if they were running on a single machine. High-bandwidth, low-latency direct GPU-to-GPU transfers across the network are one of its many applications, and they are crucial for reducing tail latency in the Allreduce algorithm.

While Ethernet is the prevailing interconnect technology, it is designed as a best-effort network that may drop packets when the network or its devices are busy, as discussed in our whitepaper. Congestion tends to occur during large simultaneous data transfers between nodes, particularly in the highly distributed parallel processing required for high-performance computing (HPC) and deep learning tasks. This congestion leads to packet loss, which can significantly degrade the performance of RDMA over converged Ethernet (RoCE). Various Ethernet extensions aim to address this issue, but they require additional processing across the components of the Ethernet fabric, such as servers, switches, and routers.
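A rough way to see why packet loss matters more as jobs grow: a collective step finishes only when its slowest transfer finishes. The sketch below assumes, purely for illustration, that each transfer is independently hit by a loss event with the same small probability. Real congestion is correlated, so treat the model and the function name `prob_step_delayed` as a simplification rather than a measurement.

```python
def prob_step_delayed(flows, loss_prob):
    """Probability that at least one of `flows` independent transfers is hit
    by a loss event, which stalls the whole collective step."""
    return 1.0 - (1.0 - loss_prob) ** flows

# The same per-transfer loss probability hurts far more at larger scale.
for flows in (8, 64, 512):
    print(flows, round(prob_step_delayed(flows, 0.001), 3))
```

Under these assumptions, a loss rate that is negligible for a handful of transfers delays a substantial share of steps once hundreds of transfers take part in each exchange.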

Ethernet’s limitations significantly affect multi-tenant hyperscale AI clouds that run numerous AI training workloads concurrently. InfiniBand is a proven alternative for highly parallel workloads, and for RDMA in particular, but Ethernet’s advantages of scale may limit the return on investment in InfiniBand.

To cater to even the most demanding workloads, Canonical continually updates its operating system and infrastructure tools to support the latest performance improvements and features. By closely following our hardware partners’ innovations, we ensure timely and optimal support for them at every layer of the system. One example is MAAS support for BlueField-3, a key component of NVIDIA’s proposed solution to Ethernet’s challenges with multi-tenant AI workloads.

NVIDIA AI Cloud Ingredients

With Spectrum-X, NVIDIA introduces one of the first end-to-end Ethernet solutions designed for multi-tenant environments where multiple AI workloads run simultaneously. Spectrum-X comprises several software innovations that rely on recently launched hardware: Spectrum-4 Ethernet switches and BlueField-3 based DPUs and Super NICs.

NVIDIA Spectrum-X

NVIDIA Spectrum-4 Ethernet Switches

Part of the Spectrum-X Network Platform Architecture, the Spectrum SN5600 belongs to NVIDIA’s latest line of Spectrum-4 Ethernet switches. It combines high switching capacity with Ethernet extensions to maximize RoCE performance when paired with BlueField-3 DPUs and Super NICs.

Switching capacity	51.2 Tb/s (terabits per second) / 33.3 Bpps (billion packets per second)
400GbE (Gigabit Ethernet) ports	128
800GbE (Gigabit Ethernet) ports	64
Adaptive routing with advanced congestion control	Per-packet load balancing across the entire fabric; end-to-end telemetry with nanosecond-level timing accuracy from switch to host; dynamic reallocation of packet paths and queue behavior
Universal shared buffer design	Bandwidth fairness across flows of different sizes, protecting workloads from noisy neighbors

NVIDIA Spectrum-4 SN5600 Ethernet switch features
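As a quick sanity check on the table, the sketch below multiplies each port count by its line rate. It assumes the two port rows describe alternative configurations of the same switch rather than ports available at the same time, which the table itself does not state.

```python
GBPS_PER_TBPS = 1000

def aggregate_tbps(ports, gbps_per_port):
    """Total line rate in Tb/s for a given port count and per-port speed."""
    return ports * gbps_per_port / GBPS_PER_TBPS

as_800gbe = aggregate_tbps(64, 800)    # 64 ports at 800 Gb/s
as_400gbe = aggregate_tbps(128, 400)   # 128 ports at 400 Gb/s
print(as_800gbe, as_400gbe)
```

Either configuration on its own accounts for the full 51.2 Tb/s switching capacity listed in the first row.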

NVIDIA BlueField-3 DPU and Super NICs

Representing a significant evolution in the BlueField lineup, the NVIDIA BlueField-3 networking platform delivers speeds up to 400 Gb/s (gigabits per second). It is purpose-built for network-intensive, highly parallelized computing and hyperscale AI workloads. BlueField-3 DPUs serve as trusted management and isolation points in AI clouds. BlueField-3 Super NICs, in turn, are closely integrated with the server’s GPU and the Spectrum-X Ethernet switches, which significantly improves network performance for LLM and deep learning training. At the same time, they improve power efficiency and provide predictable performance in multi-tenant environments.

Feature	Advantage
Offloading of congestion avoidance mechanisms	Frees GPU servers to focus on AI training tasks
NVIDIA Direct Data Placement	Places packets that arrive out of order, because adaptive routing load-balances them across paths, at the correct location in host/GPU memory
Management and control of sender’s data injection rate	Processes telemetry data from Spectrum SN5600 switches to optimize network resource sharing efficiency

NVIDIA Spectrum-X BlueField-3 Super NIC features and advantages
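The idea behind the Direct Data Placement row can be illustrated with a toy receiver: if every packet carries the offset of its payload, the receiver can write it straight to its final location whatever the arrival order. This is a conceptual sketch only. The `(offset, payload)` packet format and the `place_packets` function are hypothetical and do not describe NVIDIA’s implementation.

```python
def place_packets(packets, total_len):
    """Write each payload at the offset it carries, regardless of arrival order."""
    buffer = bytearray(total_len)
    for offset, payload in packets:
        buffer[offset:offset + len(payload)] = payload
    return bytes(buffer)

message = b"gradients-from-gpu-0"
# Split the message into 5-byte chunks, each tagged with its offset.
chunks = [(i, message[i:i + 5]) for i in range(0, len(message), 5)]
# Simulate per-packet load balancing delivering the chunks out of order.
arrival_order = [chunks[2], chunks[0], chunks[3], chunks[1]]

result = place_packets(arrival_order, len(message))
print(result == message)
```

No reordering buffer is needed in this model, because placement depends only on the offset and not on when the packet arrives.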

Through Spectrum-X, NVIDIA presents a purpose-built networking platform for demanding AI applications, offering numerous benefits over traditional Ethernet setups.

Combining MAAS and NVIDIA Smart NICs: Data Center AI Evolution at Canonical

Canonical’s metal-as-a-service (MAAS) software automates the management of large-scale data center network and server infrastructure. It turns the management of bare-metal resources into a cloud-like experience. To benefit from the way NVIDIA Spectrum-X addresses the Ethernet bottleneck for AI, the operating system and network function software on the BlueField-3 cards must be provisioned and configured correctly. MAAS is therefore being updated to support BlueField-3 provisioning and lifecycle management.

Configuring your data center infrastructure to take advantage of Spectrum-X innovations is the first step towards adapting to AI workloads. Canonical’s open-source automation and infrastructure software, MAAS and Juju, enable consistent deployment of network components, servers, and applications, and remove the need to micromanage individual components. Furthermore, Canonical’s engineers fix bugs and security vulnerabilities, while Ubuntu Pro provides the certifications needed to keep your data center compliant with the highest security standards.

With several years of experience in deploying Smart NICs and DPUs across diverse environments, Canonical has honed its expertise and perfected the automation of configuring, updating, and integrating these systems within data centers.

Provisioning BlueField Smart NICs with MAAS and PXE

Since version 3.3, Canonical’s MAAS can remotely install the official BlueField operating system (a reference distribution of Ubuntu) on a DPU and manage its upgrades as it would for any other server, using the UEFI Preboot eXecution Environment (PXE). A parent-child model represents the relationship between the host and its DPU. Because the BlueField OS derives from Ubuntu, MAAS can present the DPU as a host on which Juju manages the installation of additional applications, just as it does with standard Ubuntu servers.

Managing the DPU as a host in this way enables deeper integration between the applications running on the host and the network, storage, and security functions running on the DPU. It delivers all the benefits of general-purpose offloading and acceleration, although it may not fully address the specific challenges posed by multiple simultaneous AI training workloads.

Provisioning Smart NICs Directly through Their BMC with MAAS

Another approach uses the DPU’s own physical management interface to improve security and decouple the data center infrastructure from the workloads it supports. Acting as a server inside another server, the DPU provides an environment where the infrastructure layer can run independently of the host, isolated from untrusted tenant applications. In this setup, software running on the host CPU has no direct access to the DPU. This isolation makes it easier for a cloud service provider to manage both the networking and the storage of the cloud infrastructure and to expose them to tenants without the risk of interference.

BlueField-2 and BlueField-3 DPUs and Super NICs equipped with a baseboard management controller (BMC) are now available and can be used in this way. The BMC supports the Redfish and IPMI network standards and application programming interfaces (APIs). Characteristics unique to the DPU require adjustments to the standard MAAS processes. For instance, a DPU inside a server cannot be powered off on its own. A “cold” reset, which some procedures require, can therefore only be performed by power cycling the entire host, which should be avoided. These adaptations to MAAS are in development and are expected in upcoming feature releases.
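For readers unfamiliar with Redfish, the sketch below builds (but does not send) a standard Redfish `ComputerSystem.Reset` request using only the Python standard library. The host name, the system identifier `System1`, and the choice of `GracefulRestart` are placeholders, authentication is omitted, and which reset types a given BlueField BMC accepts is an assumption to verify against its documentation. This is not how MAAS itself drives the BMC.

```python
import json
import urllib.request

def build_reset_request(bmc_host, system_id, reset_type="GracefulRestart"):
    """Build a Redfish ComputerSystem.Reset request without sending it.

    bmc_host and system_id are placeholders for a real BMC address and the
    system ID that the BMC reports under /redfish/v1/Systems.
    """
    url = (f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
           "/Actions/ComputerSystem.Reset")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Hypothetical BMC address and system ID, for illustration only.
req = build_reset_request("dpu-bmc.example.net", "System1")
print(req.get_method(), req.full_url)
```

A warm reset like this one goes through the BMC alone, whereas the cold reset described above has no equivalent that leaves the host running.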

By enabling centralized management and control of smart NICs, the integration of MAAS with Spectrum-X shows what next-generation data centers can look like: suited to modern AI training and inference workloads in both single-tenant and multi-tenant scenarios.

Final Thoughts

Canonical collaborates with industry leaders in data center infrastructure to ensure the best possible support for the advanced features and performance improvements they introduce. Stay tuned for upcoming announcements about related infrastructure software and operating system releases!