
How NVIDIA Is Embracing Cloud Native To Scale GPU Infrastructure

Containers have emerged as the standard unit of deployment for modern workloads. Their portability and scalability enable developers and operators to rapidly build and deploy applications.

Kubernetes has become the foundation of modern infrastructure. It orchestrates both the infrastructure and the containerized workloads running on it, and it is one of the fastest-growing open source technologies after Linux.

Cloud native is an architectural pattern for developing and deploying containerized applications managed by Kubernetes in almost any environment: an on-premises data center, a public cloud, or even a remote edge device deployed on a ship.

NVIDIA, the undisputed leader in AI infrastructure and platforms, has embraced containers and Kubernetes to make GPUs accessible to cloud native developers. It has built various services and tools to bridge the gap between hardware accelerators, such as GPUs and DPUs, and the Kubernetes ecosystem. 

It’s not surprising to see NVIDIA betting big on cloud native. The combination of Kubernetes and GPUs delivers unmatched scale for AI workloads. Kubernetes scales the infrastructure by pooling the compute resources of every machine in the cluster, while GPUs provide the massive parallelism required for training and inference of complex deep learning models. You get the best of both worlds when you run Kubernetes on a cluster of GPU machines – scale and parallelism. 
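In practice, a Kubernetes workload asks for GPUs through the extended resource `nvidia.com/gpu`, which NVIDIA's device plugin advertises on each GPU node. A minimal sketch of such a pod spec follows; the pod name and CUDA image tag are illustrative, and the cluster must already have the device plugin (or GPU Operator) installed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed tag; match your driver
    command: ["nvidia-smi"]        # print the GPUs visible to the container
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU from the cluster pool
```

The scheduler places this pod only on a node with a free GPU, which is what makes pooled GPU capacity behave like any other cluster resource.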

Some of the key platforms, such as NVIDIA DGX and EGX, run Kubernetes as the orchestration layer. NVIDIA is working with platform vendors to integrate its cloud native GPU infrastructure with Kubernetes. 

NVIDIA Container Toolkit – Exposing GPU to Containers

Container runtimes are the core building blocks of cloud native platforms. They are the part of the underlying operating system that manages the lifecycle of containers. Docker Engine and containerd are the most popular container runtimes in the cloud native universe. 

NVIDIA has built the NVIDIA Container Toolkit, an extension to Docker Engine and containerd that makes GPUs visible to containers. It integrates tightly with the container runtime to deliver the benefits of containerization – portability and scale. 

NVIDIA Container Toolkit (Image: NVIDIA)

Typically, administrators start the configuration by installing the graphics driver, followed by the CUDA library and runtime, which are accessed by deep learning frameworks such as TensorFlow and PyTorch. The stack must run compatible versions of the GPU driver, CUDA, Python, and the ML frameworks to get the best performance. Any version mismatch can make the GPU inaccessible to developers. 

The NVIDIA Container Toolkit reduces the friction of configuring the GPU software stack. As long as a supported version of the graphics driver is present on the host, developers can pull a CUDA container image without installing the rest of the stack. 
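With the toolkit installed and a compatible driver on the host, running a GPU-enabled container is a one-liner. A sketch, assuming Docker 19.03 or later; the CUDA image tag is an assumption, so pick one compatible with your driver:

```shell
# Verify the host driver is visible from inside a CUDA base container.
# Requires an NVIDIA driver plus the NVIDIA Container Toolkit on the host.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If the setup is correct, `nvidia-smi` prints the driver version and the GPUs the container can see – no CUDA installation inside the host OS required.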

The NVIDIA Container Toolkit can be installed on servers running high-end GPUs such as the A100 and even on edge devices such as the Jetson Nano. 

NGC – The Hub for Cloud Native, GPU-Optimized AI Resources

The NVIDIA GPU Cloud (NGC) is a one-stop shop for GPU-optimized containerized software. One of its key elements is the NVIDIA Container Registry (NVCR), an image registry whose collection of container images provides instant access to CUDA and HPC software. 

NGC hosts container images, Helm charts, pre-trained models, Jupyter notebooks, and toolkits such as Jarvis and TLT. Any developer can sign up with NGC for free to download GPU-optimized software. 
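Pulling a GPU-optimized framework image from NVCR follows the standard Docker workflow against the `nvcr.io` registry host. A sketch; the tag below is illustrative (NGC framework tags follow a YY.MM-py3 scheme, so check the catalog for current ones):

```shell
# Pull NVIDIA's optimized PyTorch container from the NGC registry
docker pull nvcr.io/nvidia/pytorch:23.10-py3

# Run it with GPU access (requires the NVIDIA Container Toolkit on the host)
# and confirm that PyTorch can see the GPU
docker run --rm --gpus all nvcr.io/nvidia/pytorch:23.10-py3 \
  python -c "import torch; print(torch.cuda.is_available())"
```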

NGC (Image: NVIDIA)

The combination of NVIDIA Container Toolkit and NVCR dramatically simplifies the workflow involved in building AI applications. 

DGX customers can sign up for the NGC private registry, a dedicated and secure repository for storing sensitive artifacts. 

NVIDIA DeepOps – The Open Source Installer for GPU-Ready Kubernetes and Kubeflow

NVIDIA DeepOps is an integrated open source installer for deploying Kubernetes and Kubeflow. NVIDIA has assembled best-of-breed software to accelerate the installation and configuration of a GPU cluster running Kubernetes. It relies on Kubespray to roll out Kubernetes to one or more machines. 

NVIDIA DeepOps automates the installation of Kubeflow, an open source, cloud native machine learning platform that runs on Kubernetes. 

The DeepOps installer deploys everything from the container runtime and the Kubernetes control plane and nodes to the NVIDIA GPU Operator and an overlay storage layer. In just 30 minutes, customers can have a fully configured, cloud native, GPU-optimized infrastructure running on a single node, such as a DGX box, or on a cluster of GPU servers. 
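The DeepOps workflow itself is Ansible-driven. A sketch of the documented flow follows; the script and playbook paths are taken from the DeepOps repository, so verify them against the version you clone:

```shell
# Clone DeepOps and install its prerequisites (Ansible and friends)
git clone https://github.com/NVIDIA/deepops.git
cd deepops
./scripts/setup.sh

# List your nodes in the generated inventory under config/,
# then let Kubespray roll out Kubernetes across them
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
```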

The GPU Operator is specialized software designed exclusively for Kubernetes. It containerizes everything from the driver to the CUDA runtime and cuDNN libraries, without requiring the stack to be installed explicitly on the server. NVIDIA has also built a similar operator for the DPU to expose Mellanox SmartNICs to Kubernetes workloads. 

GPU Operator (Image: NVIDIA)
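Outside of DeepOps, the GPU Operator is typically installed on an existing cluster with Helm from NVIDIA's chart repository. A sketch, assuming Helm 3; the namespace name is a common convention, not a requirement:

```shell
# Add NVIDIA's Helm repository and install the GPU Operator,
# which then deploys the driver, toolkit, and device plugin on GPU nodes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```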

DeepOps supports upstream Kubernetes and enterprise container platforms such as Red Hat OpenShift. 

With containerization and cloud native technologies becoming key for NVIDIA, it should consider acquiring a Kubernetes platform company to build its own stack. An official Kubernetes distribution from NVIDIA would complement NGC, the Container Toolkit, and the GPU Operator to deliver an end-to-end cloud native platform. A GPU-optimized, integrated Kubernetes platform from NVIDIA could become the foundation of AI and HPC infrastructure.