Kevin Klues
A Deep Dive on How To Leverage the NVIDIA GB200 for Ultra-Fast Training and Inference on Kubernetes
#1 about 2 minutes
Understanding the NVIDIA GB200 supercomputer architecture
The GB200 uses multi-node NVLink and NVSwitches to connect up to 72 GPUs across multiple nodes into a single system.
#2 about 2 minutes
Enabling secure multi-node GPU communication on Kubernetes
The GPU Operator already runs on GB200 nodes, but it needs support for a new construct called IMEX before workloads can securely use multi-node NVLink connections.
#3 about 2 minutes
How the IMEX CUDA APIs enable remote memory access
Applications use a sequence of CUDA API calls such as `cuMemCreate` and `cuMemExportToShareableHandle` to securely map and access remote GPU memory over NVLink.
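The call order can be sketched as a checklist. This is a minimal sketch, not code from the talk: the summary names only `cuMemCreate` and the export call, so the importer-side calls (`cuMemImportFromShareableHandle`, `cuMemAddressReserve`, `cuMemMap`, `cuMemSetAccess`) and the `CU_MEM_HANDLE_TYPE_FABRIC` handle type are assumptions drawn from the CUDA driver API's virtual memory management functions. The Python below only models the ordering and does not call CUDA.

```python
# Order-of-operations sketch only; nothing here calls into CUDA.
# The importer-side steps are an assumption based on the CUDA driver API,
# not something stated in the talk summary.

EXPORTER_STEPS = [
    ("cuMemCreate",
     "allocate physical GPU memory, requesting CU_MEM_HANDLE_TYPE_FABRIC"),
    ("cuMemExportToShareableHandle",
     "turn the allocation into a handle that can be sent to another node"),
]

IMPORTER_STEPS = [
    ("cuMemImportFromShareableHandle",
     "turn the received handle back into a local allocation handle"),
    ("cuMemAddressReserve",
     "reserve a virtual address range for the remote memory"),
    ("cuMemMap",
     "map the imported allocation into the reserved range"),
    ("cuMemSetAccess",
     "grant the local GPU read/write access to the mapping"),
]


def describe(steps):
    """Return the steps as numbered, human-readable lines."""
    return [f"{i}. {name}: {why}" for i, (name, why) in enumerate(steps, 1)]


if __name__ == "__main__":
    print("Exporting process:")
    print("\n".join(describe(EXPORTER_STEPS)))
    print("Importing process:")
    print("\n".join(describe(IMPORTER_STEPS)))
```

How the handle travels between the two processes (for example over MPI or a socket) is left to the application and is not modeled here.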
#4 about 4 minutes
Exploring the four levels of IMEX resource partitioning
IMEX security is managed through a four-level hierarchy, from the physical NVLink Domain down to the workload-specific IMEX Channel allocated within an IMEX Domain.
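The hierarchy can be pictured as a four-field address. The sketch below is an illustration under stated assumptions: the summary names only the NVLink Domain, IMEX Domain, and IMEX Channel, so the second level is labeled here as an NVLink partition, which is an assumption. The all-levels-must-match rule is a simplification for illustration, and `ChannelRef` and both helper functions are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ChannelRef:
    """Where a workload sits in the four-level hierarchy (outermost first)."""
    nvlink_domain: str     # level 1: the physical NVLink Domain
    nvlink_partition: str  # level 2: assumed name, not given in the summary
    imex_domain: str       # level 3: the IMEX Domain
    imex_channel: int      # level 4: channel allocated inside the IMEX Domain


def first_mismatch(a: ChannelRef, b: ChannelRef) -> Optional[str]:
    """Return the outermost level at which a and b differ, or None."""
    for f in fields(a):
        if getattr(a, f.name) != getattr(b, f.name):
            return f.name
    return None


def can_share_memory(a: ChannelRef, b: ChannelRef) -> bool:
    """Simplified rule: sharing is allowed only if every level matches."""
    return first_mismatch(a, b) is None


if __name__ == "__main__":
    job_a = ChannelRef("rack-1", "partition-0", "domain-a", 0)
    job_b = ChannelRef("rack-1", "partition-0", "domain-a", 1)
    print(can_share_memory(job_a, job_a))  # same channel
    print(first_mismatch(job_a, job_b))    # differs at the channel level
```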
#5 about 6 minutes
Abstracting IMEX complexity with the compute domain concept
The complex manual setup of IMEX daemons and channels is hidden behind a user-friendly "Compute Domain" abstraction that uses Dynamic Resource Allocation (DRA).
#6 about 2 minutes
How to migrate a multi-node workload to compute domains
Migrating a workload involves creating a `ComputeDomain` object and updating the pod spec to reference its `resourceClaimTemplate` in the new `resourceClaims` section.
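The two objects might look roughly like the following, built as Python dictionaries and printed as JSON that `kubectl apply -f -` can read. This is a sketch under assumptions, not the talk's manifest: the `apiVersion` value and the `numNodes` and `channel` fields of the `ComputeDomain` are assumed and should be checked against the installed CRD. All object names and the container image are hypothetical. The pod-side `resourceClaims` and `resources.claims` fields are part of the core Kubernetes DRA API.

```python
import json

CLAIM_TEMPLATE = "demo-compute-domain-channel"  # hypothetical name

# ComputeDomain object. apiVersion and spec field names are assumptions;
# verify them against the CRD shipped with the DRA driver.
compute_domain = {
    "apiVersion": "resource.nvidia.com/v1beta1",
    "kind": "ComputeDomain",
    "metadata": {"name": "demo-compute-domain"},
    "spec": {
        "numNodes": 2,
        "channel": {"resourceClaimTemplate": {"name": CLAIM_TEMPLATE}},
    },
}

# Pod that references the template created for the ComputeDomain.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-worker"},
    "spec": {
        "containers": [{
            "name": "worker",
            "image": "example.invalid/mpi-worker:latest",  # placeholder image
            # The container opts in to the claim declared below.
            "resources": {"claims": [{"name": "imex-channel"}]},
        }],
        # The pod-level resourceClaims section points at the template.
        "resourceClaims": [{
            "name": "imex-channel",
            "resourceClaimTemplateName": CLAIM_TEMPLATE,
        }],
    },
}


def to_manifest_list(*objs):
    """Wrap objects in a v1 List and serialize to JSON."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": list(objs)}, indent=2
    )


if __name__ == "__main__":
    print(to_manifest_list(compute_domain, pod))
```

The only link between the two objects is the template name, so keeping it in one constant avoids the pod and the `ComputeDomain` drifting apart.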
#7 about 5 minutes
Understanding the compute domain DRA driver's architecture
The driver uses a central controller and a kubelet plugin to orchestrate the lifecycle of IMEX daemons and channels, ensuring they are ready before workloads start.
#8 about 6 minutes
Demonstrating a multi-node MPI job on a GB200 cluster
A live demo shows how to deploy the DRA driver and run an MPI job that automatically gets IMEX daemons and achieves full NVLink bandwidth across nodes.
#9 about 2 minutes
Prerequisites and resources for using the DRA driver
To use the driver, you must enable DRA and CDI feature flags in Kubernetes and ensure the GPU driver includes the necessary IMEX binaries.