Adolf Hohl
Efficient deployment and inference of GPU-accelerated LLMs
#1 · about 2 minutes
The evolution of generative AI from experimentation to production
Generative AI has rapidly moved from experimentation with models like Llama and Mistral to production-ready applications in 2024.
#2 · about 3 minutes
Comparing managed AI services with the DIY approach
Managed services offer ease of use but limited control, while a do-it-yourself approach provides full control but introduces significant complexity.
#3 · about 4 minutes
Introducing NVIDIA NIM for simplified LLM deployment
NVIDIA Inference Microservices (NIM) provide a containerized, OpenAI-compatible solution for deploying models anywhere with enterprise support.
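Because the endpoint is OpenAI-compatible, existing client code needs only a new base URL. A minimal sketch, assuming a NIM instance listening on localhost port 8000 and a model id of meta/llama3-8b-instruct (both values are assumptions; check /v1/models on your deployment):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local NIM endpoint.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # NIM exposes an OpenAI-compatible API
    api_key="not-needed",                 # local deployments typically ignore the key
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",      # assumed model id for illustration
    messages=[{"role": "user", "content": "Summarize NVIDIA NIM in one sentence."}],
    max_tokens=80,
)
print(response.choices[0].message.content)
```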
#4 · about 2 minutes
Boosting inference throughput with lower precision quantization
Using lower precision formats like FP8 dramatically increases model inference throughput, providing more performance for the same hardware investment.
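A back-of-envelope calculation shows where part of the gain comes from: FP8 stores one byte per weight instead of two, halving weight memory and memory traffic. The 70B parameter count below is a hypothetical example; real throughput gains also depend on kernels and hardware support for FP8.

```python
params = 70e9  # hypothetical 70B-parameter model

gb_fp16 = params * 2 / 1e9  # FP16/BF16: 2 bytes per weight
gb_fp8 = params * 1 / 1e9   # FP8: 1 byte per weight

print(f"FP16 weights: ~{gb_fp16:.0f} GB")  # ~140 GB
print(f"FP8 weights:  ~{gb_fp8:.0f} GB")   # ~70 GB
```

The freed memory can hold a larger KV cache or bigger batches, which is where much of the serving-throughput improvement shows up in practice.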
#5 · about 2 minutes
Overview of the NVIDIA AI Enterprise software platform
The NVIDIA AI Enterprise platform is a cloud-native software stack that abstracts away low-level complexity, such as direct CUDA programming, to streamline AI pipeline development.
#6 · about 2 minutes
A look inside the NIM container architecture
NIM containers bundle optimized inference tools like TensorRT-LLM and Triton Inference Server to accelerate models on specific GPU hardware.
#7 · about 3 minutes
How to run and interact with a NIM container
A NIM container can be launched with a simple Docker command, automatically discovering hardware and exposing OpenAI-compatible API endpoints for interaction.
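As a sketch of that launch-and-probe flow: the image tag, port, and NGC_API_KEY pass-through below follow NVIDIA's published examples but should be treated as assumptions and verified against the NIM documentation for your model.

```python
import subprocess
import requests

# Launch the container detached; assumes NGC_API_KEY is set in the calling
# environment and that the image tag exists in your registry.
subprocess.run([
    "docker", "run", "-d", "--gpus", "all",
    "-e", "NGC_API_KEY",                          # pass through the NGC credential
    "-p", "8000:8000",
    "nvcr.io/nim/meta/llama3-8b-instruct:latest", # assumed image tag
], check=True)

# Once the model has downloaded and loaded, the OpenAI-compatible endpoints
# come up; /v1/models lists what the container is serving.
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])
```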
#8 · about 2 minutes
Efficiently serving custom models with LoRA adapters
NIM enables serving multiple customized LoRA adapters on a single base model simultaneously, saving memory while providing distinct model endpoints.
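From the client side, each loaded adapter appears as its own model id while the base weights are shared in GPU memory. A minimal sketch, where the adapter names ("billing-assistant", "legal-summarizer") and the base model id are hypothetical placeholders for whatever adapters your deployment has loaded:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# One deployment, one set of base weights, several model ids: the base model
# plus two hypothetical LoRA adapters served on top of it.
for model_id in ("meta/llama3-8b-instruct",  # assumed base model id
                 "billing-assistant",         # hypothetical LoRA adapter
                 "legal-summarizer"):         # hypothetical LoRA adapter
    reply = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "What task are you tuned for?"}],
        max_tokens=60,
    )
    print(model_id, "->", reply.choices[0].message.content)
```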
#9 · about 3 minutes
How NIM automatically handles hardware and model optimization
NIM simplifies deployment by automatically selecting the best pre-compiled model based on the detected GPU architecture and user preference for latency or throughput.
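When the automatic choice needs overriding, NVIDIA's documentation describes listing the available profiles and pinning one via an environment variable. The utility name (list-model-profiles), variable name (NIM_MODEL_PROFILE), and image tag in this sketch are assumptions to verify against the NIM docs:

```python
import subprocess

# List the profiles (e.g. latency- vs throughput-optimized TensorRT-LLM engines)
# the container supports for the detected GPU. Utility name is an assumption.
subprocess.run([
    "docker", "run", "--rm", "--gpus", "all",
    "nvcr.io/nim/meta/llama3-8b-instruct:latest",  # assumed image tag
    "list-model-profiles",
], check=True)

# Pin a specific profile instead of letting NIM pick one automatically.
subprocess.run([
    "docker", "run", "-d", "--gpus", "all",
    "-e", "NIM_MODEL_PROFILE=<profile-id-from-listing>",  # assumed variable name
    "-p", "8000:8000",
    "nvcr.io/nim/meta/llama3-8b-instruct:latest",
], check=True)
```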
Matching moments
23:24 · Deploying models with TensorRT and Triton Inference Server (Trends, Challenges and Best Practices for AI at the Edge)
15:54 · Deploying enterprise AI applications with NVIDIA NIM (WWC24 - Ankit Patel - Unlocking the Future: Breakthrough Application Performance and Capabilities with NVIDIA)
09:43 · The technical challenges of running LLMs in browsers (From ML to LLM: On-device AI in the Browser)
13:15 · Running on-device models with the WebLLM library (From ML to LLM: On-device AI in the Browser)
03:36 · The rapid evolution and adoption of LLMs (Building Blocks of RAG: From Understanding to Implementation)
16:17 · Deploying and scaling models with NVIDIA NIM on Kubernetes (LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices)
00:05 · Introduction to large-scale AI infrastructure challenges (Your Next AI Needs 10,000 GPUs. Now What?)
27:27 · Matching edge AI challenges with NVIDIA's solutions (Trends, Challenges and Best Practices for AI at the Edge)
Related Videos
Self-Hosted LLMs: From Zero to Inference (Roberto Carratalá & Cedric Clyburn)
Your Next AI Needs 10,000 GPUs. Now What? (Anshul Jindal & Martin Piercy)
LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices (Anshul Jindal)
DevOps for AI: running LLMs in production with Kubernetes and KubeFlow (Aarno Aukia)
WWC24 - Ankit Patel - Unlocking the Future: Breakthrough Application Performance and Capabilities with NVIDIA (Ankit Patel)
Unveiling the Magic: Scaling Large Language Models to Serve Millions (Patrick Koss)
Exploring LLMs across clouds (Tomislav Tipurić)
Unlocking the Power of AI: Accessible Language Model Tuning for All (Cedric Clyburn & Legare Kerrison)
From learning to earning
Jobs that call for the skills explored in this talk.

AI Systems and MLOps Engineer for Earth Observation · Forschungszentrum Jülich GmbH · Jülich, Germany · Intermediate/Senior · Linux, Docker, AI Frameworks, Machine Learning
Deep Learning Solutions Architect - Inference Optimization · NVIDIA Corporation · Remote · Senior · C++, DevOps, Python, Docker, +1
Machine Learning Engineer - Large Language Models (LLM) - Startup · Startup · Charing Cross, United Kingdom · PyTorch, Machine Learning
AI & Embedded ML Engineer (Real-Time Edge Optimization) · autonomous-teaming · Toulouse, France (Remote) · C++, GIT, Linux, Python, +1
AI & Embedded ML Engineer (Real-Time Edge Optimization) · autonomous-teaming · München, Germany (Remote) · C++, GIT, Linux, Python, +1
Manager of Machine Learning (LLM/NLP/Generative AI) - Visas Supported · European Tech Recruit · Bilbao, Spain · Junior · GIT, Python, Docker, Computer Vision, Machine Learning, +2
Solution Architect - Generative AI Data and Post-Training · NVIDIA · Plaisir, France · Senior · C++, Python, PyTorch
ML Platform Engineer - Lepton · NVIDIA · Kirkburton, United Kingdom · €184-287K · Senior · Python, Docker, Ansible, Terraform, +1