Patrick Koss
Unveiling the Magic: Scaling Large Language Models to Serve Millions
#1about 3 minutes
Understanding the benefits of self-hosting large language models
Self-hosting LLMs provides greater control over data privacy, compliance, cost, and vendor lock-in compared to using third-party services.
#2about 4 minutes
Architectural overview for a scalable LLM serving platform
A scalable LLM service requires key components for model acquisition, inference, storage, billing, security, and request routing.
#3about 7 minutes
Choosing an inference engine and model storage strategy
Using network file storage (NFS) is crucial for reducing startup times and enabling fast horizontal scaling when deploying new model instances.
#4about 5 minutes
Building an efficient token-based billing system
Aggregate token usage with tools like Redis before sending data to a payment provider to manage rate limits and improve system efficiency.
#5about 3 minutes
Implementing robust rate limiting for shared LLM systems
Prevent system abuse by implementing both request-based and token-based rate limiting, using estimations for output tokens to protect shared resources.
#6about 3 minutes
Selecting the right authentication and authorization strategy
Bearer tokens offer a flexible solution for managing authentication and fine-grained authorization, such as restricting access to specific models.
#7about 2 minutes
Scaling inference with Kubernetes and smart routing
Use tools like KServe or Knative on Kubernetes for intelligent autoscaling and canary deployments based on custom metrics like queue size.
#8about 3 minutes
Summary of best practices for scalable LLM deployment
Key strategies for success include robust rate limiting, modular design, continuous benchmarking, and using canary deployments for safe production testing.
Related jobs
Jobs that call for the skills explored in this talk.
Wilken GmbH
Ulm, Germany
Senior
Kubernetes
AI Frameworks
+3
Picnic Technologies B.V.
Amsterdam, Netherlands
Intermediate
Senior
Python
Structured Query Language (SQL)
+1
ROSEN Technology and Research Center GmbH
Osnabrück, Germany
Senior
TypeScript
React
+3
Matching moments
03:55 MIN
The hardware requirements for running LLMs locally
AI in the Open and in Browsers - Tarek Ziadé
04:28 MIN
Building an open source community around AI models
AI in the Open and in Browsers - Tarek Ziadé
04:59 MIN
Unlocking LLM potential with creative prompting techniques
WeAreDevelopers LIVE – Frontend Inspirations, Web Standards and more
01:02 MIN
AI lawsuits, code flagging, and self-driving subscriptions
Fake or News: Self-Driving Cars on Subscription, Crypto Attacks Rising and Working While You Sleep - Théodore Lefèvre
04:57 MIN
Increasing the value of talk recordings post-event
Cat Herding with Lions and Tigers - Christian Heilmann
07:39 MIN
Prompt injection as an unsolved AI security problem
AI in the Open and in Browsers - Tarek Ziadé
05:03 MIN
Building and iterating on an LLM-powered product
Slopquatting, API Keys, Fun with Fonts, Recruiters vs AI and more - The Best of LIVE 2025 - Part 2
09:10 MIN
How AI is changing the freelance developer experience
WeAreDevelopers LIVE – AI, Freelancing, Keeping Up with Tech and More
Featured Partners
Related Videos
Self-Hosted LLMs: From Zero to Inference
Roberto Carratalá & Cedric Clyburn
How AI Models Get Smarter
Ankit Patel
Using LLMs in your Product
Daniel Töws
Three years of putting LLMs into Software - Lessons learned
Simon A.T. Jiménez
How to Avoid LLM Pitfalls - Mete Atamel and Guillaume Laforge
Meta Atamel & Guillaume Laforge
Inside the Mind of an LLM
Emanuele Fabbiani
Unlocking the Power of AI: Accessible Language Model Tuning for All
Cedric Clyburn & Legare Kerrison
Your Next AI Needs 10,000 GPUs. Now What?
Anshul Jindal & Martin Piercy
Related Articles
View all articles
.png?w=240&auto=compress,format)
.png?w=240&auto=compress,format)

From learning to earning
Jobs that call for the skills explored in this talk.

Forschungszentrum Jülich GmbH
Jülich, Germany
Intermediate
Senior
Linux
Docker
AI Frameworks
Machine Learning

Xablu
Hengelo, Netherlands
Intermediate
.NET
Python
PyTorch
Blockchain
TensorFlow
+3

Amazon.com Inc.
Senior
R
API
Unix
Perl
Ruby
+7



Scalable GmbH
Berlin, Germany
API
Data analysis
Microservices
Agile Methodologies


Barcelona Supercomputing Center
Barcelona, Spain
Intermediate
Python
PyTorch
Machine Learning
