Patrick Koss
Unveiling the Magic: Scaling Large Language Models to Serve Millions
#1 (about 3 minutes)
Understanding the benefits of self-hosting large language models
Self-hosting LLMs gives you greater control over data privacy, compliance, and cost, and reduces vendor lock-in compared with third-party services.
#2 (about 4 minutes)
Architectural overview for a scalable LLM serving platform
A scalable LLM service requires key components for model acquisition, inference, storage, billing, security, and request routing.
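To make the layering concrete, here is a minimal sketch of how those components might meet at the request-routing layer, assuming a FastAPI gateway. Every helper below is an illustrative stub, not an implementation from the talk.

```python
# Minimal gateway sketch: each platform component maps to one call.
# All helpers are illustrative stubs, not a real implementation.
from fastapi import FastAPI, Header
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    model: str
    prompt: str

def authenticate(bearer: str) -> str:              # security: who is calling?
    return "user-123"                              # stub

def enforce_rate_limit(user: str, req: CompletionRequest) -> None:
    pass                                           # stub: see rate-limiting chapter

async def route_to_model(req: CompletionRequest) -> tuple[str, int]:
    return "generated text", 42                    # stub: inference + routing

def record_usage(user: str, model: str, tokens: int) -> None:
    pass                                           # stub: see billing chapter

@app.post("/v1/completions")
async def completions(req: CompletionRequest, authorization: str = Header(...)):
    user = authenticate(authorization)             # security
    enforce_rate_limit(user, req)                  # protect shared capacity
    text, tokens = await route_to_model(req)       # inference + request routing
    record_usage(user, req.model, tokens)          # billing
    return {"completion": text, "usage": {"total_tokens": tokens}}
```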
#3 (about 7 minutes)
Choosing an inference engine and model storage strategy
Sharing model weights over a network file system (NFS) reduces startup times and enables fast horizontal scaling when deploying new model instances.
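As an illustration, here is a minimal sketch assuming vLLM as the inference engine and a hypothetical NFS mount at /mnt/models where the weights have already been synced:

```python
# Sketch: start an inference engine from weights on a shared NFS mount,
# so a freshly scheduled replica skips the multi-gigabyte hub download.
# Assumes vLLM; the mount path and model directory are hypothetical.
from vllm import LLM, SamplingParams

# A local (NFS-backed) path means every new pod mounting the same volume
# loads identical weights without fetching them over the internet.
llm = LLM(model="/mnt/models/llama-3-8b-instruct")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Why keep model weights on shared storage?"], params)
print(outputs[0].outputs[0].text)
```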
#4 (about 5 minutes)
Building an efficient token-based billing system
Aggregate token usage in a fast store like Redis before reporting it to a payment provider; batching keeps you under the provider's rate limits and improves system efficiency.
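A minimal sketch of that aggregation, assuming the redis-py client; the key layout and the report_usage callback are illustrative assumptions:

```python
# Sketch: aggregate per-user token counts in Redis and flush them to the
# payment provider in batches, staying under the provider's rate limits.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_tokens(user_id: str, tokens: int) -> None:
    # INCRBY is atomic, so concurrent gateway replicas share one counter.
    r.incrby(f"usage:{user_id}", tokens)

def flush_usage(report_usage) -> None:
    # Run periodically (e.g. once a minute): one provider call per user
    # instead of one call per request.
    for key in r.scan_iter(match="usage:*"):
        tokens = int(r.getdel(key) or 0)  # read and reset atomically (Redis >= 6.2)
        if tokens:
            report_usage(key.decode().split(":", 1)[1], tokens)
```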
#5 (about 3 minutes)
Implementing robust rate limiting for shared LLM systems
Prevent abuse of shared resources by implementing both request-based and token-based rate limiting, using an estimate for output tokens since their count is unknown until generation completes.
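A fixed-window sketch of the combined check; the per-minute limits and the pessimistic reservation of max_tokens as the output estimate are assumptions:

```python
# Sketch: combined request- and token-based rate limiting in a fixed window.
# Output length isn't known before generation, so the check reserves an
# estimate (the request's max_tokens cap); the limits are illustrative.
import time
from collections import defaultdict

REQUESTS_PER_MIN = 60
TOKENS_PER_MIN = 50_000

_windows = defaultdict(lambda: {"start": 0.0, "requests": 0, "tokens": 0})

def allow(user_id: str, prompt_tokens: int, max_tokens: int) -> bool:
    w = _windows[user_id]
    now = time.monotonic()
    if now - w["start"] >= 60:                       # start a new one-minute window
        w.update(start=now, requests=0, tokens=0)
    estimated = prompt_tokens + max_tokens           # pessimistic reservation
    if (w["requests"] + 1 > REQUESTS_PER_MIN
            or w["tokens"] + estimated > TOKENS_PER_MIN):
        return False
    w["requests"] += 1
    w["tokens"] += estimated
    return True
```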
#6 (about 3 minutes)
Selecting the right authentication and authorization strategy
Bearer tokens offer a flexible solution for managing authentication and fine-grained authorization, such as restricting access to specific models.
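For example, model-level authorization can ride along in the bearer token itself. A minimal sketch assuming PyJWT and a custom "models" claim; the claim layout and secret handling are illustrative:

```python
# Sketch: bearer-token check with model-level authorization baked into
# the token's claims. Claim names and the shared secret are assumptions.
import jwt  # PyJWT

SECRET = "replace-with-a-real-key"  # assumption: HS256 shared secret

def authorize(bearer_token: str, requested_model: str) -> str:
    claims = jwt.decode(bearer_token, SECRET, algorithms=["HS256"])
    if requested_model not in claims.get("models", []):
        raise PermissionError(f"token not authorized for {requested_model}")
    return claims["sub"]  # caller identity, reused for billing and rate limits
```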
#7 (about 2 minutes)
Scaling inference with Kubernetes and smart routing
Use tools like KServe or Knative on Kubernetes for intelligent autoscaling and canary deployments based on custom metrics like queue size.
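Knative scales on request concurrency out of the box; to scale on a custom signal like queue size, the service typically has to export that metric itself. A minimal sketch assuming prometheus_client, with an illustrative metric name, port, and inference stub:

```python
# Sketch: export an inference-queue-depth gauge that an autoscaler
# (e.g. a Prometheus-fed HPA or KEDA) can use as its scaling signal.
import asyncio
from prometheus_client import Gauge, start_http_server

queue_depth = Gauge("llm_inference_queue_size", "Requests waiting for a GPU slot")
request_queue: asyncio.Queue = asyncio.Queue()

async def run_inference(req) -> None:
    await asyncio.sleep(0.1)  # placeholder for the real model call

async def worker() -> None:
    while True:
        req = await request_queue.get()
        queue_depth.set(request_queue.qsize())  # rising depth -> scale out
        await run_inference(req)
        request_queue.task_done()

start_http_server(9100)  # Prometheus scrapes /metrics on this port
```

Each replica runs one or more of these workers; the autoscaler adds replicas when the exported gauge grows and removes them as it drains.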
#8 (about 3 minutes)
Summary of best practices for scalable LLM deployment
Key strategies for success include robust rate limiting, modular design, continuous benchmarking, and using canary deployments for safe production testing.