Marek Suppa
Serverless deployment of (large) NLP models
#1 · about 9 minutes
Exploring practical NLP applications at Slido
Several NLP-powered features are used to enhance user experience, including keyphrase extraction, sentiment analysis, and similar question detection.
#2 · about 4 minutes
Choosing serverless for ML model deployment
Serverless was chosen for its ease of deployment and minimal maintenance, but it introduces challenges like cold starts and strict package size limits.
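One way to picture the cold-start concern: in a Python Lambda, anything initialised at module scope runs once per container and is reused by warm invocations, so the model is typically loaded there rather than inside the handler. A minimal sketch of that pattern (the file name, payload shape and the use of ONNX Runtime, which the talk only introduces in chapter 5, are illustrative, not the speaker's actual code):

```python
import json
import numpy as np
import onnxruntime as ort  # lightweight inference runtime (see chapter 5)

# Module scope runs once per Lambda container: only cold starts pay the
# model-loading cost; warm invocations reuse the session below.
session = ort.InferenceSession("model.onnx")

def handler(event, context):
    # Assumed payload: the caller sends pre-tokenised ids and attention mask.
    feeds = {
        "input_ids": np.asarray(event["input_ids"], dtype=np.int64),
        "attention_mask": np.asarray(event["attention_mask"], dtype=np.int64),
    }
    logits = session.run(None, feeds)[0]
    return {"statusCode": 200, "body": json.dumps(logits.tolist())}
```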
#3 · about 8 minutes
Shrinking large BERT models for sentiment analysis
Knowledge distillation is used to train smaller, faster models like TinyBERT from a large, fine-tuned BERT base model without significant performance loss.
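The chapter describes the idea rather than code, but the usual distillation objective is easy to sketch: blend the ordinary cross-entropy on the gold labels with a KL-divergence term that pulls the student's softened logits towards the teacher's. A minimal PyTorch illustration (temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not the talk's settings):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Standard supervised loss on the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened student and teacher outputs;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy usage with random tensors standing in for real model outputs.
student = torch.randn(8, 3)           # 8 examples, 3 sentiment classes
teacher = torch.randn(8, 3)           # logits from the fine-tuned teacher
labels = torch.randint(0, 3, (8,))
loss = distillation_loss(student, teacher, labels)
```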
#4 · about 8 minutes
Building an efficient similar question detection model
Sentence-BERT (SBERT) provides an efficient alternative to standard BERT for semantic similarity, and knowledge distillation helps create smaller, deployable versions.
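A small illustration of why the SBERT setup is cheap at inference time (using a public MiniLM checkpoint from the sentence-transformers library rather than the distilled model built in the talk): each question is embedded once, and similarity between any pair is a cosine comparison instead of a full cross-encoder pass.

```python
from sentence_transformers import SentenceTransformer, util

# A small public SBERT-style checkpoint; the talk's own distilled model differs.
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

questions = [
    "Will the slides be shared after the talk?",
    "Can we get the presentation afterwards?",
    "What time is lunch?",
]

# Embeddings can be cached; a new question needs one forward pass plus
# cosine similarity against the stored vectors.
embeddings = model.encode(questions, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

print(similarities[0][1])  # high: the first two questions are near-duplicates
print(similarities[0][2])  # low: unrelated question
```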
#5 · about 3 minutes
Using ONNX Runtime for lightweight model inference
The large PyTorch library is replaced with the much smaller ONNX Runtime to fit the model and its dependencies within AWS Lambda's package size limits.
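A rough sketch of how that swap typically looks: the export runs once, offline, where the full PyTorch and transformers stack is available, and only the resulting model.onnx plus the comparatively small onnxruntime wheel ship in the Lambda package (the checkpoint name and opset version below are illustrative, and dedicated exporters such as the optimum tooling can be used instead of calling torch.onnx.export directly).

```python
# Offline export: run once on a build machine, ship only model.onnx + onnxruntime.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, return_dict=False)
model.eval()

dummy = tokenizer("An example sentence.", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={  # allow variable batch size and sequence length at inference
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)
```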
#6 · about 3 minutes
Analyzing serverless ML performance and cost-effectiveness
Increasing allocated RAM for a Lambda function improves inference speed, potentially making serverless more cost-effective than a dedicated server for uneven workloads.
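A back-of-the-envelope version of that trade-off (all prices and workload figures below are illustrative assumptions, not numbers from the talk): Lambda bills per GB-second, so doubling the memory roughly doubles the per-second price, but if it also roughly halves the inference time the per-request cost stays flat, and unlike a dedicated server there is no charge for idle time.

```python
# Illustrative Lambda vs. dedicated-server cost comparison; all numbers are
# assumptions for the sake of the arithmetic, not figures from the talk.

PRICE_PER_GB_SECOND = 0.0000166667   # approximate Lambda compute price
PRICE_PER_REQUEST = 0.0000002        # approximate Lambda request price

def lambda_monthly_cost(requests, duration_s, memory_gb):
    compute = requests * duration_s * memory_gb * PRICE_PER_GB_SECOND
    return compute + requests * PRICE_PER_REQUEST

# More RAM also means more CPU, so assume a 2x speed-up for 2x memory.
low_mem = lambda_monthly_cost(500_000, duration_s=1.0, memory_gb=1)
high_mem = lambda_monthly_cost(500_000, duration_s=0.5, memory_gb=2)
dedicated = 80.0  # assumed flat monthly price of an always-on instance

print(f"Lambda @ 1 GB: ${low_mem:.2f}/month")
print(f"Lambda @ 2 GB: ${high_mem:.2f}/month (same compute cost, half the latency)")
print(f"Dedicated:     ${dedicated:.2f}/month, paid even when the workload is idle")
```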
#7 · about 3 minutes
Key takeaways for deploying NLP models serverlessly
Successful serverless deployment of large NLP models requires aggressive model size reduction, lightweight inference libraries, and an understanding of the platform's limitations.
Matching moments
22:15 MIN
Q&A on monoliths, serverless, and specific use cases
Why you shouldn’t build a microservice architecture
36:27 MIN
Project learnings and future development opportunities
Leverage Cloud Computing Benefits with Serverless Multi-Cloud ML
14:15 MIN
Delivering customizations via decoupled ML models
Building the platform for providing ML predictions based on real-time player activity
07:13 MIN
Identifying the key challenges of serverless functions
Fun with PaaS – How to use Cloud Foundry and its uniqueness in creative ways
12:31 MIN
Practical use cases for serverless architectures
Serverless: Past, Present and Future
13:39 MIN
Why AI agents require modern serverless infrastructure
Postgres in the Age of AI (and Devin)
06:38 MIN
Identifying the key challenges of serverless functions
The Future of Cloud is WebAssembly
46:14 MIN
Audience Q&A on serverless IoT development
Building your way to a serverless powered IOT Buzzwire game
Related Videos
DevOps for AI: running LLMs in production with Kubernetes and KubeFlow
Aarno Aukia
Leverage Cloud Computing Benefits with Serverless Multi-Cloud ML
Linda Mohamed
Unveiling the Magic: Scaling Large Language Models to Serve Millions
Patrick Koss
End the Monolith! Lessons learned adopting Serverless
Nočnica Fee
Multilingual NLP pipeline up and running from scratch
Kateryna Hrytsaienko
Self-Hosted LLMs: From Zero to Inference
Roberto Carratalá & Cedric Clyburn
From ML to LLM: On-device AI in the Browser
Nico Martin
Serverless: Past, Present and Future
Oliver Arafat
Related Articles
View all articles

From learning to earning
Jobs that call for the skills explored in this talk.

AI Systems and MLOps Engineer for Earth Observation
Forschungszentrum Jülich GmbH
Jülich, Germany
Intermediate
Senior
Linux
Docker
AI Frameworks
Machine Learning

Machine Learning Engineer - Large Language Models (LLM) - Startup
Startup
Charing Cross, United Kingdom
PyTorch
Machine Learning

Machine Learning Engineer | NLP | AWS · Barcelona · Hybrid
Tecdata
Barcelona, Spain
Intermediate
Machine Learning
Amazon Web Services (AWS)

Deep Learning Engineer for Language Technologies (RE2)
Barcelona Supercomputing Center
Barcelona, Spain
Intermediate
Python
PyTorch
Machine Learning

Machine Learning Performance Engineer, London
Isomorphic Labs
Charing Cross, United Kingdom
£62K
Machine Learning

MLOps Engineer (Kubernetes, Cloud, ML Workflows)
FitNext Co
Charing Cross, United Kingdom
Remote
Intermediate
DevOps
Python
Docker
Grafana
+6