Software Engineer - Machine Learning
Job description
We're looking for a Senior Machine Learning Engineer to accelerate our AI research-to-production pipeline. You'll build and improve the infrastructure that enables our research team to rapidly deploy and safely test new models, while helping ensure our production inference systems remain efficient, scalable, and reliable. You'll identify gaps and opportunities in our ML infrastructure, scope solutions to ambiguous technical problems, and help set the technical direction for how we bridge research innovation and production reliability. This role requires a strong backend engineering background in distributed systems and containerization, and a track record of independently driving projects from concept to delivery. This is a cross-functional role that requires close collaboration with both research teams developing models and engineering teams supporting the broader platform.
- Design and implement tooling that enables researchers to quickly deploy and evaluate new models in production
- Design, build, and maintain high-performance, cost-efficient inference pipelines, making architectural decisions about scaling, reliability, and cost trade-offs
- Proactively identify and resolve infrastructure bottlenecks, proposing and scoping improvements to iteration speed and production reliability
- Develop and maintain user-facing APIs that interact with our ML systems
- Implement comprehensive observability solutions to monitor model performance and system health
- Troubleshoot and lead resolution of complex production issues across distributed systems, driving root-cause analysis and implementing preventive measures
- Set the direction for and continuously improve our MLOps practices, identifying the highest-impact opportunities to reduce friction between research and production
- Collaborate closely with research and engineering teams to align on technical direction, and help onboard and mentor engineers on ML infrastructure best practices
Requirements
- Strong backend engineering experience with Python
- Experience building and operating distributed, containerized applications, preferably on AWS
- Proficiency implementing observability solutions (monitoring, logging, alerting, tracing) for production systems
- Ability to design and implement resilient, scalable architectures
- Track record of independently scoping and delivering complex technical projects from problem identification through production deployment
- Comfort navigating ambiguity and making pragmatic technical decisions when requirements are unclear or evolving
An ideal candidate should also have some of the following:
- MLOps experience, including familiarity with PyTorch and Kubernetes
- Experience working in fast-paced environments where you owned technical direction for an area and drove projects with minimal oversight
- Experience collaborating with remote, globally distributed teams
- Comfort working across the entire ML lifecycle from model serving to API development
- Experience in audio-related domains (ASR, TTS, or other domains involving audio processing)
- Experience with other cloud providers
- Familiarity with Bazel and monorepos
- Experience with alternative ML inference frameworks beyond PyTorch
- Experience with other programming languages
- Experience mentoring junior engineers or onboarding teammates onto complex systems