DevOps / MLOps Engineer
Job description
A leading global technology business is seeking a Senior MLOps Engineer to support the evolution and scalability of its machine learning infrastructure. This role offers the opportunity to work on a high-traffic platform with millions of daily data points, delivering meaningful real-world impact through advanced ML systems across areas like content recommendation, safety, and user engagement. The ideal candidate will bring deep experience in managing scalable Kubernetes environments, cloud-native infrastructure, and MLOps tooling, enabling rapid iteration and high-throughput model deployment.

- Scale and optimise an internal MLOps platform used across multiple ML-focused teams
- Drive automation, testing reliability, and performance improvements across ML pipelines
- Manage and fine-tune GPU-accelerated Kubernetes clusters to support high-availability, cloud-native workloads
- Support production readiness and system reliability through on-call participation and proactive monitoring
- Evaluate and implement modern MLOps tooling in alignment with the company's cloud and ML strategy
- Collaborate closely with machine learning engineers and product stakeholders to ensure infrastructure meets evolving project demands
- Share knowledge across teams to elevate engineering standards in DevOps, MLOps, and infrastructure reliability
Requirements
- Strong experience managing GPU-enabled Kubernetes clusters at scale
- Deep understanding of the full ML lifecycle: experimentation, training, deployment, versioning, and monitoring
- Proficiency in languages like Python, Go, or similar, with an emphasis on automation and ML tooling
- Proven track record of building infrastructure that accelerates experimentation and model deployment in cloud environments
- Familiarity with CI/CD tools such as ArgoCD, GitHub Actions, or similar, especially for ML use cases
- Experience with observability tools such as Prometheus, Grafana, and cloud-native monitoring solutions
- Comfortable contributing to incident response and participating in an on-call rotation
- Solid experience with containerisation technologies like Docker in hybrid or fully cloud-native environments
- Working knowledge of Terraform and Infrastructure-as-Code principles
- Keen interest in emerging MLOps technologies and cloud-native best practices
- Self-motivated, inquisitive, and passionate about continuous learning
- Experience with AWS or similar cloud platforms is highly desirable, especially in the ML domain