DevOps / MLOps Engineer

TechNET IT Recruitment

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

Amazon Web Services (AWS)

Continuous Integration

DevOps

GitHub

Python

Machine Learning

Recommender Systems

Prometheus

Management of Software Versions

Grafana

Reliability of Systems

Kubernetes

Performance Monitor

Machine Learning Operations

Terraform

Docker

Job description

A leading global technology business is seeking a Senior MLOps Engineer to support the evolution and scalability of its machine learning infrastructure. This role offers the opportunity to work on a high-traffic platform with millions of daily data points, enabling meaningful real-world impact through advanced ML systems across areas like content recommendation, safety, and user engagement. The ideal candidate will bring deep experience in managing scalable Kubernetes environments, cloud-native infrastructure, and MLOps tooling, enabling rapid iteration and high-throughput model deployment.

Scale and optimise an internal MLOps platform used across multiple ML-focused teams
Drive automation, testing reliability, and performance improvements across ML pipelines
Manage and fine-tune GPU-accelerated Kubernetes clusters to support high-availability, cloud-native workloads
Support production readiness and system reliability through on-call participation and proactive monitoring
Evaluate and implement modern MLOps tooling in alignment with the company's cloud and ML strategy
Collaborate closely with machine learning engineers and product stakeholders to ensure infrastructure meets evolving project demands
Share knowledge across teams to elevate engineering standards in DevOps, MLOps, and infrastructure reliability

Requirements

Strong experience managing GPU-enabled Kubernetes clusters at scale
Deep understanding of the full ML lifecycle: experimentation, training, deployment, versioning, and monitoring
Proficiency in languages like Python, Go, or similar, with an emphasis on automation and ML tooling
Proven track record building infrastructure that accelerates experimentation and model deployment in cloud environments
Familiarity with CI/CD tools such as ArgoCD, GitHub Actions, or similar, especially for ML use cases
Experience with observability tools such as Prometheus, Grafana, and cloud-native monitoring solutions
Comfortable contributing to incident response and participating in an on-call rotation
Solid experience with containerisation technologies like Docker in hybrid or fully cloud-native environments
Working knowledge of Terraform and Infrastructure-as-Code principles
Keen interest in emerging MLOps technologies and cloud-native best practices
Self-motivated, inquisitive, and passionate about continuous learning
Experience with AWS or similar cloud platforms is highly desirable, especially in the ML domain