Senior Data & MLOps Engineer
Role details
Job description
The Data Science team is building an advanced reliability platform. The platform spans data intake, metric derivation, anomaly detection, failure prediction, straggler detection in distributed systems, and automated root cause analysis. We collaborate closely with internal teams such as Fleet, Infrastructure, and AI Platform to improve system stability, optimize resource use, shorten resolution times, and protect service availability and financial performance.
As a Senior Data & MLOps Engineer, you will design and scale the infrastructure supporting the GPU Intelligence Platform. This involves building pipelines for data, features, and model training, and delivering insights and predictions for system health and optimization. You will take the system from initial prototypes to a production environment operating across the fleet, focusing on scalability, separation of real-time serving from periodic processing, and dynamic resource management based on system load and data frequency. You will architect and deploy these scalable distributed services using orchestration technologies.
- Design and implement scalable data ingestion pipelines.
- Build feature processing and baseline computation systems.
- Productionize models for prediction and detection.
- Develop and operate low-latency services and robust offline workflows.
- Architect horizontally scalable services with clear separation between components, leveraging orchestration for distribution.
- Implement monitoring and feedback loops for continuous model and signal improvement.
- Collaborate with Platform teams to integrate operational signals into monitoring and diagnostics.
- Implement a scalable solution for mitigation and structured analysis.
Requirements
Do you have experience in Spark?
Do you have a Master's degree?
- 7+ years of experience in data engineering, distributed systems, MLOps, or ML infrastructure roles in production environments.
- Proven experience building high-throughput streaming or telemetry pipelines (e.g., Kafka, Pulsar, Kinesis, or equivalent).
- Strong experience designing time-series feature pipelines and operating large-scale observability systems.
- Experience building and maintaining feature stores and ensuring offline/online feature parity.
- Hands-on experience deploying ML models to production, including versioning, monitoring, rollback, and drift detection.
- Experience designing scalable microservices deployed in Kubernetes-based environments.
- Strong proficiency in Python and at least one systems language (Go, Rust, or C++).
- Experience working with distributed compute or training systems (e.g., NCCL, PyTorch Distributed, Spark, Ray, Slurm).
- Familiarity with GPU telemetry systems such as NVML or DCGM and hardware-level monitoring concepts.
- Demonstrated experience scaling systems from Proof-of-Concept to production-grade, fleet-level deployments.
Preferred:
- Experience working on GPU fleet management, hyperscale infrastructure, or AI training clusters.
- Experience building anomaly detection or failure prediction systems for hardware or distributed systems.
- Experience implementing distributed straggler detection or collective-level performance analysis systems.
- Experience developing agentic or LLM-powered reasoning systems for diagnostics or operational intelligence.
- Background in reliability engineering or SRE practices.
Wondering if you're a good fit? We believe in investing in our people, and we value candidates who bring diverse experiences to our teams, even if you aren't a 100% skill or experience match. Here are a few qualities we've found to fit well with our team. If some of this describes you, we'd love to talk.
- You love building systems that turn raw infrastructure telemetry into actionable intelligence.
- You're curious about distributed systems failure modes, GPU performance pathologies, and reliability engineering at scale.
- You're excited by the idea of moving from anomaly detection to prediction to autonomous root cause reasoning.
- You enjoy designing platforms that protect uptime, revenue, and customer trust through proactive systems thinking.
Benefits & conditions
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:
- Family-level Medical Insurance
- Family-level Dental Insurance
- Generous Pension Contribution
- Life Assurance at 4x Salary
- Critical Illness Cover
- Employee Assistance Programme
- Tuition Reimbursement
- Work culture focused on innovative disruption