Senior Data & MLOps Engineer

CoreWeave Europe

Charing Cross, United Kingdom

3 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Charing Cross, United Kingdom

Tech stack

Artificial Intelligence

C++

Information Engineering

Distributed Systems

Python

Reliability Engineering

Management of Software Versions

Data Processing

Data Ingestion

PyTorch

Large Language Models

Spark

Kubernetes

Kafka

Slurm

Machine Learning Operations

Microservices

Job description

The Data Science team is focused on developing an advanced reliability platform. This system covers various aspects of data processing and analysis, including data intake, deriving meaningful metrics, identifying unusual patterns, predicting potential issues, finding slow processes in distributed systems, and using automated analysis to determine causes. We collaborate closely with internal teams like Fleet, Infrastructure, and AI Platform to enhance system stability, optimize resource use, shorten resolution times, and maintain service availability and financial performance.

As a Senior Data & MLOps Engineer, you will design and scale the infrastructure supporting the GPU Intelligence Platform. This involves building pipelines for handling data, features, model training, and delivering insights and predictions for system health and optimization. You will transition the system from initial prototypes to a production environment operating across the fleet, focusing on scalability, separating real-time service from periodic processing, and dynamic resource management based on system load and data frequency. You will architect and deploy these scalable distributed services using orchestration technologies.

Design and implement scalable data ingestion pipelines.
Build feature processing and baseline computation systems.
Productionize models for prediction and detection.
Develop and operate low-latency services and robust offline workflows.
Architect horizontally scalable services with clear separation between components, leveraging orchestration for distribution.
Implement monitoring and feedback loops for continuous model and signal improvement.
Collaborate with Platform teams to integrate operational signals into monitoring and diagnostics.
Implement a scalable solution for mitigation and structured analysis.

Requirements

Do you have experience in Spark?
Do you have a Master's degree?

7+ years of experience in data engineering, distributed systems, MLOps, or infrastructure ML roles in production environments.
Proven experience building high-throughput streaming or telemetry pipelines (e.g., Kafka, Pulsar, Kinesis, or equivalent).
Strong experience designing time-series feature pipelines and operating large-scale observability systems.
Experience building and maintaining feature stores and ensuring offline/online feature parity.
Hands-on experience deploying ML models to production, including versioning, monitoring, rollback, and drift detection.
Experience designing scalable microservices deployed in Kubernetes-based environments.
Strong proficiency in Python and at least one systems language (Go, Rust, or C++).
Experience working with distributed compute or training systems (e.g., NCCL, PyTorch Distributed, Spark, Ray, Slurm).
Familiarity with GPU telemetry systems such as NVML or DCGM and hardware-level monitoring concepts.
Demonstrated experience scaling systems from Proof-of-Concept to production-grade, fleet-level deployments.

Preferred:

Experience working on GPU fleet management, hyperscale infrastructure, or AI training clusters.
Experience building anomaly detection or failure prediction systems for hardware or distributed systems.
Experience implementing distributed straggler detection or collective-level performance analysis systems.
Experience developing agentic or LLM-powered reasoning systems for diagnostics or operational intelligence.
Background in reliability engineering or SRE practices.

Wondering if you're a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams - even if you aren't a 100% skill or experience match. Here are a few qualities we've found compatible with our team. If some of this describes you, we'd love to talk.

You love building systems that turn raw infrastructure telemetry into actionable intelligence.
You're curious about distributed systems failure modes, GPU performance pathologies, and reliability engineering at scale.
You're excited by the idea of moving from anomaly detection to prediction to autonomous root cause reasoning.
You enjoy designing platforms that protect uptime, revenue, and customer trust through proactive systems thinking.

Benefits & conditions

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

Family-level Medical Insurance
Family-level Dental Insurance
Generous Pension Contribution
Life Assurance at 4x Salary
Critical Illness Cover
Employee Assistance Programme
Tuition Reimbursement
Work culture focused on innovative disruption

About the company

CoreWeave is The Essential Cloud for AI. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast! We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a little chaos, and we're constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core
Act Like an Owner
Empower Employees
Deliver Best-in-Class Client Experiences
Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization's growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

To fulfil our obligation to protect client data, successful applicants offered employment with CoreWeave will be required to complete a basic criminal record check, conducted in compliance with GDPR. Employment offers are conditional upon receiving satisfactory check results.