Senior MLOps Platform Architect (AWS | Kubernetes | Terraform)

Salve.Inno Consulting
Municipality of Madrid, Spain
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Municipality of Madrid, Spain

Tech stack

Artificial Intelligence
Airflow
Amazon Web Services (AWS)
Cloud Computing
Computer Programming
Continuous Integration
DevOps
Distributed Systems
Python
Machine Learning
Prometheus
System Availability
Large Language Models
Grafana
GitLab
Kubernetes
Machine Learning Operations
Terraform
Natural Language Understanding
Docker
Jenkins

Job description

and high availability.

Kubernetes & Model Deployment

* Architect, build, and operate production Kubernetes clusters.
* Containerize and productize ML models (Docker, Helm).
* Deploy latency-sensitive and high-throughput models (ASR / TTS / NLU / Agentic AI).
* Ensure GPU and accelerator nodes are properly integrated and optimized.

CI / CD for Machine Learning

* Build automated training, validation, and deployment pipelines (GitLab / Jenkins).
* Implement canary, blue-green, and automated rollback strategies.
* Integrate MLOps lifecycle tools (MLflow, Kubeflow, SageMaker Model Registry, etc.).

Observability & Reliability

* Implement full observability (Prometheus, Grafana).
* Own uptime, performance, and reliability for ML production services.
* Establish monitoring for latency, drift, model health, and infrastructure health.

Collaboration & Technical Leadership

* Work closely with ML engineers, researchers, and data scientists.
* Translate experimental models into production-ready deployments.
* Define best practices for MLOps across the company.

Requirements

* 5 years in a senior DevOps, SRE, or MLOps engineering role supporting production environments.
* Strong experience designing, building, and maintaining Kubernetes clusters in production.
* Hands-on expertise with Terraform (or similar IaC tools) to manage cloud infrastructure.
* Solid programming skills in Python or Go for building automation, tooling, and ML workflows.
* Proven experience creating and maintaining CI / CD pipelines (GitLab or Jenkins).
* Practical experience deploying and supporting ML models in production (e.g. ASR, TTS, NLU, LLM / Agentic AI).
* Familiarity with ML workflow orchestration tools such as Kubeflow, Apache Airflow, or similar.
* Experience with experiment tracking and model registry tools (e.g. MLflow, SageMaker Model Registry).
* Exposure to deploying models on GPU or specialized hardware (e.g. Inferentia, Trainium).
* Solid understanding of cloud infrastructure on AWS, including networking, scaling, storage, and security best practices.
* Experience with deployment tooling (Docker, Helm) and observability stacks (Prometheus, Grafana).

Ways to Know You'll Succeed

* You enjoy building platforms from the ground up and owning technical decisions.
* You're comfortable collaborating with ML engineers, researchers, and software teams to turn research into stable production systems.
* You like solving performance, automation, and reliability challenges in distributed systems.
* You bring a structured, pragmatic, and scalable approach to infrastructure design.
* Energetic and proactive individual with a natural drive to take initiative and move things forward.
* Enjoys working closely with people: researchers, ML engineers, cloud architects, product teams.
* Comfortable sharing ideas openly, challenging assumptions, and contributing to technical discussions.
* Collaborative mindset: you like to build together, not work in isolation.
* Strong ownership mentality: you enjoy taking responsibility for systems end-to-end.
* Curious, hands-on, and motivated by solving complex technical challenges.
* Clear communicator who can translate technical work into practical recommendations.
* Thrives in a fast-paced environment where you can experiment, improve, and shape how things are done.

Benefits & conditions

What's on Offer

* Competitive fixed compensation based on experience and expertise.
* Work on cutting-edge AI systems used globally.
* Dynamic, multi-disciplinary teams engaged in digital transformation.
* Remote-first work model
* Long-term B2B contract
* 20 days paid time off
* Apple gear
* Training & development budget

Diversity and Inclusion Commitment

We are dedicated to creating and sustaining an inclusive, respectful workplace for all, regardless of gender, ethnicity, or background. We actively encourage applicants of all identities and experience levels to apply and bring their authentic selves to our fast-paced, supportive team.

Key Skills: Apache Hive, S3, Redshift, Spark, AWS, Solr, NoSQL, Data Warehouse, Internet Of Things, Kafka, DynamoDB, ZooKeeper
Employment Type: Full Time
Vacancy: 1
