Senior MLOps Platform Architect (AWS | Kubernetes | Terraform)
Job description
and high availability.

Kubernetes & Model Deployment
* Architect, build, and operate production Kubernetes clusters.
* Containerize and productize ML models (Docker, Helm).
* Deploy latency-sensitive and high-throughput models (ASR / TTS / NLU / Agentic AI).
* Ensure GPU and accelerator nodes are properly integrated and optimized.

CI / CD for Machine Learning
* Build automated training, validation, and deployment pipelines (GitLab / Jenkins).
* Implement canary, blue-green, and automated rollback strategies.
* Integrate MLOps lifecycle tools (MLflow, Kubeflow, SageMaker Model Registry, etc.).

Observability & Reliability
* Implement full observability (Prometheus, Grafana).
* Own uptime, performance, and reliability for ML production services.
* Establish monitoring for latency, drift, model health, and infrastructure health.

Collaboration & Technical Leadership
* Work closely with ML engineers, researchers, and data scientists.
* Translate experimental models into production-ready deployments.
* Define best practices for MLOps across the company.

Requirements
* 5 years in a Senior DevOps, SRE, or MLOps Engineering role supporting production environments.
* Strong experience designing, building, and maintaining Kubernetes clusters in production.
* Hands-on expertise with Terraform (or similar IaC tools) to manage cloud infrastructure.
* Solid programming skills in Python or Go for building automation, tooling, and ML workflows.
* Proven experience creating and maintaining CI / CD pipelines (GitLab or Jenkins).
* Practical experience deploying and supporting ML models in production (e.g. ASR, TTS, NLU, LLM / Agentic AI).
* Familiarity with ML workflow orchestration tools such as Kubeflow, Apache Airflow, or similar.
* Experience with experiment tracking and model registry tools (e.g. MLflow, SageMaker Model Registry).
* Exposure to deploying models on GPU or specialized hardware (e.g. Inferentia, Trainium).
* Solid understanding of cloud infrastructure on AWS, including networking, scaling, storage, and security best practices.
* Experience with deployment tooling (Docker, Helm) and observability stacks (Prometheus, Grafana).

Ways to Know You'll Succeed
* You enjoy building platforms from the ground up and owning technical decisions.
* You're comfortable collaborating with ML engineers, researchers, and software teams to turn research into stable production systems.
* You like solving performance, automation, and reliability challenges in distributed systems.
* You bring a structured, pragmatic, and scalable approach to infrastructure design.
* Energetic and proactive individual with a natural drive to take initiative and move things forward.
* Enjoys working closely with people: researchers, ML engineers, cloud architects, product teams.
* Comfortable sharing ideas openly, challenging assumptions, and contributing to technical discussions.
* Collaborative mindset: you like to build together, not work in isolation.
* Strong ownership mentality: you enjoy taking responsibility for systems end-to-end.
* Curious, hands-on, and motivated by solving complex technical challenges.
* Clear communicator who can translate technical work into practical recommendations.
* Thrives in a fast-paced environment where you can experiment, improve, and shape how things are done.

Benefits & conditions
What's on Offer:
* Competitive fixed compensation based on experience and expertise.
* Work on cutting-edge AI systems used globally.
* Dynamic, multi-disciplinary teams engaged in digital transformation.
* Remote-first work model
* Long-term B2B contract
* 20 days paid time off
* Apple gear
* Training & development budget

Diversity and Inclusion Commitment
We are dedicated to creating and sustaining an inclusive, respectful workplace for all, regardless of gender, ethnicity, or background. We actively encourage applicants of all identities and experience levels to apply and bring their authentic selves to our fast-paced, supportive team.

Key Skills: Apache Hive, S3, Redshift, Spark, AWS, Solr, NoSQL, Data Warehouse, Internet Of Things, Kafka, DynamoDB, ZooKeeper
Employment Type: Full Time
Experience: years
Vacancy: 1