Platform Engineer DevOps

W3global
München, Germany
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

München, Germany

Tech stack

JavaScript
Airflow
Amazon Web Services (AWS)
JIRA
Configuration Management
Computer Networks
Continuous Integration
Data Governance
DevOps
GitHub
Identity and Access Management
Python
NoSQL
OpenCV
Role-Based Access Control
Prometheus
Shell Script
SQL Databases
Systems Integration
TypeScript
Management of Software Versions
Data Logging
Graphics Processing Unit (GPU)
Autoscaling
Grafana
Concurrency
Git
GitLab CI
Kubernetes
Information Technology
Machine Learning Operations
CloudWatch
Zendesk
Terraform
Splunk
New Relic (SaaS)
GxP
Jenkins

Job description

  • Design, deploy, and maintain Kubeflow (or equivalent) for pipeline orchestration, model training, evaluation, and serving on large image datasets; ensure reliability, security, and cost efficiency.
  • Manage and tune Kubernetes clusters (EKS/GKE/AKS); set up namespaces, RBAC, autoscaling, network policies, and service meshes where appropriate; keep upgrades and operations predictable.
  • Define infrastructure-as-code with Terraform; implement repeatable environment provisioning, configuration management, and golden paths for teams.
  • Establish CI/CD workflows (GitHub Actions/Jenkins/GitLab CI), build/test standards, and progressive delivery patterns that keep releases fast and low-risk.
  • Implement logging, metrics, and tracing (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) with actionable SLOs, alerts, and runbooks; embed security and compliance by design.
  • Collaborate closely with product and science teams to remove bottlenecks, eliminate manual steps, and evolve service and data interfaces that make operating image pipelines simple and reliable.
  • Contribute to future-state architectures that improve scalability, resiliency, and operational efficiency; lead targeted refactors and platform improvements.
  • Manage core automation and tooling, and educate teams on platform capabilities, CI/CD, configuration management, and infrastructure automation best practices.

Requirements

  • M.Sc. in Computer Science/Engineering (or equivalent) or comparable industry experience.
  • Practical, production experience operating Kubeflow Pipelines for reproducible ML workflows at scale.
  • Proven experience deploying and operating workloads on Kubernetes (EKS/GKE/AKS), including upgrades, autoscaling, RBAC, networking, and reliability; strong Unix/Linux fundamentals.
  • Hands-on experience with AWS services (EKS, EC2, S3, IAM, CloudWatch; RDS a plus) and the ability to design secure, cost-aware architectures.
  • Strong Terraform skills and Git-based workflows for repeatable infrastructure provisioning and configuration management.
  • Practical experience with CI/CD platforms (GitHub Actions/Jenkins/GitLab CI), including artifact management, environment promotion, and progressive delivery.
  • Solid Python and/or shell scripting for platform automation and toil reduction.
  • Experience implementing logging, metrics, and tracing with SLOs, alerts, and runbooks (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) and a security-first mindset.
  • Ability to lead technical initiatives, communicate trade-offs clearly, and collaborate effectively with engineering and science teams.

Desirable (Nice to have):

  • Experience with MLflow, Feast, Argo, Airflow, Ray, and model versioning/monitoring.
  • Familiarity with S3/object storage, artifact registries, and handling large image datasets; basic SQL/NoSQL exposure.
  • Experience with digital pathology or large-scale image processing (e.g., whole-slide images) and tools like OpenSlide, scikit-image, or OpenCV.
  • Experience tuning high-throughput pipelines, concurrency, memory usage, and integrating GPUs/accelerators.
  • Experience with VPC design, ingress/egress, service meshes, secrets management, IAM, and policy as code.
  • Experience in regulated environments (e.g., GxP), including data governance, privacy, and building software under regulated processes.
  • Experience with Jira/Zendesk and with JavaScript/TypeScript for internal tools or dashboards.
