Senior DevOps Engineer (Big Data)

Xebia
Carballedo, Spain

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Carballedo, Spain

Tech stack

Bash
Big Data
Cloud Storage
Data Transmission
Software Debugging
Linux
DevOps
File Systems
Job Scheduling
Python
Package Management Systems
Performance Tuning
Ansible
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Spark
Amazon Web Services (AWS)
Data Management
Slurm
Terraform
Docker

Job description

  • Lead the migration of on-premises SLURM-based HPC clusters to Google Cloud Platform.
  • Optimize SLURM configurations and workflows to ensure efficient use of cloud resources.
  • Automate cluster deployment, configuration, and maintenance tasks using scripting languages (Python, Bash) and automation tools (Ansible, Terraform); an illustrative sketch follows this list.
  • Integrate the HPC software stack using tools such as Spack for dependency management and easy installation of HPC libraries and applications.
  • Provide expert-level support for performance tuning, job scheduling, and cluster resource optimization.
  • Stay current with emerging HPC technologies and GCP services to continually improve HPC cluster performance and cost efficiency.
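
To give candidates a concrete feel for the automation work described above, here is a minimal Python sketch of one routine task: draining idle SLURM nodes ahead of a maintenance window by shelling out to sinfo and scontrol. It is illustrative only, and the maintenance reason string is a placeholder, not part of any real cluster configuration.

    #!/usr/bin/env python3
    """Drain idle SLURM nodes ahead of a maintenance window.

    Minimal sketch: assumes sinfo/scontrol are on PATH and the caller
    has sufficient SLURM privileges. The reason string is illustrative.
    """
    import subprocess

    def idle_nodes() -> list[str]:
        # One "<name> <state>" line per node (-N per-node view, -h no header).
        out = subprocess.run(
            ["sinfo", "-h", "-N", "-o", "%N %T"],
            capture_output=True, text=True, check=True,
        ).stdout
        nodes = set()
        for line in out.splitlines():
            name, state = line.split()[:2]
            if state == "idle":  # skip allocated, drained, powered-down, etc.
                nodes.add(name)
        return sorted(nodes)

    def drain(node: str, reason: str) -> None:
        # scontrol requires a Reason when setting State=DRAIN.
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", f"Reason={reason}"],
            check=True,
        )

    if __name__ == "__main__":
        for node in idle_nodes():
            drain(node, "scheduled-maintenance")
            print(f"drained {node}")

Shelling out to the SLURM CLI rather than binding to libslurm keeps a script like this portable across SLURM versions, which matters during a phased migration from on-premises clusters to GCP.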

Requirements

  • Minimum 5 years of experience with HPC environments, including SLURM workload manager, MPI, and other HPC-related software.
  • Extensive hands-on experience managing Linux-based systems, including performance tuning and troubleshooting in an HPC context.
  • Proven experience migrating and managing SLURM clusters in cloud environments, preferably GCP.
  • Proficiency with automation tools such as Ansible and Terraform for cluster deployment and management.
  • Experience with Spack for managing and deploying HPC software stacks (a second sketch follows this list).
  • Strong scripting skills in Python, Bash, or similar languages for automating cluster operations.
  • In-depth knowledge of GCP services relevant to HPC, such as Compute Engine (GCE), Cloud Storage, and VPC networking.
  • Google Cloud Professional DevOps Engineer certification, or a similar GCP certification.
  • Experience with performance profiling and debugging tools for HPC applications.
  • Advanced knowledge of HPC data management strategies, including parallel file systems and data transfer tools.
  • Experience with Singularity and Docker in HPC contexts.
  • Experience with Spark or other big data tools in an HPC environment is a plus.
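
In the same informal spirit, the Spack requirement above can be pictured with the sketch below, which wraps spack find --json and spack install to ensure a set of packages is present on a node image. The package specs are hypothetical, not a stack this role prescribes.

    #!/usr/bin/env python3
    """Ensure a set of Spack packages is installed.

    Minimal sketch: assumes `spack` is on PATH and already bootstrapped.
    The specs below are illustrative, not a required stack.
    """
    import json
    import subprocess

    WANTED = ["openmpi", "fftw", "hdf5+mpi"]  # hypothetical spec list

    def installed_names() -> set[str]:
        # `spack find --json` emits a JSON list of installed spec records.
        out = subprocess.run(
            ["spack", "find", "--json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return {spec["name"] for spec in json.loads(out)}

    def ensure(spec: str, installed: set[str]) -> None:
        # Strip variants/versions to compare by package name only.
        name = spec.split("+")[0].split("@")[0]
        if name not in installed:
            subprocess.run(["spack", "install", spec], check=True)

    if __name__ == "__main__":
        installed = installed_names()
        for spec in WANTED:
            ensure(spec, installed)

Because the script checks what is already present before installing, it is idempotent and safe to run from an Ansible task or a node start-up hook.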

Apply for this position