Head of Site Reliability Engineering
Job description
We are looking for a Head of Site Reliability Engineering (SRE) to lead the SRE division and take end-to-end ownership of reliability across our platform. In this role, you will:
- Define and drive the SRE strategy, vision, and roadmap for dLocal.
- Lead and grow a multi-region SRE organization, including SRE Technical Referents and engineers at different seniority levels.
- Partner closely with Product, Engineering, and Platform leaders to ensure we can scale safely, with clear reliability guardrails and strong operational excellence.
This is a high-impact, hands-on leadership role, reporting to the VP of Cloud Platform, for someone who can move comfortably between strategy, architecture, and execution while coaching and empowering a senior, distributed team.
- Own the global reliability strategy for dLocal's platforms and services, aligning SRE goals with company and product objectives.
- Define and socialize SRE standards and principles (SLIs/SLOs/SLAs, error budgets, production readiness, incident management practices, capacity planning, etc.).
- Lead the SRE division: set org structure, define roles and scopes, and drive hiring, performance, and career development.
- Build a culture of high ownership, continuous improvement, and data-driven decisions across all reliability-related work.
- Ensure our most critical systems meet or exceed availability, latency, and performance targets.
- Oversee and continuously evolve incident management (on-call strategy, incident response, communication, postmortems, follow-ups, and KPIs).
- Own the strategy for observability and monitoring (metrics, logs, traces) and alerting across all environments, including tool selection, standards, and adoption.
- Drive operational excellence: reduce toil via automation, improve deployment safety, and standardize production practices across teams.
Architecture and technical direction
- Partner with Architecture, Platform, and Product Engineering leaders to define reliable, scalable architectures for our core systems and critical flows.
- Guide the adoption of best practices in automation and Infrastructure as Code (IaC) across SRE and dependent engineering teams.
- Sponsor and oversee large cross-team reliability programs, such as major observability migrations, resilience testing frameworks, or reliability improvements for key products.
- Provide senior technical leadership on capacity planning, performance engineering, resilience, and disaster recovery.
- Lead, mentor, and coach SRE Leaders, Technical Referents, and senior ICs, helping them grow in both technical depth and leadership.
- Collaborate closely with:
  - Product & Engineering to balance feature delivery and reliability.
  - Security, Cloud Platform, and Infrastructure to ensure secure and robust foundations.
  - Business stakeholders (e.g., Operations, Support, Commercial) to align on reliability expectations and SLAs.
- Communicate clearly about risk, trade-offs, and priorities to both technical and non-technical audiences, including senior leadership.
Requirements
- Solid experience leading SRE / Production Engineering / Platform teams in high-availability, high-scale environments (experience in fintech, payments, or similarly critical domains is a plus).
- Proven track record managing managers and senior ICs, building and scaling distributed technical teams.
- Deep hands-on expertise in:
  - Reliability engineering: SLIs/SLOs, error budgets, capacity planning, resilience, and disaster recovery.
  - Incident management: on-call models, incident response, postmortems, and continuous improvement of incident processes.
  - Observability and monitoring: metrics, logs, traces, alerting strategies, and the surrounding tool ecosystem.
  - Automation and IaC: strong familiarity with modern CI/CD pipelines, configuration management, and infrastructure as code.
- Ability to shape technical strategy, translate it into a clear roadmap, and ensure consistent execution across multiple teams.
- Excellent communication and influencing skills; comfortable driving alignment across Engineering, Product, and non-technical stakeholders.
- Strong analytical and problem-solving skills, able to operate effectively in ambiguous, fast-changing contexts.
- Professional proficiency in English; comfortable working in a global, multi-time-zone, multicultural environment.
- Experience in payments / fintech or other regulated, mission-critical industries.
- Hands-on background as an SRE, Senior/Staff Engineer, or Platform Engineer before moving into leadership.
- Experience implementing or maturing:
  - Centralized observability platforms and unified alerting strategies.
  - Standardized production readiness reviews and reliability sign-off processes.
  - Chaos engineering / resilience testing practices.