Site Reliability Manager (40447)
As a Site Reliability Manager, you will act as a domain coach and advocate for best practices in a complex environment. You will be responsible for identifying and eliminating single points of failure and systemic risks, ensuring overall stability. You will collaborate on capacity planning to mitigate risks and participate in change advisory processes. Your experience leading incident management and strong technical understanding of stability topics will be key to your success.
🚀 Project
- own and oversee the reliability maturity of your assigned domain
- define and execute a stability improvement roadmap
- identify and eliminate single points of failure and systemic risks
- act as a lead technical expert during major incidents
- ensure sufficient observability and monitoring
- collaborate with engineering leads, product owners, and company-wide programs
- drive blameless postmortems and systematic fixes
- collaborate on capacity planning to mitigate risks
- participate in change advisory processes to assess risk
- act as a domain coach and advocate for best practices
- guide the development path for engineers
🎯 Skills
- strong experience in software engineering, system administration, or infrastructure roles
- deep technical understanding of stability-related topics
- familiarity with reliability frameworks (SRE, ITIL, DevOps)
- proficiency with observability tools (Prometheus, Grafana, ELK, etc.)
- experience leading or contributing to incident management and root cause analysis
- excellent communication skills
- experience with Change, Incident, and Problem Management frameworks
- English at B2 level