Site Reliability Engineer
Job Description
GAQ127R40
Team: IT Infrastructure and Operations
About the Role
At Databricks Information Technology, we are a product-led organization transforming how we work—from the ease of using our IT services to the applications we develop to scale seamlessly during rapid growth.
As a Site Reliability Engineer (SRE), you will bridge the gap between software engineering and systems architecture. You will be a core contributor to the IT Infrastructure team, owning the evolution of core infrastructure and observability platforms. This role requires a strong software engineering mindset and deep technical breadth to deliver high-quality, scalable solutions for "immature" system problems. Your focus will be on building resilient, automated infrastructure that empowers development teams and ensures our cloud environment is cost-optimized, secure, and highly available.
The Impact You Will Have
Architect and Automate: Design and deploy production-grade infrastructure on cloud platforms (AWS/Azure) using Infrastructure as Code (IaC) tools like Terraform or Pulumi.
Reliability and Performance Engineering: Optimize system performance, architecture, and scaling to ensure maximum uptime and minimal latency for critical IT services.
CI/CD Excellence: Architect robust deployment pipelines (e.g., GitHub Actions), managing both hosted and self-hosted runners for specialized build requirements.
Observable by Default: Create underlying infrastructure to ensure new internal applications are secure and have logging, metrics and alerts enabled by default.
Agentic Tooling: Build internal AI plugins and automation scripts to streamline developer workflows and enhance operational efficiency.
Incident Response: Focus on subsequent data usage, incident management workflows, and creating necessary dashboards to maintain service health. Participate in a shared on-call rotation, leading rapid incident response and technical troubleshooting for production outages. Facilitate blameless post-mor
Company & Role Analysis
£45,000 – £60,000 (Glassdoor, Levels.fyi, 2025)
Similar roles
ABOUT POSTHOG We're shipping every product that companies need https://posthog.com/handbook/why-does-posthog-exist to run their business fr…
GitLab is the intelligent orchestration platform for DevSecOps. GitLab enables organizations to increase developer productivity, improve ope…
Asana’s rapid growth brings new challenges in keeping our systems fast, reliable, and resilient. As our product evolves, we’re making a majo…
Description Be an integral part of an agile team that's constantly pushing the envelope to enhance, build, and deliver top-notch reliability…
Salary: £40,000 - 50,000 per year Requirements: Strong experience in a Site Reliability Engineering, DevOps, or Platform Engineering role St…