KeyStep

Site Reliability Engineer

Databricks
Costa Rica
1 day ago
full-time

Skills & Technologies

Software Engineering, Site Reliability, SRE, AWS, Azure, Terraform, GitHub Actions, CI/CD, Pulumi, Cloud, IT Infrastructure, Incident Management, GitHub, Databricks, Deployment, AI, Automation

Job Description

GAQ127R40

Team: IT Infrastructure and Operations

About the Role

At Databricks Information Technology, we are a product-led organization transforming how we work—from the ease of using our IT services to the applications we develop to scale seamlessly during rapid growth.

As a Site Reliability Engineer (SRE), you will bridge the gap between software engineering and systems architecture. You will be a core contributor to the IT Infrastructure team, owning the evolution of core infrastructure and observability platforms. This role requires a strong software engineering mindset and deep technical breadth to deliver high-quality, scalable solutions for "immature" system problems. Your focus will be on building resilient, automated infrastructure that empowers development teams and ensures our cloud environment is cost-optimized, secure, and highly available.

The Impact You Will Have

Architect and Automate: Design and deploy production-grade infrastructure on cloud platforms (AWS/Azure) using Infrastructure as Code (IaC) tools like Terraform or Pulumi.

Reliability and Performance Engineering: Optimize system performance, architecture, and scaling to ensure maximum uptime and minimal latency for critical IT services.

CI/CD Excellence: Architect robust deployment pipelines (e.g., GitHub Actions), managing both hosted and self-hosted runners for specialized build requirements.

Observable by Default: Create underlying infrastructure to ensure new internal applications are secure and have logging, metrics and alerts enabled by default.

Agentic Tooling: Build internal AI plugins and automation scripts to streamline developer workflows and enhance operational efficiency.

Incident Response: Focus on subsequent data usage, incident management workflows, and creating necessary dashboards to maintain service health. Participate in a shared on-call rotation, leading rapid incident response and technical troubleshooting for production outages. Facilitate blameless post-mor

