KeyStep

Senior Software Engineer, Reliability

Klaviyo

Dublin, Ireland

about 6 hours ago

Full-time, Engineering

Skills & Technologies

Python, Go, Software Engineering, Software Development, Quantitative, SRE, Scalability, Load Testing, Kubernetes, Terraform, Cloud, Klaviyo, Deployment, Make, Automation, Capacity Planning

Job Description

At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying. Want to learn more about life at Klaviyo? Visit klaviyo.com/careers to see how we empower creators to own their own destiny.

Senior Software Engineer, Reliability (Dublin)

Team Overview

As a Senior Software Engineer, Reliability, you’ll ensure Klaviyo’s critical platforms are reliable, scalable, and sustainable while enabling rapid product development. We treat reliability as a core product feature and use software engineering to solve complex systems and operational challenges.

Our work spans security, infrastructure, and software development, requiring us to understand systems and engineering. We build complex, foundational solutions that must be extremely reliable, secure, and performant at global scale.

Our charter is to build and operate foundational services and infrastructure, define clear reliability objectives, reduce operational toil through automation, and continuously improve systems based on real production learnings. The work is highly visible and directly impacts how Klaviyos build software and how customers experience Klaviyo every day.

How You’ll Make an Impact

As a Senior Software Engineer, Reliability, you will build and operate the platforms, systems, and services that underpin Klaviyo’s reliability and operational excellence. You will:

Build and operate foundational, security-critical services with a strong emphasis on availability, scalability, latency, and fault tolerance

Apply software engineering principles to automate infrastructure, reduce operational toil, and improve system reliability at scale

Design, implement, and evolve systems using SRE best practices

Define and refine SLIs, SLOs, and error budgets to guide engineering decisions

Improve observability, alerting, and incident response to reduce mean time to detection and recovery

Participate in on-call rotations with a focus on sustainable operations and automated remediation

Perform quantitative analysis to understand system behavior, capacity constraints, and scaling limits

Identify systemic risks and reliability bottlenecks and drive long-term, preventative solutions

Collaborate closely with product, platform, and security engineers to influence architecture early and ship reliable systems

Mentor and pair with other engineers, helping raise the bar for reliability, operational maturity, and engineering excellence

Who You Are

You are a cloud-native, platform-focused SRE who uses software to build and operate reliable production systems at scale.

You write and maintain production-quality code (e.g. Python, Go, or similar) to build internal platforms, automate operations, and improve system reliability

You have built, deployed, and operated distributed, cloud-native systems and understand failure modes such as partial outages, dependency failures, resource saturation, and cascading impact

You have experience operating containerized workloads and platforms (e.g. Kubernetes) in production, including deployment strategies, scaling behavior, and service networking

You are comfortable participating in on-call rotations and diagnosing production issues

You have designed and operated observability systems and know how to build actionable alerts that reflect real user and service impact

You apply SRE concepts such as SLIs, SLOs, error budgets, and burn-rate–based alerting to guide engineering decisions and operational response

You have hands-on experience with infrastructure as code and declarative configuration (e.g. Terraform, Kubernetes manifests, policy-as-code)

You have performed capacity planning, load testing, and perfo

Company & Role Analysis

Likely perks

Private Medical, Pension, 25+ Days Holiday, Stock Options, Learning Budget, Flexible Hours

Market salary range

£45,000 – £60,000 (Glassdoor, Levels.fyi, 2025)
