Site Reliability Engineer (AWS) at Spectrum IT Recruitment

Job Description

Site Reliability Engineer (SRE) | AWS | Kubernetes

Fully Remote (UK)

24/7 Shift Pattern (28-day rota including days & nights)

£ Competitive + Bonus + Excellent Benefits

Build resilient cloud platforms that support critical national services. We're recruiting Site Reliability Engineers to join a global leader in AI-powered customer experience and cloud technology. Following the award of a major government programme, they're expanding their engineering teams to build and support highly secure, cloud-native platforms that deliver sensitive communication services.

This is an opportunity to join an organisation investing heavily in modern cloud engineering, automation and reliability. Working as part of a collaborative SRE team, you'll help ensure large-scale production environments remain secure, available and resilient, whilst continuously improving the way they're operated through automation and engineering best practice.

If you enjoy solving production challenges, improving reliability and automating away operational toil, we'd love to hear from you.

What you'll be doing

Monitoring and maintaining highly available production platforms running in AWS

Responding to and managing production incidents across a 24/7 service

Investigating complex technical issues and restoring services quickly and effectively

Developing automation to reduce manual operational tasks and improve platform resilience

Building and improving monitoring, alerting and observability across cloud environments

Working alongside Software, Platform, Cloud and Security Engineers to improve reliability and operational excellence

Contributing to post-incident reviews and driving continuous service improvements

Supporting containerised workloads using Kubernetes and Docker

What we're looking for

You'll ideally have experience in a Site Reliability Engineering, Production Engineering, Cloud Operations or NOC environment, with exposure to:

Linux systems administration

AWS cloud infrastructure

Kubernetes and Docker

Production support and incident management

Python, Bash or Go scripting

Monitoring and observability platforms such as Grafana, Prometheus, Datadog, Splunk or CloudWatch

Networking fundamentals including DNS, TCP/IP and load balancing

A passion for automation, continuous improvement and operational excellence

Experience with Infrastructure as Code (Terraform), SRE principles (SLIs, SLOs), or regulated environments would be beneficial but isn't essential.

Why join?

This is far more than a traditional NOC role.

You'll be joining an engineering-led organisation where reliability, automation and continuous improvement sit at the heart of the platform. Rather than simply responding to incidents, you'll work to prevent them by improving systems, automating operational processes and helping shape the future of highly resilient cloud services.

If you're passionate about building reliable cloud platforms and enjoy solving complex technical problems in large-scale production environments, we'd love to hear from you.

Apply today or contact Dave Carlisle at Spectrum IT Recruitment for a confidential discussion.

Spectrum IT Recruitment (South) Limited is acting as an Employment Agency in relation to this vacancy.
