
Site reliability engineering 101: Ensuring the reliability of your IT system

Aliasgar Muchhala
30th May 2024

In simple terms, reliability is the probability of success. In the application world, however, reliability is usually discussed in terms of availability and measured by how often failures occur.

Reliability is important because it can build or erode confidence in a product and in an organization’s brand reputation.

In the current IT landscape, which is complex, multi-layered, and cloud-based, the traditional approach to preventing system failures doesn’t quite work.  

With so many moving parts, there are bound to be disruptions that result in failures. This requires a change in mindset: expect failures, and build systems that are resilient to them. Site reliability engineering (SRE), also known as service reliability engineering, is the approach you need to anticipate and recover from failures.

SRE applies a software engineering mindset to system administration. As a software engineer, you look at the business requirements and develop a system aligned to those requirements. Likewise, a site reliability engineer needs to look at how each disruption can affect the business requirements and then find a solution accordingly.

An Agile-focused, product-driven approach and IT/OT integration have been key drivers for the growing demand for SRE today. 

SRE began at Google around 2003 as a method to ensure Google’s website remained “as available as possible.” The team responsible for site availability applied software engineering concepts to system administration methods, which later formed the basic tenets of SRE, as described in an online book published by Google.

As with most enterprise constructs, businesses don’t need to mimic the exact methods used by Google. While these practices need to be assessed in the context of the enterprise, there are certain basic tenets of SRE that must be followed:

  • Agree upon a set of service-level indicators (SLIs) and service-level objectives (SLOs) to understand the targets and measures
  • Accept failure as normal and manage an “error budget” that is used to strike a balance between system updates and system stability
  • Understand that site reliability engineers are part of neither the development team nor the operations team. SRE needs a separate central team that takes an end-to-end view across apps, infra, backend, frontend, middleware, etc.
  • Automate processes. A key objective of SRE is to “reduce toil.”
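
To make the first two tenets concrete, here is a minimal sketch of how an error budget could be derived from an SLO and a request-based SLI. The class name, field names, the 99.9% target, and the request counts are illustrative assumptions, not figures from Google or from any particular tool; real setups often measure over rolling time windows rather than a single batch of requests.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one SLO window, counted in requests."""
    slo_target: float     # assumed SLO, e.g. 0.999 means 99.9% should succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that did not meet the SLI

    @property
    def sli(self) -> float:
        """Measured SLI: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing has failed
        good = self.total_requests - self.failed_requests
        return good / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates in this window (the full budget)."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; a negative value means the SLO has been breached."""
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        """Simple policy: allow risky changes only while budget remains."""
        return self.remaining > 0


# Example: 1,000,000 requests against an assumed 99.9% SLO, 400 failures.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI: {budget.sli:.4%}")
print(f"Budget remaining: {budget.remaining:.0f} failed requests")
print(f"Release allowed: {budget.can_release()}")
```

The design choice here is the balance described above: while budget remains, the team can keep shipping updates; once it is spent, the same number argues for prioritizing stability work.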

Does this sound familiar? A bit like DevOps, perhaps? Then click here to read our next post on how SRE is different from DevOps.

Author

Aliasgar Muchhala

India Head of Group Portfolio and India Alliances Lead, Capgemini