Phil started his career as an Air Force Officer in the 1980s, and has since worked with private- and public-sector organizations, as well as not-for-profits, in program and project management and DevOps. He's a digital imaging specialist at the Smithsonian, enjoys live music, and is a theatre, film, cycling, hiking, and swimming enthusiast.

DevOps seeks to deliver consistency both in the pace of value delivery (development) and in system uptime (operations). Both are vital to the success of a product, and it becomes painfully obvious when the two are out of sync.

Because of this, organizations have attempted to close the gap between the development and operations teams to avoid any holes that would be detrimental to the product.

In this article, you will learn what site reliability engineering is, SRE principles and best practices, and the relationship between SRE and standard DevOps.

What is site reliability engineering (SRE)?

Site reliability engineering (SRE) is a software management approach that seeks to bridge the gap between development and operations teams. SRE combines software engineering with operations principles to ensure that systems are scalable, reliable, and performing at a high level.

SRE teams lean on software, as opposed to individuals, to manage systems and automate operational processes. This helps to transition tasks that have traditionally been performed by operations teams over to automated software systems. By doing so, you can improve efficiency and reduce the potential for human error.

The origins of SRE

As we know it today, SRE started in 2003 at Google. Up until then, there was a rigid divide between the responsibilities of software engineers and system administrators — what we now refer to as the development and operations domains, respectively.

At the time, companies relied mostly on physical infrastructure. System administrators (sysadmins) were responsible for deploying code to servers and keeping infrastructure components in good working order.

Due to growth in the number of applications and traffic passing through the infrastructure, there was an increase in the number of operations staff needed to keep everything running. Meanwhile, developers still needed to push new features to production, and because of the conflict inherent in these competing development and operations business objectives, the seeds of the DevOps movement were planted.

To solve this, Google decided to let technical staff experiment with different ways of working. It began to treat traditional operational (sysadmin) concerns as software problems. As they built out their capabilities via experimentation, the early SRE teams at Google found more and more ways to solve complex software problems.

Eventually, those principles and practices, and even the SRE job function itself, found their way to other organizations.

SRE principles and practices

By now, many organizations have experimented with SRE principles and practices. While it’s certainly true that how SRE looks can vary considerably from one organization to another, there are certain patterns that are common.

Principles

When it comes to SRE principles, it’s not a simple matter of looking in a single place for a set of standards. Still, there are some common themes that emerge related to SRE principles, such as the following:

Least privilege

In product management, systems and services must be secure. One way to ensure that you protect your customers is through the concept of least privilege. In this concept, you only grant the minimum amount of access necessary to use your systems and networks. Related principles include:

Zero trust networking: The basic idea behind zero trust networking is that if a person plugs into a network port in a company conference room, their level of access should be no greater than anybody who could potentially connect to a company resource from anywhere on the internet

Zero touch: With zero touch, you rely on automation and multi-party approval to remove direct human access to production roles, in favor of indirect production access
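
To make the zero touch idea more concrete, here is a minimal sketch of multi-party approval in Python. The `ChangeRequest` class, the `approve` and `can_execute` functions, and the default of two approvals are all hypothetical names and values chosen for illustration; a real system would sit behind your deployment automation and identity provider.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A hypothetical request to run an action against production."""
    requester: str
    action: str
    approvers: set[str] = field(default_factory=set)


def approve(request: ChangeRequest, approver: str) -> None:
    """Record an approval; requesters may not approve their own change."""
    if approver == request.requester:
        raise PermissionError("requester cannot self-approve")
    request.approvers.add(approver)


def can_execute(request: ChangeRequest, required_approvals: int = 2) -> bool:
    """Automation runs the change only once enough distinct people approve."""
    return len(request.approvers) >= required_approvals
```

In this sketch nobody touches production directly: a person files a request, other people approve it, and only the automation acts once `can_execute` returns `True`.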

Consistent levels of service

Metrics and measures are also a key component of SRE, and three areas in particular stand out with respect to service levels:

Indicators

Service level indicators (SLIs) help you gauge whether your product delivers consistent service. Some of the most common SLIs are latency (how long it takes to get a response back from a system request), error rate (the fraction of all requests that return an incorrect response), system throughput (the pace at which requests and responses reach their intended target), and availability (how much of the time the system or service is actually usable).
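
As a rough illustration, the four SLIs above could be computed from a window of request records along these lines. The `Request` record, the `compute_slis` function, and the idea of passing downtime in as a number are assumptions made for this sketch; in practice these values usually come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def compute_slis(
    requests: list[Request],
    window_seconds: float,
    downtime_seconds: float = 0.0,
) -> dict[str, float]:
    """Summarize a measurement window into four common SLIs."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        # Latency: mean time to get a response back.
        "avg_latency_ms": sum(r.latency_ms for r in requests) / total,
        # Error rate: fraction of requests with an incorrect response.
        "error_rate": errors / total,
        # Throughput: requests handled per second in the window.
        "throughput_rps": total / window_seconds,
        # Availability: share of the window in which the service was usable.
        "availability": 1 - downtime_seconds / window_seconds,
    }
```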

Objectives

In SRE, we use service level objectives (SLOs) to specify an ideal value or range of values for how well a service or system should perform.

If you consider latency, for instance, there might be an SLO that specifies what the average latency should be for a particular operation, such as how long it should take to return a search result. The relationship between SLOs and SLIs is similar to the relationship between an objective and its key results (KRs) or key performance indicators (KPIs): the SLO articulates what you seek to achieve, and the SLI is the measure or metric that tells you whether you have been successful.
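
A small sketch of the latency example: an SLO expressed as a target for average latency, checked against observed measurements. The `LatencySLO` class, the `meets_slo` function, and the 200 ms target in the usage note are hypothetical and only meant to show the shape of the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Target for the average latency of one operation (hypothetical)."""
    operation: str
    max_avg_latency_ms: float


def meets_slo(slo: LatencySLO, observed_latencies_ms: list[float]) -> bool:
    """Compare the measured average latency (the SLI) against the target."""
    if not observed_latencies_ms:
        raise ValueError("no measurements to evaluate")
    average = sum(observed_latencies_ms) / len(observed_latencies_ms)
    return average <= slo.max_avg_latency_ms
```

For example, with `LatencySLO("search", 200.0)`, measurements of 150, 180, and 210 ms average 180 ms and meet the objective, while 250, 300, and 200 ms average 250 ms and miss it.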

Agreements

Service level agreements (SLAs) spell out what happens if you do not meet the agreed level of service. When SLAs exist with external parties, they tend to be contractual, where failing to achieve what’s stipulated in the SLA might result in a rebate to the third party or a penalty for the provider. Internal SLAs can take various forms, and often inform staffing for on-call rotations.

Operational efficiency

In SRE circles, the term operational efficiency has to do with “reducing toil,” where “toil” is work that may have one or more of the following attributes:

Manual: Invoking a script

Repetitive: Addressing particular types of user problems over and over

Tactical: Responding to pager alerts

Ephemeral: Digging into legacy code without finding a root cause and effecting no change in the system or service

Observability

When a system or service is observable, it means you can make inferences about its internal state, based on what you discern from external outputs. Monitoring is an enabler of observability, where monitoring gives you capabilities to observe how a system or service performs over time.

Three standard components of a monitoring solution include metrics, logs, and traces, where:

Metrics tell you the “what”

Logs tell you the “why”

Traces tell you the “where”

An observable system or service is one where you can leverage the data and insights that monitoring produces to gain a holistic understanding of the overall health of the larger solution. By doing so, monitoring provides the instrumentation that enables effective decision-making across your portfolio of systems and services.
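
The following sketch shows, under simplifying assumptions, how a single request handler might emit all three signals using only the Python standard library. The `handle_request` function, the in-memory `metrics` counter, and the `spans` list are stand-ins invented for this example; a production setup would send these signals to dedicated metrics, logging, and tracing backends.

```python
import logging
import time
import uuid
from collections import Counter

metrics = Counter()  # metrics: the "what" (how many requests succeeded or failed)
spans = []           # traces: the "where" (which step ran, and for how long)
logger = logging.getLogger("checkout")  # logs: the "why"


def handle_request(fail: bool = False) -> str:
    """Handle one request and record a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    try:
        if fail:
            raise RuntimeError("payment backend timed out")
        metrics["requests_ok"] += 1
    except RuntimeError as exc:
        metrics["requests_failed"] += 1
        logger.error("request failed trace_id=%s reason=%s", trace_id, exc)
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": "handle_request",
            "duration_s": time.perf_counter() - start,
        })
    return trace_id
```

Because the same `trace_id` appears in the log line and the span, you can move from a rising failure count to the explanation and the location of the problem.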

Practices

Similar to how there are too many SRE principles to cover, there are even more SRE practices. Below is a representative sample:

Alerting

Alerting practices are closely related to the observability principle. SRE faces the challenge that:

A large number of components are a part of complex systems

The cognitive load (maintenance burden) needs to remain low for those responding to system alerts

It’s important when configuring alerting at scale to prevent alerts for single-machine failures. Instead of focusing on single components, focus on what the aggregation of signals tells you once outliers have been pruned away. You need to tune your monitoring systems and the accompanying alerts based on SLOs, while also retaining the ability to inspect specific components where there is a need to do so.
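
One way to picture this is an alert rule that looks at the aggregate error rate across machines after discarding the most extreme values. The sketch below uses a simple trimmed mean for the pruning step; that choice, the `should_alert` function, and its `trim` parameter are assumptions for illustration rather than a prescribed method.

```python
def should_alert(
    error_rates_by_machine: dict[str, float],
    slo_error_rate: float,
    trim: int = 1,
) -> bool:
    """Alert on the fleet-wide signal instead of on any single machine."""
    if not error_rates_by_machine:
        raise ValueError("no machines reporting")
    rates = sorted(error_rates_by_machine.values())
    # Prune the `trim` lowest and highest values when enough machines report.
    if len(rates) > 2 * trim:
        rates = rates[trim:len(rates) - trim]
    aggregate = sum(rates) / len(rates)
    return aggregate > slo_error_rate
```

With a 5% error-rate objective, one machine failing 90% of its requests among three healthy ones does not page anyone, while a fleet-wide error rate of 20% does. The per-machine values are still available in `error_rates_by_machine` when you need to inspect a specific component.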

On-call

In the SRE domain, there’s a need to respond to situations that may require attention, and in some cases, immediate remediation. When setting up on-call rotations, you need to consider areas such as the following:

Quantity: Limiting the amount of time any one SRE needs to be on-call

Quality: Allowing sufficient time for staff to handle incidents and any follow-up items from those incidents

Compensation: Finding financial and other ways to recognize staff for working off-hours

Safety: Creating a blame-free environment for staff to work in

Incident response

There’s a big difference between managed and unmanaged incidents. While it’s not possible to plan for every scenario in advance, there are many proactive steps you can take, such as:

Preparation: Having well-documented incident management procedures in place

Alignment: Ensuring there’s clarity during every incident on what needs to happen next, who’s responsible for what, and what needs to be captured for incident follow-up

Trust: Making it clear to all responders that they have the autonomy they need to investigate and close the incident

Changing course: Recognizing when the base set of assumptions may have been off the mark and adjusting tactics accordingly as new information emerges

Load balancing

When load balancing, you should carefully consider each of the following steps based on your organization’s specific needs and context:

Network edge (DNS): Updating your list of known DNS resolvers based on analysis of traffic changes and estimates of the geographic distribution of the users behind each tracked resolver

Network edge (virtual IP address): Leveraging load balancer capabilities to receive and forward packets behind a virtual IP for further processing

Data center (flow control): Putting in place policies that recognize when to stop sending requests to a particular backend resource

Data center (connection pool): Applying a set of techniques by subsetting to limit the pool of backend tasks with which any given client request can interact
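
To illustrate the subsetting idea from the connection pool item, here is a minimal sketch in which each client deterministically picks a small subset of backends to connect to. Seeding a shuffle with the client ID is a simplification chosen for this example, and the `pick_subset` function and its parameters are hypothetical; production systems typically use more careful algorithms that also balance load evenly across backends.

```python
import random


def pick_subset(backends: list[str], client_id: str, subset_size: int) -> list[str]:
    """Return the backends a given client is allowed to connect to."""
    if subset_size > len(backends):
        raise ValueError("subset cannot be larger than the backend pool")
    # Sort first so the result does not depend on the input order.
    candidates = sorted(backends)
    # Seeding with the client ID gives each client a stable, repeatable shuffle.
    rng = random.Random(client_id)
    rng.shuffle(candidates)
    return candidates[:subset_size]
```

Each client keeps connections open to only `subset_size` backends instead of the whole pool, which limits the number of connections every backend task has to maintain.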

Continuous learning culture

By now, it’s a well-established practice that continuous learning and improvement are vital to long-term success. Within an SRE context, incident post-mortems (which are conceptually similar to retrospectives) often have the following characteristics:

Established ground rules: Having a shared understanding of what types of incidents may require a post-mortem and what happens during such conversations

Identifying root causes: Ensuring that root cause analysis (RCA) identifies what happened during the incident, why it happened, and what follow-up steps may be needed to prevent a recurrence

Blameless culture: Recognizing that the purpose of each post-mortem is continuous learning and improvement, without pointing fingers or singling people out

Recognition: Finding ways to express appreciation to people for going the extra mile or simply being great team players

Encouraging feedback: Making it clear to participants that feedback on the post-mortem process itself is encouraged and will be acted upon

SRE and DevOps: Similarities and differences

At the start of this article, we introduced SRE by referencing ideas that are central to DevOps. It’s easy to get confused when it comes to what constitutes DevOps versus what constitutes SRE, so let’s take a look at the similarities and differences.

Similarities

Willingness to change processes: Practitioners of DevOps and SRE recognize the complexity of their operational environment and the need to adapt their approach based on emerging information

Mostly automated software change management: While it is difficult to avoid some manual intervention, both SRE and DevOps practitioners favor smaller changes over large ones

Tooling is important, but it’s not an end in itself: It’s important to have tooling in place that minimizes toil, enables observability, and streamlines code deployments; however, organizational culture and team dynamics are just as important

Team-based collaboration: A shared ownership model featuring intra-team and cross-team collaboration is necessary for long-term success

Data-driven decision-making: In SRE, SLOs (and the related metrics and agreements) drive what happens to maintain and improve services and systems, and in DevOps, performance against DevOps Research and Assessment (DORA) and similar metrics is central to what actions teams choose to take

Continuous improvement: It’s important to surface areas where improvement may be needed

Differences

Management of code artifacts: In the DevOps domain, teams create code and other technical artifacts as part of product definition and delivery, while SRE focuses on capabilities such as robust code delivery pipelines and making sure the complete infrastructure ecosystem is functioning as expected

Nature of day-to-day interactions: Feature development teams have a mixture of skill sets, such as product management, process management and facilitation, software engineering, quality assurance, data analytics, data science, and user experience. In the SRE space, the scope of interactions is typically narrower, where technical practitioners usually work in areas such as security, cloud, and networking, with the exception of on-call and incident management, where a broader cross-section of representation is common

Success measures: Arguably the most obvious difference between DevOps and SRE comes down to what a “good day” looks like. In the former case, it’s common to celebrate completion of user stories, fixing bugs, and delivery of business value to customers; in the latter case, it looks more like building out and supporting capabilities and making improvements that support SLOs

Conclusion

As with any set of principles and practices, it’s necessary for each organization to run small experiments and see what works best for their context. You can learn a great deal from the application of SRE principles (such as least privilege, consistent levels of service, operational efficiency, and observability) and practices (alerting, on-call, incident response, load balancing, and continuous learning culture).

To dive deeper into the topics we’ve touched on here, feel free to browse the set of SRE books that Google has made available for free online reading.
