2023-07-12


Praveenkumar Revankar


Jul 12, 2023 ⋅ 16 min read

An introduction to site reliability engineering (SRE)

Praveenkumar Revankar I head a product engineering group. Passionate about building SaaS products, I'm a coder by heart, a tech enthusiast, and I love helping startups establish and scale.

Imagine you’re a product manager for an e-commerce business. The business has one million orders per day and the platform usually handles this fine.


The entire system has multiple services and platforms, developed and integrated to work together in serving customers from booking orders to fulfilling them. At a high level, the system architecture would look like:

System Architecture

Now imagine your customer adds their items to the cart and wants to check out when this happens:

Check out

The order service is unavailable, which prevents your customer from placing their order. In this case, the business loses revenue and you weaken your customers’ trust.

As a PM, you cannot afford to have any service or component in your system fail.

By now, you may wonder:

Why does any service go down?
What are the factors that are affecting the system and its outage?
Will your customer’s data be safe after the outage and recovery of the service?
How do you know whether the service has gone down and fix it immediately?
How can you still be in business when any of the services are down?
How can you make your system (site) reliable 99.9 percent of the time?

To answer these, you can utilize site reliability engineering.

In this article, you will learn what site reliability engineering is, its key terms and core principles, and how SRE will assist you in your role as a PM.

What is site reliability engineering (SRE)?

Site reliability engineering (SRE) is a discipline used for solving the challenges faced in running large-scale, highly distributed systems. SRE applies software engineering practices to operations and infrastructure, aiming to improve the system’s reliability and reduce the time needed in detecting and recovering from outages.

SRE incorporates aspects of software engineering and applies them to IT operations. At its core, SRE focuses mainly on the following activities:

Monitoring systems to collect data on performance, availability, and user experience
Reducing the latency for users in accessing systems
Planning capacity to meet future growth and demand
Incident management to efficiently respond to emergencies and resolve issues quickly
Root cause analysis of incidents to identify the cause of failures and discover improvement areas
Change management to ensure changes to systems are done safely and reliably
Automating repetitive tasks and writing tools to streamline operations, reduce manual work, and improve efficiency

The history of SRE

Site reliability engineering can be traced back to the early 2000s, when Google was experiencing rapid growth and faced system outages and performance issues due to its growing customer base and usage.

To solve the reliability problems that Google sites were facing, Google formed a team of engineers led by Ben Treynor. This team was responsible for making Google websites available, performant, and reliable.

They found that the development team (responsible for releasing new features to the market) and the operations team (responsible for maintaining the systems and their operations) were working in isolation with different goals. This resulted in delays in the release of features.

As a software engineer, Treynor decided to apply engineering practices in operations to solve reliability issues. He coined the term site reliability engineering. He automated repetitive tasks in operations, developed tools for monitoring, and automated processes for change management, incident analysis, and resolution.


The SRE team quickly made a significant impact on Google’s reliability by solving site outages and performance problems. As a result, Google was able to improve its customer experience and reduce its operating costs.

The success of the SRE team at Google led other companies to adopt SRE. Today, SRE is a well-established discipline that companies of all sizes use. Below are some major events in the development of SRE:

SRE timeline

Key terms for site reliability engineering

Before proceeding further, let’s cover some key terms related to site reliability. These will help you learn the concepts of SRE:

Service level agreement (SLA)
Service level indicator (SLI)
Service level objective (SLO)
Mean time between failures (MTBF)
Mean time to repair (MTTR)
Availability
Observability
Response time
Latency
Error rate
Error budgets
Saturation

Service level agreement (SLA)

Service level agreement (SLA) is a contract agreed on between the customer and the business (service provider). An SLA is a document that records and dictates all the service levels that a business is offering with different metrics that those services should meet for a customer to be satisfied.

The SLA also records the implications or consequences the business should bear when a service level breach occurs. The consequence might be penalties a business should pay or the termination of a contract.

For example, say a business and a customer agree on a support SLA. The SLA states that any issue raised or support requested by the customer will be resolved within 24 hours. If the support team fails to respond or provide a resolution within 24 hours, it is considered a breach of the SLA and the business faces the agreed consequences.

A typical SLA would have the following definitions:

Availability and response time
Performance metrics
Incident response and resolution
Escalation and communication
Reporting and documentation
Remediation and compensation

Service level indicator (SLI)

A service level indicator (SLI) is a metric used to measure the performance of the system or service. These are specific quantifiable values of a system that help engineers to understand the behavior and performance of the system or service.

By monitoring SLIs, engineering teams can gain insights into the health of the system. These indicators help engineers make data-driven decisions, identify areas of improvement, and take preventive measures.

As an example, assume that a service level objective is set to meet an accuracy of 95 percent in generated monthly reports. In this case, the SLI is the actual accuracy attained in the report. The actual accuracy of the data may be 99 percent, 100 percent, or even 94 percent, in which case the objective is missed.
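The relationship between the measured SLI and the target SLO can be sketched in a few lines. This is only an illustration: the function name `meets_slo` is an assumption, and the values are the ones from the report-accuracy example.

```python
def meets_slo(sli: float, slo: float) -> bool:
    """Return True when the measured indicator reaches the objective."""
    return sli >= slo


# SLO from the example: 95 percent accuracy in monthly reports.
for measured_accuracy in (99, 100, 94):
    status = "met" if meets_slo(measured_accuracy, slo=95) else "missed"
    print(f"SLI {measured_accuracy}% -> SLO {status}")
```

Only the 94 percent measurement falls below the 95 percent objective.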

Service level objective (SLO)

Service level objectives are specific targets or thresholds set for service level indicators (SLIs) that define the desired level of service quality, reliability, and performance for a system or service. There can be one or more SLOs in an SLA based on the agreement between the service provider and the customer.

Imagine the SLO for a website is to be available 99.9 percent of the time. Over a 30-day month, that allows roughly 43 minutes of total downtime.
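To see what an availability target means in practice, the sketch below converts an SLO percentage into the downtime it permits. The function name `allowed_downtime_minutes` and the 30-day month are assumptions made for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO permits in a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
```

Under these assumptions, a 99.9 percent target leaves about 43 minutes of downtime per month, so each extra "nine" shrinks the allowance by a factor of ten.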

Mean time between failures (MTBF)

The mean time between failures is the average time between two incidents. The MTBF is always calculated for a certain period such as 24 hours, six months, etc. It is an essential metric that the SRE team tracks to understand how reliable their systems are.

The MTBF can be calculated by dividing the number of hours the system was operating by the number of failures that occurred in that period.

For example, let’s calculate the MTBF of a site for the last 24 hours. Imagine the system had four failures and it took 15 minutes every time to fix it and bring the site back up.

To calculate the MTBF, consider 23 hours of operation time with 1 hour of downtime, due to 4 failures. The MTBF will be:

23/4 = 5.75 hours

This implies, on average, the system is up for 5.75 hours before the next failure. Knowing this metric, the SRE team will work towards increasing the MTBF to make the system more reliable.
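The same arithmetic can be written as a small helper. This is a minimal sketch using the numbers from the example; the function name `mtbf_hours` is an assumption.

```python
def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean time between failures: operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return operating_hours / failures


period_hours = 24
failures = 4
downtime_hours = failures * 15 / 60  # four repairs of 15 minutes each
print(mtbf_hours(period_hours - downtime_hours, failures))  # 5.75
```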

Mean time to repair (MTTR)

The mean time to repair is a metric that measures the average time it takes to repair or recover from a failure or incident. It measures how long it takes to repair a system or component after it has failed.

The MTTR helps assess the efficiency and effectiveness of incident response and recovery processes. A lower MTTR indicates a more efficient repair process. A higher MTTR indicates a less efficient repair process.

MTTR can be calculated by dividing the total number of hours spent repairing a system by the number of repairs.

Let’s say a system was down for repair 15 times in 3 months and the total hours it took to restore a system was 30 hours. This means customers faced a downtime of 30 hours in the last 3 months.

The MTTR can be calculated by:

30 / 15 = 2 hours

This implies, on average, the SRE team’s repair process takes 2 hours to get the system up whenever a failure occurs.
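In practice, repair times usually come from incident records rather than a single total. The sketch below assumes a hypothetical log of (failure detected, service restored) timestamps; the function name `mttr` and the two sample incidents are illustrative and are not the 15 incidents from the example above.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time between a failure being detected and service being restored."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2023, 7, 1, 9, 0), datetime(2023, 7, 1, 10, 30)),   # 1.5 hours
    (datetime(2023, 7, 8, 14, 0), datetime(2023, 7, 8, 16, 30)),  # 2.5 hours
]
print(mttr(incidents))  # 2:00:00
```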

Availability

SRE defines availability as whether a system can fulfill its intended function at any given point in time. It is a measure of how long the system or service is available for users. Availability usually refers to the uptime a site is running.

The availability of a system can be measured in two different ways:

Time-based availability — The share of time the system is up. Used to calculate the availability of a single service. To calculate:
Time-based availability = uptime / total time (where total time is uptime + downtime of the system in a given period)
Aggregate availability — When a system consists of multiple services and sites, a single uptime figure is hard to define and not very insightful. The share of requests served successfully across all of them gives a better picture of how available the system is. To calculate:
Aggregate availability = number of successful requests / total number of requests
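Both formulas translate directly into code. This is a minimal sketch: the function names are assumptions, the 23 hours of uptime come from the MTBF example, and the request counts are hypothetical.

```python
def time_based_availability(uptime: float, downtime: float) -> float:
    """Share of the period during which the system was up."""
    return uptime / (uptime + downtime)


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Share of requests that were served successfully."""
    if total_requests == 0:
        raise ValueError("no requests were observed in the period")
    return successful_requests / total_requests


print(f"{time_based_availability(uptime=23, downtime=1):.2%}")  # 95.83%
print(f"{aggregate_availability(999_000, 1_000_000):.2%}")      # 99.90%
```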

Observability

Observability is the ability to gain insight into the internal state and behavior of a system based on its external outputs and interactions. In SRE, it plays a crucial role in understanding, diagnosing, and troubleshooting complex systems.

SRE teams collect and analyze metrics, logs, and traces to gain a deeper understanding of how their systems are performing. This enables engineers to identify and fix problems before they cause any failures.

Observability has three main pillars:

Metrics — Measure system health, such as CPU usage, memory usage, and response time
Logs — Record system events, such as errors and requests
Traces — Record how requests flow through a system, including the time it takes for each request to complete

Response time

Response time is the time it takes for an engineer to respond to an alert or incident. This is a critical metric as it has a significant impact on the availability and performance of systems.

SRE teams can improve response time by investing efforts in diagnosing delays or inefficiencies in responding to incidents, automating almost everything possible, providing training and education, and creating a feedback loop.

Latency

Latency is the time it takes a system or service to serve a user request. It represents the delay before the system responds to a user’s query. SRE teams should continuously measure how long a system or service takes to serve requests.

SRE teams calculate the latency of successful and failed requests separately rather than as a whole, because a failed request may return much faster than a successful one and distort the overall figure.
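The effect of mixing the two groups is easy to demonstrate. The sketch below assumes a hypothetical request log of (HTTP status code, latency in milliseconds) pairs and treats 5xx responses as failures; both choices are illustrative.

```python
from statistics import mean

# Hypothetical request log: (HTTP status code, latency in milliseconds).
requests = [(200, 120), (200, 180), (500, 15), (200, 150), (503, 25)]


def split_latency(samples):
    """Return the mean latency of successful and failed requests separately."""
    ok = [ms for status, ms in samples if status < 500]
    failed = [ms for status, ms in samples if status >= 500]
    return (mean(ok) if ok else None, mean(failed) if failed else None)


ok_ms, failed_ms = split_latency(requests)
overall_ms = mean(ms for _, ms in requests)
print(f"successful: {ok_ms:.0f} ms, failed: {failed_ms:.0f} ms, combined: {overall_ms:.0f} ms")
```

In this sample the fast failures pull the combined average down to 98 ms, well below the 150 ms that successful requests actually take.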

Error rate

Error rate is the rate of requests that fail in a given period of time, whether explicitly (for example, server or network errors), implicitly (for example, code errors that return a wrong result), or by policy (a response that breaches an SLO agreed in the SLA). SRE teams monitor each type separately and work towards reducing them.

Error budgets

An error budget represents the amount of errors or downtime that a system or service is allowed to experience within a defined time period, typically expressed as a percentage.

An error budget is calculated by subtracting the system’s SLO from 100 percent. For example, if a system has an SLO of 95 percent, its error budget is 5 percent. This means that up to 5 percent of requests can fail, or the system can be down 5 percent of the time, before it exceeds what customers have agreed to tolerate.

Error budgets help SRE teams decide on prioritizing resources between new feature delivery and reliability improvements.
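A minimal sketch of how a team might track its budget, assuming the budget is counted in failed requests. The function names and the count of 20,000 failures are hypothetical; the one million requests echo the daily order volume from the introduction.

```python
def error_budget(slo_percent: float) -> float:
    """Error budget in percent: whatever the SLO leaves over from 100."""
    return 100 - slo_percent


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Number of additional failures the budget still allows in this period."""
    allowed_failures = total_requests * error_budget(slo_percent) / 100
    return allowed_failures - failed_requests


print(error_budget(95))                         # 5
print(budget_remaining(95, 1_000_000, 20_000))  # 30000.0
```

A positive remainder suggests there is room to ship new features. A value at or below zero is a signal to prioritize reliability work.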

Saturation

Saturation refers to how much of a service’s capacity is in use, relative to the load or number of requests it can handle. SRE teams use saturation to understand how many more requests a service can take, answering questions such as whether it can respond the same way under 10 percent more load than usual.

A system can start to degrade in performance at even 50 percent of its capacity. SRE teams measure saturation to make efficient use of resources. When a system reaches saturation, it can experience increased latency, reduced performance, or increased errors.

Hence, it is critical for SRE teams to measure saturation and define the capacity the service can handle.
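The "10 percent more load" question can be sketched as a simple capacity check. The function names, the requests-per-second figures, and the 80 percent safety threshold are all assumptions for illustration; real limits depend on where a given service starts to degrade.

```python
def saturation(current_load: float, capacity: float) -> float:
    """Fraction of the service's capacity currently in use."""
    return current_load / capacity


def can_absorb(current_load: float, capacity: float,
               growth: float = 0.10, threshold: float = 0.80) -> bool:
    """Check whether the load plus expected growth stays under a safety threshold."""
    return saturation(current_load * (1 + growth), capacity) <= threshold


# Hypothetical service rated for 1,000 requests per second.
print(can_absorb(current_load=700, capacity=1000))  # True
print(can_absorb(current_load=750, capacity=1000))  # False
```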

Core principles of SRE

The main goal of building an SRE practice is customer satisfaction. Any organization that has or wants to build an SRE team should adhere to these six core principles:

Embracing and managing risk
Eliminating toil
Monitoring
Release engineering
Automation
Simplicity

Embracing and managing risk

This principle states two important factors to learn and implement:

Your customers understand that no system can be 100 percent reliable — As an organization, you have to embrace risk and accept that systems will fail within agreed limits. Embracing risk lets your engineering team treat failures as opportunities to increase reliability
Beyond a certain point, customers don’t notice extra reliability — Customers are unhappy only when your system’s unreliability causes them pain. Past that point, spending more on reliability adds no value for your customers, so it is important for an organization not to overspend on it

Instead of trying to eliminate risk entirely, teams should focus on identifying and managing risks. SRE teams should focus on potential failure points, developing mitigation strategies, and monitoring systems for any signs of trouble.

Eliminating toil

Think of a large system connected to hundreds of devices at a customer’s location. Every time a new device is purchased by a customer, a system engineer (operations) will have to visit the customer site, register the device to the system, and update its software.

In terms of scale, if there are 400 customers and every customer is located in a different location, you will have to hire operation resources as and when volume increases.

Toil is repetitive, manual work that doesn’t add any value. SRE teams should work to automate as much toil as possible, freeing up their time to focus on more strategic tasks. Eliminating repetitive work helps organizations scale with the same or lessened cost of maintaining a system or service.

By automating these repetitive tasks, SRE teams can reclaim a large share of their time for more strategic assignments. Google, for example, aims to keep toil below 50 percent of each SRE’s time.

Monitoring

Monitoring systems is the process of collecting and analyzing data about a system to identify and address potential problems. SRE teams use monitoring data to ensure that systems are reliable, scalable, and secure.

There are four key aspects of monitoring:

Choosing the right metrics — Not all metrics are created equal and not all metrics help you gain valuable insights into your system. It’s tempting to log everything a system or service produces, but too much data creates noise and can be misleading. SRE teams need to choose metrics that are relevant to the specific system they are monitoring
Collect data in real-time — Collect data in real-time to identify and address problems as soon as possible. This can be done using a variety of tools, such as monitoring agents, cloud-based monitoring services, and log analysis tools
Analyze data for trends — Identify potential problems before they become major outages. This can be done using a variety of tools, such as statistical analysis tools, machine learning tools, and anomaly detection tools
Take action when needed — Address potential problems. This might involve rolling back a change, restarting a service, or investigating the issue further
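As one illustration of analyzing data for trends, the sketch below flags a latency sample that sits far from recent history using a z-score. This is only one simple form of anomaly detection; the function name `is_anomalous`, the threshold of three standard deviations, and the sample values are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag a sample that lies more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical recent latency samples in milliseconds.
history = [120, 125, 118, 122, 121, 119, 123]
print(is_anomalous(history, 124))  # False
print(is_anomalous(history, 300))  # True
```

A flagged sample would then feed the last step in the list, such as paging an engineer or rolling back a recent change.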

Release engineering

Release engineering is the process of safely and reliably deploying changes to production systems. SRE teams work with development teams to develop and implement a release process that minimizes risk and disruption. Release engineering focuses on the management, coordination, and automation of the software release process.

A good release process would:

Adopt version control — Version control provides a history of changes, version tagging, and rollback points. Modern version control systems integrate easily with continuous release platforms, which helps SRE teams achieve smooth and stable releases to the live environment
Coordinate releases — SRE engineers coordinate with release management to ensure a smooth and stable release of the service without impacting any stakeholder
Take part in testing and validation — Work with test engineers to ensure complete regression coverage, through automated or manual testing, so that defined quality standards are met
Automate everything — Work with DevOps to automate the entire release pipeline and streamline the complete release process. This reduces the potential human errors that could have been possible when done manually
Define and implement rollout strategies — Define rollout strategies and help minimize the impact of software releases on system reliability. SRE teams ensure their stakeholders have minimal impact
Introduce rollback and recovery in release — Having a rollback plan and being ready with the procedures before rollout is crucial for reducing impact. Plan and develop in case you need to roll back to a previous stable version
Monitor and support post release — After release, SRE teams monitor the system closely and detect any issues that can breach defined SLO. SRE teams provide complete support to customers and communicate proactively on issues that can happen or occurred due to release and state the resolution

By incorporating these release engineering principles into their overall SRE practices, teams can effectively manage software releases, minimize disruptions, and maintain system reliability during the deployment process.
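One possible rollout strategy, not named in the list above, is a canary release: the new version serves a small share of traffic and is promoted only if it behaves like the current one. The sketch below is a hedged illustration of the promote-or-rollback decision; the function name `canary_decision`, the tolerance, and the error rates are all hypothetical.

```python
def canary_decision(baseline_error_rate: float, canary_error_rate: float,
                    tolerance: float = 0.005) -> str:
    """Promote the canary only if its error rate stays within tolerance of the baseline."""
    if canary_error_rate <= baseline_error_rate + tolerance:
        return "promote"
    return "rollback"


# Hypothetical error rates observed during the canary window.
print(canary_decision(baseline_error_rate=0.002, canary_error_rate=0.003))  # promote
print(canary_decision(baseline_error_rate=0.002, canary_error_rate=0.010))  # rollback
```

Tying the decision to a measured error rate keeps the rollback step automatic instead of relying on someone noticing a problem after release.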

Automation

Implementing automation within your product team:

Reduces toil
Improves reliability
Speeds up the release process
Improves development velocity
Increases efficiency and reduces the errors humans make when repeating tasks by hand
Frees up engineers for more strategic work

SRE teams automate many common tasks using tools such as:

Incident response tools
Alerting tools
Monitoring tools
Release tools, such as continuous integration and continuous delivery (CI/CD) tools
Test automation tools
Log analyzers

SRE teams can use a variety of automation techniques, such as:

Scripting
Infrastructure as code
Declarative configuration management
Continuous testing
Chaos engineering

Automation is a continuous process. As systems evolve, SRE teams must continue to automate new tasks and improve existing automation.

Simplicity

Simple things are easy to understand, maintain, and improve. If a system has a complex architecture, you will always need its creators around to fix it when anything goes wrong.

Simplicity is one of the core principles of SRE. It is the idea that systems should be designed and built in a way that is easy to understand, maintain, and operate.

Keeping systems simple allows for:

Increased reliability — Simple systems are less likely to fail because they are easier to troubleshoot and fix
Reduced costs — Less maintenance and overhead operations saves money
Improved agility — Updates become easier, which can help businesses respond quickly to changes in the market
Less toil — Eliminate unnecessary services and processes around them

You can achieve simplicity in your systems by:

Developing a culture of simplicity within an organization — Create an environment where employees are committed to designing and building solutions in a simple way. Embrace the use of simple terms and simplicity of design
Using microservices architecture wherever possible — Break down the system into small, independent services, each responsible for a small, related set of actions. Microservices are easier to maintain and faster to adapt to change
Using automation wherever possible — Automate routine tasks to reduce the amount of manual work required to manage a system. Automating will free up engineers and help them focus on more strategic tasks
Using a design-first approach — Invest enough time in solutioning a system and do multiple reviews and refinements before starting implementation
Getting feedback from users — During the process of development consult and review with various stakeholders to make the system as simple and adaptable as possible. After delivery, learn from the people who are using the system about the usage and value it is adding to them

Site reliability engineering roles

Organizations can define various roles under their SRE practice group according to the needs of their environment, but you will find these common roles in any organization:

SRE engineer — Every SRE team will have one or more SRE engineers. SRE engineers are responsible for the day-to-day operations of the SRE team such as monitoring, incident management and response, and automation
SRE manager — SRE managers oversee the SRE team and are responsible for setting goals, developing processes, and ensuring that the team is meeting its objectives
SRE architect — SRE architects design and implement new systems and processes for the SRE team. They are also responsible for ensuring that the SRE team’s work is aligned with the overall goals of the organization
SRE developer — SRE developers write code to automate tasks, improve reliability, and add new features to the SRE team’s systems
SRE tools engineer — SRE tools engineers develop and maintain the tools that the SRE team uses to do their work. In smaller organizations, this role can be performed by an SRE engineer

What is the difference between DevOps and SRE?

At the moment, there’s a debate to differentiate SRE and DevOps. Some say they’re the same thing with different names. Some say DevOps is a broader framework and SRE is part of it.

To understand the difference between the two, see the following:

Area	DevOps	SRE
Primary Focus	DevOps focuses on the collaboration and integration of development (Dev) and operations (Ops) teams to streamline software delivery and improve the overall software development lifecycle	SRE focuses on the reliability, availability, and performance of software systems and services. It aims to ensure that systems meet defined service level objectives (SLOs) while enabling innovation and scalability
Primary Work Area	DevOps work from product development to delivery. They streamline the entire operations of development, review, and release of software. They primarily work with development teams	SRE teams work on maintaining the reliability of systems that are live. They are involved with pre-delivery projects with software teams at a very minimal level to help software teams develop reliable systems. They primarily work with operation teams
Primary Responsibility	DevOps solve development problems, build and release problems, downtime, and deployment problems	SRE teams solve production failures, incident analysis and responses, production health, and overall reliability of systems
Cross-functional collaboration	DevOps is based on org size and can be a separate team or work in the development team itself. However, DevOps engineering collaborates with cross-functional teams to achieve smoother software delivery	SRE is a separate group of SRE engineers built in any org with the goal to achieve systems reliability, performance, and quality in production systems
Primary Goal	Develop and deliver new features to production faster in a smoother process	Ensure SLOs are met with customer agreements and no failures occur in systems running live
Tools Used	DevOps mostly uses tools like Jenkins, Jira, Git, CircleCI, etc. to achieve smooth continuous integration and release	SRE, on the other hand, uses tools like Prometheus, Grafana, Graylog, SkyWalking, etc. to observe, monitor, analyze, and fix reliability issues in systems

Benefits of SRE

There are many benefits to having a fully functional SRE team built in your organization. The main reason for building SRE is to bridge the gap and solve the disconnect between development and operation teams. Alongside this, SRE contributes to:

High customer satisfaction — SRE focuses on reliability, response times, and performance of systems, which in turn leads to high customer satisfaction
Increased business value — Highly available, scalable and reliable systems increase user satisfaction, improve customer retention, and generate more business value
Reduced costs — Automation, infrastructure resource optimization, and strong incident management help reduce the costs associated with IT operations
Efficient resource utilization — Identifying inefficiencies and over consumption helps make better use of infrastructure and avoid unnecessary costs
Improved system reliability — Implementing best practices in monitoring, incident management, error handling, security, and automation helps in reducing downtime, avoiding service failures, and enhancing overall system availability
Faster incident response and recovery — Implementing best practices in incident management, automating processes, and developing playbooks to share knowledge enables faster detection, diagnosis, and resolution of issues
Scalability and performance optimization — Planning capacity and optimizing resource utilization helps identify and solve potential scalability bottlenecks, increase performance and build resilient systems
Collaboration and alignment — Sharing reliability responsibilities with the development teams, DevOps teams, as well as coordinating with every stakeholder helps improve communication, cooperation, and alignment among teams
Continuous improvement and learning culture — SRE teams help embrace continuous improvement and develop a learning culture by adapting new technologies and practices

How should a product manager practice SRE?

Product managers spend their time focusing on keeping their customers content with new features, enhancing existing features, enriching user experience, and providing high-quality services to their customers.

You can leverage the expertise that SRE teams bring in to manage the products and services you offer to your customers.

As a PM you can practice SRE by:

Understanding the SRE principles — For PMs, SRE is a relatively new discipline and it’s important to spend some effort on learning its underlying principles. Understanding these principles will help you consider reliability in product requirements, treat systems as code, and encourage automation as the default mode of operation
Collaborate with SRE teams and take part in SRE activities — Engage SRE teams in the product development process to identify potential reliability challenges and ensure that non-functional requirements are adequately considered
Measure reliability — With all the other metrics that are tracked and measured, PMs should also measure the reliability of the product on a regular basis. Ensure that the SRE team is collecting metrics such as uptime, response time, and error rates and that all these meet the agreed SLOs
Define service level objectives (SLOs) — This is critical for any product’s success. Define SLOs that align with customer expectations and business objectives. Product managers should monitor these SLOs and regularly review the metrics with the SRE team to ensure the product is meeting the defined reliability goals
Prioritize reliability together with user experience — Invest time and resources in ensuring the product is designed, developed, and tested in a way that minimizes the risk of outages and performance problems
Ensure strong monitoring and feedback systems are in place — You should collaborate with SRE teams to establish effective monitoring and feedback systems for your product. This will help you to identify potential issues, prioritize improvements, and make data-driven decisions
Take part in post-mortem reviews — Join post-incident reviews to assess the impact on the customer, gain insights into product vulnerabilities, prioritize reliability improvements, and contribute to the overall reliability of the product
Foster a culture of reliability — Prioritize reliability and ensure the SRE team, along with the development, testing, and DevOps teams, builds reliability thinking into how they design and operate systems

Summary

SRE is a discipline practiced to solve reliability, performance, and quality issues that customers face on large-scale, distributed systems. It is critical to businesses that systems are always available to their customers and functioning.

SRE teams monitor production systems, automate repetitive tasks, analyze incidents, and ensure uptime, latency, performance, and reliability issues are solved. As a PM, one should work closely with the SRE team to define SLOs and SLIs to track and measure the reliability of the system.

Remember, SRE is a continuous process, and a PM’s ongoing involvement is very important in maintaining site reliability.

Featured image source: IconScout
