SRE

SREs play a crucial role in ensuring the reliability, scalability, and performance of our production systems.

You will work closely with development and operations teams to automate tasks, improve monitoring and observability, and drive continuous improvement in our infrastructure and processes.

Tasks
  • Automate repetitive tasks in the production environment using scripting languages like Python and configuration management tools to improve efficiency and reduce manual effort.

  • Develop and maintain monitoring and observability tools, integrating production applications with platforms like Splunk, ELK, AppDynamics, Evolven, or ITRS. Configure alerts and dashboards to proactively identify and address potential issues, ensuring comprehensive system visibility.

  • Conduct thorough root cause analysis of production incidents, identifying patterns and suggesting solutions for permanent or temporary fixes. Proactively identify potential issues and implement preventative measures.

  • Champion SRE best practices within the organization, advocating for improvements in monitoring, alerting, automation, and incident response processes.

  • Continuously learn and stay up to date with the latest technologies and trends in SRE.

An error budget is the amount of time (or fraction of requests) a system can be unavailable without violating the agreed SLA/SLO. If the SLO is 99.99%, the error budget is 0.01% of acceptable downtime; over a year, that works out to 52.56 minutes, the maximum allowable downtime within the SLA.
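As a quick sanity check of these numbers, the downtime budget can be computed directly (a minimal sketch; the function name is illustrative):

```python
def allowed_downtime_minutes(slo: float, period_days: float = 365) -> float:
    """Maximum downtime (in minutes) permitted by an availability SLO
    over the given period."""
    error_budget = 1.0 - slo          # e.g. 0.0001 for 99.99%
    return period_days * 24 * 60 * error_budget

# 99.99% availability over a 365-day year -> 52.56 minutes of budget
print(round(allowed_downtime_minutes(0.9999), 2))  # 52.56
```

The same function with `period_days=30` gives roughly the 4.3-minute monthly budget used later in these notes.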

Principles and Concepts



We identify the appropriate level of risk tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine how reliable a service needs to be: we strive to make a service reliable enough, but no more reliable than it needs to be, so that we can also clean up technical debt or reduce operational costs.

  • Measuring Service Risk: as standard practice, we identify an objective metric to represent the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time.


Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect revenue loss; brand or reputational impact; and undesirable press coverage.

The most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of "nines" we would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine corresponds to an order-of-magnitude improvement toward 100% availability. For serving systems, this metric is traditionally calculated based on the proportion of system uptime.

Measuring risk: https://sre.google/sre-book/embracing-risk/#risk-management_measuring-service-risk_time-availability-equation

For example, a system with an availability target of 99.99% can be down for up to 52.56 minutes in a year and stay within its availability target.

A system designed to serve 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.

In many cases, availability calculated as the request success rate over all requests is a reasonable approximation of unplanned downtime, as viewed from the end-user perspective.
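The request-based version of the same arithmetic, as a sketch (function names are illustrative):

```python
def request_error_budget(total_requests: int, slo: float) -> int:
    """Number of failed requests tolerable while still meeting a
    request-success-rate SLO."""
    return round(total_requests * (1.0 - slo))

def availability(successful: int, total: int) -> float:
    """Availability measured as the request success rate."""
    return successful / total

# 2.5M daily requests at a 99.99% target -> up to 250 errors
print(request_error_budget(2_500_000, 0.9999))  # 250
```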

For example, a batch process that extracts, transforms, and inserts the contents of one of our customer databases into a data warehouse to enable further analysis may be set to run periodically; for such a system, a success-rate metric over processed records is more meaningful than uptime.


Service level indicators (SLIs), objectives (SLOs), and agreements (SLAs) describe the metrics that matter, what values we want those metrics to have, and how we'll react if we can't provide the expected service.

SLI, SLO, SLA https://sre.google/sre-book/service-level-objectives/

Most services consider request latency (how long it takes to return a response to a request) a key SLI, along with error rate and system throughput. Another kind of SLI important to SREs is availability: the fraction of the time that a service is usable.

Objectives: an SLO is a service level objective, a target value or range of values for a service level that is measured by an SLI.

Latency: we may want Shakespeare search results returned "quickly," adopting an SLO that our average search request latency should be less than 100 milliseconds. Choosing an appropriate SLO is complex.

You can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment. (100 milliseconds is obviously an arbitrary value, but in general lower latency numbers are good. There are excellent reasons to believe that fast is better than slow, and that user-experienced latency above certain values actually drives people away; see "Speed Matters".)

This can be achieved using a data warehouse such as ClickHouse.

Higher QPS often leads to larger latencies, and it's common for services to have a performance cliff beyond some load threshold; this is where rate limiting comes into the picture.

Choosing and publishing SLOs to users sets expectations about how a service will perform.

Agreements: finally, SLAs are service level agreements, an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial (a rebate or a penalty) but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask "what happens if the SLOs aren't met?"

SRE doesn't typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs.

Choosing appropriate metrics to measure your service is important; how do you go about identifying what metrics are meaningful to your service or system?

Monitoring is important to measure the service's alignment with business goals.

Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of your system unexamined.

User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?

Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it?

Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion?

Start by thinking about (or finding out!) what your users care about, not what you can measure. For example: 99% of Get RPC calls will complete in less than 100 ms.

If you have users with heterogeneous workloads such as a bulk processing pipeline that cares about throughput and an interactive client that cares about latency, it may be appropriate to define separate objectives for each class of workload:

  • 95% of throughput clients' Set RPC calls will complete in < 1 s.

  • 99% of latency clients' Set RPC calls with payloads < 1 kB will complete in < 10 ms.
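Percentile targets like these can be checked against measured samples with a simple nearest-rank percentile; a minimal sketch (helper names are hypothetical):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def meets_latency_slo(latencies_ms, pct, threshold_ms):
    """True if the pct-th percentile latency is under the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms

set_rpc_ms = [2, 3, 4, 5, 6, 7, 8, 9, 9, 50]    # one slow outlier
print(meets_latency_slo(set_rpc_ms, 90, 10.0))  # True  (p90 = 9 ms)
print(meets_latency_slo(set_rpc_ms, 99, 10.0))  # False (p99 = 50 ms)
```

Note how a single outlier fails a p99 target while leaving p90 comfortably within budget, which is why tail percentiles, not averages, are the usual latency SLIs.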

It is better to allow an error budget (a rate at which the SLOs can be missed) and track that on a daily or weekly basis.

An error budget is just an SLO for meeting other SLOs! https://sre.google/sre-book/embracing-risk#xref_risk-management_unreliability-budgets

Publishing SLOs sets expectations for system behavior.

Development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon the reliability of a service.

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter. The actual uptime is measured by a neutral third party: our monitoring system. The difference between these two numbers is the "budget" of how much "unreliability" is remaining for the quarter. As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed. For example, imagine that a service’s SLO is to successfully serve 99.999% of all queries per quarter. This means that the service’s error budget is a failure rate of 0.001% for a given quarter. If a problem causes us to fail 0.0002% of the expected queries for the quarter, the problem spends 20% of the service’s quarterly error budget.
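The 20% figure follows from simple division; as a sketch:

```python
def budget_spent_fraction(failure_rate: float, slo: float) -> float:
    """Fraction of the error budget consumed by a given failure rate."""
    error_budget = 1.0 - slo          # 0.001% for a 99.999% SLO
    return failure_rate / error_budget

# Failing 0.0002% of quarterly queries against a 99.999% SLO
# spends 20% of the quarter's error budget.
print(round(budget_spent_fraction(0.000002, 0.99999), 2))  # 0.2
```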

The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

Many products use this control loop to manage release velocity: as long as the system's SLOs are met, releases can continue. If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on. More subtle and effective approaches are available than this simple on/off technique: for instance, slowing down releases or rolling them back when the SLO-violation error budget is close to being used up.

If you exceed your (error) budget, you're violating the SLO.

What Is an Error Budget? It's the acceptable amount of unreliability your system is allowed, based on the SLO.

If SLO is 99.99%, your error budget is 0.01% of time (or failed requests, etc.) in a given period.

Use for postmortems: when an incident happens, how much budget did it consume?

Helps prioritize severity and track whether SLOs were violated.

Let's say:

  • Monthly error budget: 4.3 minutes

  • A deployment caused 3 minutes of 500 errors

  • Only 1.3 minutes left → you decide to:

    • Pause further deployments

    • Roll back risky changes

    • Prioritize reliability tickets
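That decision process can be sketched as a small helper (the thresholds and return strings are illustrative assumptions, not standard values):

```python
def release_decision(budget_minutes: float, consumed_minutes: float) -> str:
    """Suggest an action based on the remaining monthly error budget."""
    remaining = budget_minutes - consumed_minutes
    if remaining <= 0:
        return "freeze releases"
    if remaining < 0.5 * budget_minutes:   # illustrative 50% threshold
        return "pause deployments, prioritize reliability work"
    return "keep shipping"

# 4.3-minute monthly budget, 3 minutes burned by one bad deployment
print(release_decision(4.3, 3.0))  # pause deployments, prioritize reliability work
```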

  • SLI/SLO dashboards in Prometheus + Grafana

  • Error budget burn rate alerts

  • SRE tools: Nobl9, Sloth, Google's SLO Generator, etc.
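The burn-rate alerts mentioned above compare the observed error rate to the budgeted rate; a burn rate of 1.0 means the budget would be exactly exhausted at the end of the SLO window. A minimal sketch:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the error budget is burning relative to plan.
    1.0 = exactly on budget; > 1.0 = budget exhausts early."""
    return observed_error_rate / (1.0 - slo)

# 0.2% errors against a 99.9% SLO burns the budget 2x too fast
print(round(burn_rate(0.002, 0.999), 1))  # 2.0
```

Alerting tools typically fire when the burn rate over a short window exceeds a multiple of 1.0, so the team is paged long before the budget is actually gone.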


Runbook vs Cookbook

In the SRE (Site Reliability Engineering) world, runbooks and cookbooks are essential parts of operational documentation, but they serve slightly different purposes.


📘 Runbook (a.k.a. Playbook)

A runbook is a step-by-step guide for handling known operational tasks or incidents. It's designed for quick, reliable action during incidents or alerts.

🔹 Purpose:

  • To reduce MTTR (Mean Time to Recovery)

  • To standardize responses

  • To allow on-call engineers (even junior ones) to respond quickly

🔹 Example Use Cases:

  • Restarting a failed Kubernetes pod

  • Handling high CPU alert

  • Database failover process

  • Rolling back a bad deployment

🔹 Typical Format:


📙 Cookbook

A cookbook is more like a collection of recipes or guides for how to perform common operational tasks — often covering setup, configuration, and maintenance rather than emergencies.

🔹 Purpose:

  • To help team members learn how to perform tasks

  • For reference when provisioning or configuring systems

  • Useful in onboarding and knowledge sharing

🔹 Example Use Cases:

  • Setting up Karpenter on EKS

  • Creating a new GitHub Actions CI/CD pipeline

  • Provisioning a new RDS instance

  • Rotating AWS secrets

🔹 Typical Format:


🧠 Summary Table

| Feature     | Runbook                       | Cookbook                       |
|-------------|-------------------------------|--------------------------------|
| Focus       | Incident response             | Setup/config/maintenance       |
| Time of Use | During outages/alerts         | During planning/standard tasks |
| Style       | Reactive, procedural          | Instructional, how-to          |
| Audience    | On-call engineer              | Any DevOps/SRE team member     |
| Example     | How to recover from pod crash | How to deploy Superset on K8s  |

