Async Postmortems in a Global, Remote Team
Shipping is scary.
Doing our jobs as engineers requires going to work every day and pressing a big green button that could, in principle, break everything. For Ashby to succeed as a company, we must press that button as often as possible.
The systems we employ to make pressing that button safe — static types, automated tests, code review, et cetera — weren't invented here. They are the legacy of decades of software engineers who made terrible mistakes, learned from their failures, and documented what they learned to build a safer environment for all of us.
Ashby's postmortem process is crucial to continuing that legacy of improvement through failure. But just as importantly, our collaborative and blameless postmortem culture also helps us feel safe, knowing that we'll have the team's support when something inevitably goes wrong. Together we'll work through the problem, understand what failed, and build resilient systems that will stop it from happening again.
Google's SRE Book summarizes the basic concept of a blameless postmortem: "A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring." If you've worked as an engineer, you're undoubtedly already familiar with this idea.
In this article, I'll talk about the collaborative postmortem process we use:
- How we organize postmortems across time zones
- How we decide who should run a postmortem (and when)
- How we apply blamelessness to root cause analysis
- Where we still have room to grow
Postmortems Are Living Documents
At Ashby, a postmortem is a document created in our internal wiki (we use Slab) after an incident is resolved. It serves as a record that outlines the nature of the incident and provides a timeline of relevant events. It is also the central location for collaborative discussion of the events, root cause analysis, and any learnings or next steps.
Our postmortems live in shared documents. This allows us to contribute to the timeline and evidence collaboratively and use comments to tag participants and discuss root cause analysis. Collaborating through a document helps us avoid costly meetings while ensuring that every stakeholder has a chance to contribute to the process — even those working asynchronously from a different time zone. (For more on our collaborative philosophy, see How Thoughtful Communication Makes Us Unreasonably Productive!)
To kick off the postmortem process, an engineer starts by making a copy of our postmortem template. They'll gather initial evidence, fill out a basic timeline, and tag any participants in the incident. Then, they send the draft around to the team. From there, we work asynchronously to complete the timeline, ask the questions necessary for root cause analysis, and document any learnings or next steps.
Once we identify next steps, each one is ticketed as a GitHub Issue and added to our Postmortem Action Items GitHub Project. We then link each ticket from the postmortem document, making it easy to verify that everything was ticketed. The tickets go to an engineering manager to triage, prioritize, and assign.
Quick fixes, such as adding a lint rule or a Datadog monitor, are often made as soon as they're identified while writing the postmortem. Frequently, they were already done during the incident response. In either case, we note in the postmortem that the task is complete and doesn't need to be ticketed.
Postmortem Owner Rotation
Who's responsible for doing all that? Ashby's flat organizational structure and focus on autonomy help us move faster with fewer resources than our competitors. Still, it presents a problem for postmortem analysis: we don't have a dedicated SRE team that can afford to investigate and document every incident. To write a postmortem, some engineer needs to take on the task.
In the early days of Ashby, this would often mean the engineer "responsible" for the incident would write up the postmortem. There are a few issues with that habit:
- To know who should write the postmortem, we need to know what caused the incident. But determining the root cause of the incident is the purpose of the postmortem!
- These postmortems would often start with a lot of "I" statements: "I shipped commit ead673a", "I misunderstood the documentation," and so on. "I" statements go against the basic tenet of blamelessness, as noted in PagerDuty's The Blameless Postmortem: "The goal of the postmortem is to understand what systemic factors led to the incident and identify actions that can prevent this kind of failure from recurring. A blameless postmortem stays focused on how a mistake was made instead of who made it."
- We take pride in the quality of our work, and when we ship broken code, it's natural to feel some guilt and personal responsibility for the consequences. Being responsible for organizing the postmortem adds a burden to the person who is already most psychologically affected by the incident. As the SRE Book writes: "Writing a postmortem is not punishment — it is a learning opportunity for the entire company."
To address these issues, we decided to switch to a round-robin system for distributing postmortem ownership. When an incident happens, the next engineer in the rotation is responsible for gathering initial evidence, creating the Slab document, and facilitating collaboration and root cause analysis within the team. That helps us distribute the load of organizing postmortems while also engaging more engineers in the postmortem process.
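To make the mechanics concrete, here's a minimal Python sketch of a round-robin rotation. The class, the optional skip-participants behavior, and the names in the usage example are illustrative assumptions, not a description of Ashby's actual tooling.

```python
from itertools import cycle


class PostmortemRotation:
    """Round-robin assignment of postmortem ownership."""

    def __init__(self, engineers):
        self.engineers = list(engineers)
        self._cycle = cycle(self.engineers)

    def next_owner(self, participants=frozenset()):
        # Optionally skip incident responders, so the facilitator can
        # reevaluate the evidence with fresh eyes.
        for _ in range(len(self.engineers)):
            candidate = next(self._cycle)
            if candidate not in participants:
                return candidate
        # Everyone was involved in the incident; assign the next
        # person in the rotation anyway.
        return next(self._cycle)


rotation = PostmortemRotation(["ana", "ben", "chris"])
rotation.next_owner()                      # -> "ana"
rotation.next_owner(participants={"ben"})  # -> "chris" (ben is skipped)
```

In a real system the rotation state would live somewhere durable (a wiki page, a PagerDuty-style schedule, or a small database), so the pointer survives between incidents.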
That has a few additional benefits. For one, it helps us combat confirmation bias — "the tendency to look for and favor information that supports one's own hypothesis". For example, in one recent incident, the on-call engineers noticed suspiciously timed anomalies in our Postgres metrics and concluded that a database issue was likely the culprit. Other evidence supported that theory. During the postmortem, another engineer discovered that the root cause was a network split in our Redis provider, and the database issue was completely unrelated. Often a theory of the root cause is built by first responders during triage, and having a third party reevaluate the sequence of events with fresh eyes can be invaluable.
Furthermore, involving more engineers helps us share knowledge, including specific information about our infrastructure and basic SRE skills. That helps us prepare for future incidents and enables any engineer to help during a response to an incident. The more skill we spread across the team, the better.
When to Write a Postmortem
Each postmortem involves several engineers gathering evidence, analyzing, and assigning follow-up tasks. We want to reserve postmortems for serious and complex incidents that warrant the investment of engineering time. At Ashby, we don't formally assign a severity level to incidents (unless it's a security incident, which involves the security team and has its own SLA). For incidents that don't have security implications, any member of the team can request a postmortem for an incident. We use some guidelines to help determine when we'd generally want to write a postmortem:
- Incidents that involve loss (or potential loss) of customer data
- Incidents involving substantial or long-lasting degradation of core features, or any app downtime
- Incidents that block key workflows for multiple customers
- "Incidents" that turned out to be false alarms but still paged or caused extra work for the on-call engineer
When a postmortem isn't warranted, we still encourage engineers to write retrospective documents and distribute learnings to the team. However, out of respect for the postmortem process, we consider these documents "Retrospectives" rather than "Postmortems." The word choice makes it clear to everyone when we follow our exact postmortem process.
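As a rough illustration, the guidelines above could be encoded as a triage helper like the Python sketch below. The `Incident` fields and the "multiple customers" threshold are invented for illustration; in practice, this decision is a human judgment call, not code.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    data_loss_possible: bool = False
    core_feature_degraded: bool = False
    app_downtime: bool = False
    blocked_customers: int = 0  # customers with a blocked key workflow
    false_alarm_paged_oncall: bool = False


def warrants_postmortem(incident: Incident) -> bool:
    """Apply the guidelines; anything else gets a lighter 'Retrospective'."""
    return (
        incident.data_loss_possible
        or incident.core_feature_degraded
        or incident.app_downtime
        or incident.blocked_customers > 1  # "multiple customers"
        or incident.false_alarm_paged_oncall
    )
```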
Root Cause Analysis
A good root cause analysis should answer one question: what could we change to prevent a similar incident from happening in the future? The answers to that question become our follow-up tasks.
We often start our analysis with the Five Whys: take the initial problem statement and ask "Why?" of each successive answer until you arrive at a root cause. However, we're also sensitive to the tendency for such a linear series of questions to terminate at "first stories," i.e., stories revolving around human error. As John Allspaw writes in The Infinite Hows:
"When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn't have done it."
So, we don't strictly follow any formal approach. Instead, we ask open-ended questions that seek diverse narratives, digging beneath the superficial causes of a problem to find the systemic or procedural failures that allowed it to become as serious as it did.
Finding the root cause is always a collaborative process. We help each other to arrive at a deeper understanding of the incident by being curious, asking good questions, and using blameless language that allows engineers to acknowledge their mistakes comfortably.
Since the purpose of a postmortem is to build better processes, focusing on the actions of individuals can be distracting at best. At worst, it can make engineers feel unsafe when making changes to the codebase or taking decisive action in an emergency.
To combat this, we make a deliberate effort to use blameless language in a postmortem. That means using language that implicitly assumes the good faith of all participants and focuses on failures of systems and documented processes rather than the actions of individual engineers.
In particular, the body of the postmortem never contains the names of individual engineers. Instead, we keep a list of people involved in the incident in a separate Participants section. Rather than saying "@person shipped 2aab2c9", we say "2aab2c9 was shipped". The difference may seem inconsequential — anyone who cares can look up which engineer shipped 2aab2c9. But it can make a big difference to that engineer and help them shift from a defensive posture to a collaborative one.
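The rewrite from attributed to passive phrasing is mechanical enough to sketch. This hypothetical Python helper (the regex and function are ours, not part of any Ashby tooling) illustrates the transformation:

```python
import re

# Match "@someone shipped <sha>", where <sha> is an abbreviated or
# full git commit hash.
MENTION_SHIPPED = re.compile(r"@\w+ shipped ([0-9a-f]{7,40})")


def blameless(line: str) -> str:
    """Rewrite attributed shipping statements into passive voice."""
    return MENTION_SHIPPED.sub(r"\1 was shipped", line)


blameless("@person shipped 2aab2c9")  # -> "2aab2c9 was shipped"
```

In reality we rely on writers and reviewers to catch attributed language, not automation; the point is that the convention is simple and consistent enough to state as a rule.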
The goal is twofold: to focus on improvements we can make as a team and provide individuals with a visible sign that their actions will be interpreted in good faith. Using this language tells everyone involved that the team will always support each other in our response to mistakes and emergencies.
That support is essential for engineers who are early in their careers; shipping a major bug for the first time can shatter the confidence of a new engineer, severely hurting both their productivity and well-being. Careful language choice makes it natural to reframe their mistake as a learning opportunity and can even build their confidence to take measured risks in the future.
Room to Grow
Like everything at Ashby, our postmortem process isn't set in stone, and we're constantly updating it in response to feedback from stakeholders. For example, our QA team recently communicated that it wasn't always obvious which issues weren't purely infrastructural and could have been prevented by improvements to our QA process. To help, we've added a section to our postmortem template that describes how the incident manifested to users.
One major work in progress is how we prioritize issues from past postmortems. Usually, the most urgent improvements are made by participants during the postmortem process, if not during the incident itself. Less urgent issues are often moved to product area backlogs to be addressed by individual teams according to their bandwidth.
But many issues uncovered during postmortems are things that aren't critical today but will likely become urgent or even dangerous in the future — maybe after we scale or make a planned migration to a new piece of infrastructure. For those issues, keeping a close eye on them, and allocating the right resources at the right time, is crucial to preventing incidents. That's a part of the process we're still working on getting precisely right.
Before formalizing this process, our postmortems were conducted ad hoc and inconsistently. Whether we wrote a postmortem and documented (and completed) follow-ups varied greatly depending on who was involved in the incident.
As a result of a more consistent process, we've implemented many improvements to the systems that protect us from our own mistakes — improvements to developer experience, CI, infrastructure, documentation, and more. At the same time, our blameless culture has provided comfort and psychological support to several engineers who have gone through the rite of passage of shipping their first major bug and being at the center of an incident.
Shipping is scary. It won't ever be entirely safe to push that big green button. But at Ashby, we're always working to make it feel a little bit safer.
"Blamelessness" in Aerospace and Medicine
The origins of the concept of a "blameless postmortem" are well known in engineering lore, as the SRE Book states:
"Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every "mistake" is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place."
However, it's worth noting that the "blamelessness" applied in aviation and medicine differs from what we practice: there, mistakes often carry personal consequences.
The National Transportation Safety Board, or NTSB, conducts a detailed investigation of every aviation accident in the United States, and their reports and recommendations are widely credited for the incredible safety record of US airlines. But where human error is a factor in an accident, it's fairly common for the FAA to revoke or suspend that individual's license to operate an aircraft.
Likewise, in medicine, where the term "postmortem" has a literal meaning, the role of human error is more nuanced. Preventing future mistakes is essential, but so is punishing negligence or wrongdoing:
"Although a culture that does not place blame was a step in the right direction, it was not without its faults. This model failed to confront individuals who willfully and repeatedly made unsafe behavioral clinical practice choices. Disciplining health care workers for honest mistakes is counterproductive, but the failure to discipline workers who are involved in repetitive errors poses a danger to patients. A blame-free culture holds no one accountable and any conduct can be reported without any consequences. Finding a balance between punishment and blamelessness is the basis for developing a Just Culture."
As with most software engineering teams, the stakes at Ashby are never life or death. These low stakes allow us to lean into a radically blameless culture that prioritizes psychological safety, risk-taking, and high velocity. Mistakes are an acceptable tradeoff for shipping more code faster.