Documenting Root Cause Analysis
Inevitably in the world of systems something will break, and a “Root Cause Analysis (RCA),” “Incident Analysis” or “After Actions” document will need to be written. Many otherwise capable IT types freeze at the very thought of documenting an issue, so in this post we’ll cover an easy format to follow.
Documenting root cause analysis around an incident starts with keeping good notes during the incident. I jot down the time and any facts I want to remember for later. Any metrics pertinent to the issue should also be recorded (such as transaction volumes, CPU usage, throughput or impacted systems/users).
There are five major sections to an RCA document. We’ll explore each in detail:
- Executive Summary – This is the high level version of what happened. Since this goes to executives, and many times is the only thing they’ll read, it needs to be clear, concise, and jargon free. I find it is useful to assume the executive reading this may not have a technical background, so keeping it high level helps.
- While this is always the first thing in an RCA document, I find it is often easier to write it last, once all the pertinent facts are understood.
- Impact – Identify the impact in terms business people can relate to. Some organizations count user outage minutes (number of users x length of outage); others state the effect directly, such as “not able to process any orders for 30 minutes.” Some businesses will sustain only minor impact from an outage if their customers are captive (such as online banking being down for a bank). Recurring issues, however, will impact the business.
- Timeline – The timeline needs to show the major activities from the beginning of the issue to the resolution/mitigation. The notes taken during the event are useful, and system log entries, service desk notes, and emails often help with time stamping.
Depending on the duration of the issue, the amount of detail included in the timeline will need to be adjusted. A second by second analysis isn’t needed unless relevant to the issue.
Once the timeline is constructed, review it for any improvement opportunities. Large incidents often take time to “declare” because the engineers are looking at individual symptoms and not gaining insight into the overall pattern. Timeline analysis often yields very valuable lessons.
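If the timeline is kept as timestamped entries, a few lines of Python can pull out figures such as time to declare and the user outage minutes mentioned under Impact. This is only a rough sketch: the timestamps, event text, and user count below are invented for illustration, and the helper names (`minutes_between`, `user_outage_minutes`) are my own.

```python
from datetime import datetime

# Hypothetical timeline entries (timestamp, event). All values are invented.
timeline = [
    ("2024-01-15 09:02", "First alert: slow response times"),
    ("2024-01-15 09:20", "Incident declared"),
    ("2024-01-15 09:32", "Service restored"),
]


def minutes_between(start, end):
    """Whole minutes elapsed between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def user_outage_minutes(impacted_users, outage_minutes):
    """Number of users x length of outage."""
    return impacted_users * outage_minutes


# How long it took to "declare" the incident after the first symptom.
time_to_declare = minutes_between(timeline[0][0], timeline[1][0])

# Total outage length, from first symptom to resolution.
outage_length = minutes_between(timeline[0][0], timeline[-1][0])

impacted_users = 1200  # assumed figure, for illustration only
total_impact = user_outage_minutes(impacted_users, outage_length)

print(f"Time to declare: {time_to_declare} minutes")
print(f"Outage length: {outage_length} minutes")
print(f"User outage minutes: {total_impact}")
```

A long gap between the first symptom and the declaration is exactly the kind of improvement opportunity the timeline review is meant to surface.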
- Issues – When a vendor is asked for a Root Cause Analysis, they often identify a single topic and the associated root cause. While important, there are often many issues in a given incident, and executive management will look to the author (and/or team) to provide all issues.
On any given issue, engineers often provide a first-order analysis without having identified the root cause. “High CPU” is rarely the root cause of a performance issue; it is usually a symptom of something deeper.
To get to the root cause, one technique is to ask “WHY” five (or more) times.
For example:
Problem: poor performance
1 Why – High CPU
2 Why – The application was in a loop
3 Why – The database connection was lost, and the application kept retrying
4 Why – The network had an issue
5 Why – Switch supervisor failure
Only when the answers to the “whys” are exhausted will the root cause become apparent, and only then can a corrective action plan be put into place.
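For anyone who keeps RCA notes in a structured form, the chain of whys above can be captured as a simple ordered list, with the last answer treated as the candidate root cause. This is just a sketch of one way to record it; the function names are my own and nothing here is a standard tool.

```python
# The "five whys" chain from the example above, in the order the questions were asked.
problem = "Poor performance"
whys = [
    "High CPU",
    "The application was in a loop",
    "The database connection was lost, and the application kept retrying",
    "The network had an issue",
    "Switch supervisor failure",
]


def root_cause(answers):
    """The last answer in the chain is the candidate root cause."""
    return answers[-1]


def format_five_whys(problem, answers):
    """Render the chain as text that can be pasted into the Issues section."""
    lines = [f"Problem: {problem}"]
    for number, answer in enumerate(answers, start=1):
        lines.append(f"Why {number}: {answer}")
    lines.append(f"Root cause: {root_cause(answers)}")
    return "\n".join(lines)


print(format_five_whys(problem, whys))
```

The chain is not limited to five entries. If another “why” still has an answer, append it and the root cause moves with it.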
BTW, in my experience the most common RCA from a communications carrier is NTF (No Trouble Found).
- Corrective Action Plan/Mitigations – With root cause in hand and clarity around the issues, a corrective action plan can be devised. As with any plan, the task, duration and resource should be identified. Sometimes the corrective action will already be completed; other times it will spawn a project (often related to a budget consideration).
Tasks from Corrective Action Plans need to be managed like any effort.
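One lightweight way to keep those tasks visible is to record each with its task, duration and resource, plus a flag for whether it is done. The sketch below assumes a plain Python list is enough; the `CorrectiveAction` class, the `open_actions` helper, and the sample tasks, durations and teams are all hypothetical, loosely based on the five whys example.

```python
from dataclasses import dataclass


@dataclass
class CorrectiveAction:
    task: str           # what needs to be done
    duration_days: int  # estimated effort, in days
    resource: str       # person or team responsible
    done: bool = False  # whether the action has been completed


def open_actions(plan):
    """Return the actions that still need to be managed to completion."""
    return [action for action in plan if not action.done]


# Hypothetical plan; tasks, durations and teams are invented for illustration.
plan = [
    CorrectiveAction("Replace failed switch supervisor", 2, "Network team", done=True),
    CorrectiveAction("Limit database reconnect retries in the application", 10, "Application team"),
]

for action in open_actions(plan):
    print(f"OPEN: {action.task} ({action.duration_days} days, {action.resource})")
```

Anything still showing as open after the RCA is published is a candidate for the service desk system or a project plan, wherever the organization normally tracks work.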
It’s very important that sufficient time be put into developing the RCA and associated corrective action plans. These documents have a way of taking on a life of their own, and often find their way into the hands of internal or external auditors.
Be fully truthful, and not alarmist or inflammatory, in your analysis.
How an organization reacts to a crisis is very important, and the RCA is a big part of it.