Journal - Gary L Kelley

Entries in incident (3)

The Importance of Staff & Shifts

Monday, May 6, 2013 at 1:47PM

In the course of our business, we see many data center and application migrations, as well as high-severity issues. One observation we always share with our clients is to plan for staff rotation. As you might expect, some listen and others do not. Here’s why it’s important.

Migrations often happen overnight…when the business sleeps or operates at a lower activity level. Organizations without satisfactory disaster recovery plans often incur an outage to do a migration. People are resilient for so many hours, and then they crash.

What often happens in migrations is everyone wants to be at the starting line, and the adrenaline keeps them engaged. If shifts are not “forced,” then there is often nobody left with “gas in their tank” to troubleshoot issues. People simply have to disengage to be fresh.

We saw this at a large customer where the team had persevered, declared success, and then dragged themselves home. There was an issue, and the on-call was unwilling to make changes as he didn’t understand the changes that had taken place (a change management issue.) NOBODY involved was responding to calls. As it turned out, the group’s manager lived in my town, and I got to knock on his door at 10:00AM on a Sunday morning. His wife wasn’t happy (he had been up all night) and did indeed get him up. While he resolved the issue, a few months later he resigned and went to work at a different company.

In this case, the team was not structured to handle a multi-day issue, and the response was poor.

In another case, new virus definitions in a client’s antivirus system flagged the operating system as bad and quarantined it. The client had a policy to delete quarantined files, so with the speed of automation thousands of operating systems were deleted.

The senior manager quickly determined this would require a sustained 24/7 response, and teams were “nominated” to cover 12-hour shifts. We were asked to help on a sustained basis, providing process oversight and helping keep shift turnovers crisp.

To the credit of the senior manager, this approach allowed a sustained response as systems were recovered from (gasp!) tape.

Large IT shops often run with multiple shifts, so a technical response comes more naturally. Smaller shops tend to have a 24x7 operational capability, but may lack a detailed technical response around the clock.

When planning or reacting to major events, think in terms of how to rotate your staff for a sustained time.
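
As a rough illustration of planning that rotation, here is a minimal Python sketch that lays out back-to-back shifts for a set of teams. The function name, the team names, and the start time are hypothetical, and the 12-hour default simply mirrors the shifts described above; a real plan would also need named leads and turnover time.

```python
from datetime import datetime, timedelta

def build_rotation(start, teams, shift_hours=12, days=3):
    """Return (shift_start, shift_end, team) tuples giving continuous coverage."""
    if not teams:
        raise ValueError("at least one team is required")
    if shift_hours <= 0 or 24 % shift_hours != 0:
        raise ValueError("shift_hours must divide 24 evenly")
    shifts_per_day = 24 // shift_hours
    schedule = []
    for i in range(shifts_per_day * days):
        shift_start = start + timedelta(hours=i * shift_hours)
        shift_end = shift_start + timedelta(hours=shift_hours)
        # Teams take turns in order, so nobody works two shifts in a row
        # as long as there are at least two teams.
        schedule.append((shift_start, shift_end, teams[i % len(teams)]))
    return schedule

# Hypothetical example: two teams, starting Saturday evening.
rotation = build_rotation(datetime(2013, 5, 4, 20, 0), ["Team A", "Team B"], days=2)
for shift_start, shift_end, team in rotation:
    print(f"{shift_start:%a %H:%M} to {shift_end:%a %H:%M}: {team}")
```

The point of writing it down, even this simply, is that the second shift is named before the first one starts.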

Gary L Kelley

Documenting Root Cause Analysis

Monday, June 18, 2012 at 8:00AM

Inevitably in the world of systems something will break and a “Root Cause Analysis (RCA),” “Incident Analysis” or “After Actions” document will need to be written. Many otherwise capable IT types often freeze at the very thought of documenting an issue, and in this post, we’ll cover an easy format to follow.

Documenting root cause analysis around an incident starts with keeping good notes during an incident. I jot down the time and any facts I want to remember for later. Any metrics pertinent to the issue should also be recorded (such as transaction volumes, CPU usage, throughput or impacted systems/users.)

There are four major sections to an RCA document. We’ll explore each in detail:

Executive Summary – This is the high-level version of what happened. Since this goes to executives, and many times is the only thing they’ll read, it needs to be clear, concise, and jargon free. I find it is useful to assume the executive reading this may not have a technical background, so keeping it high level helps.
- While this is always the first thing in an RCA document, I find it is often easier to write this last…once all the pertinent facts are understood.

Impact – Identify the impact in terms business people can relate to. Some organizations count user outage minutes (number of users x length of outage); others state it plainly, such as “not able to process any orders for 30 minutes.” Some businesses will sustain only minor impact from an outage if their customers are captive (such as online banking being down for a bank). Recurring issues, however, will impact the business.
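
The user outage minutes figure is simple arithmetic, and a small Python sketch makes it concrete. The function name and the user counts below are hypothetical; only the "users x length of outage" formula comes from the description above.

```python
def user_outage_minutes(groups):
    """Sum users x outage minutes over a list of (users_affected, minutes) pairs."""
    total = 0
    for users_affected, minutes in groups:
        if users_affected < 0 or minutes < 0:
            raise ValueError("users and minutes must not be negative")
        total += users_affected * minutes
    return total

# Hypothetical example: 100 users down for 30 minutes,
# plus 20 more who stayed down for 45 minutes.
print(user_outage_minutes([(100, 30), (20, 45)]))
```

Summing by group is useful when different offices or systems came back at different times.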

Timeline – The timeline needs to show the major activities from the beginning of the issue to the resolution/mitigation. While the notes taken during the event are useful, any log entries in systems, notes in service desk systems, or emails are often useful for time stamping.

Depending on the duration of the issue, the amount of detail included in the timeline will need to be adjusted. A second by second analysis isn’t needed unless relevant to the issue.

Once the timeline is constructed, review it for any improvement opportunities. Large incidents often take time to “declare” because the engineers are looking at individual symptoms and not gaining insight into overall patterns. There are often very valuable learnings obtained from timeline analysis.
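
One way to assemble the timeline is to merge timestamped entries from each source and sort them, then measure how long it took to declare. The Python sketch below assumes each entry is a (timestamp, source, note) tuple and that the declaration is marked by the phrase "incident declared" in a note; the function names and all sample entries are invented for illustration.

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge (timestamp, source, note) entries from several sources into time order."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])

def minutes_to_declare(timeline, marker="incident declared"):
    """Minutes from the first recorded entry to the first note containing marker."""
    if not timeline:
        return None
    first_seen = timeline[0][0]
    for timestamp, _source, note in timeline:
        if marker in note.lower():
            return (timestamp - first_seen).total_seconds() / 60
    return None  # the incident was never formally declared

# Hypothetical sample data, invented for illustration only.
my_notes = [
    (datetime(2012, 6, 1, 9, 40), "notes", "Incident declared"),
    (datetime(2012, 6, 1, 10, 15), "notes", "Service restored"),
]
service_desk = [
    (datetime(2012, 6, 1, 9, 5), "service desk", "First ticket: slow orders"),
    (datetime(2012, 6, 1, 9, 20), "service desk", "Ticket volume rising"),
]
timeline = build_timeline(my_notes, service_desk)
print(minutes_to_declare(timeline))
```

A long gap between the first symptom and the declaration is exactly the kind of improvement opportunity the timeline review is meant to surface.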

Issues – When a vendor is asked for a Root Cause Analysis, they often identify a single topic and the associated root cause. While important, there are often many issues in a given incident, and executive management will look to the author (and/or team) to provide all issues.

On any given issue, engineers often provide a first order analysis of the issue, and have not identified root cause. “High CPU” as the root cause for a performance issue is rarely the root issue.

To get to the root cause, one technique is to ask “WHY” five (or more) times.

For example….

Problem: poor performance

1 Why – High CPU

2 Why – The application was in a loop

3 Why – The database connection was lost, and the application kept retrying

4 Why – The network had an issue

5 Why – Switch supervisor failure

Only when the answers to the “whys” are exhausted will the root cause become apparent and a corrective action plan put into place.
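
The chain above can be recorded as an ordered list, which is a handy way to keep the analysis honest in the RCA document. This Python sketch is only an illustration; the function names are hypothetical, and the last answer is the root cause only if nobody can answer another "why."

```python
def five_whys_report(problem, answers):
    """Format a problem and its chain of 'why' answers as report lines."""
    lines = [f"Problem: {problem}"]
    for number, answer in enumerate(answers, start=1):
        lines.append(f"{number} Why: {answer}")
    return lines

def deepest_cause(answers):
    """Return the last answer in the chain, the candidate root cause."""
    if not answers:
        raise ValueError("no 'why' has been answered yet")
    return answers[-1]

# The chain from the example above.
ANSWERS = [
    "High CPU",
    "The application was in a loop",
    "The database connection was lost, and the application kept retrying",
    "The network had an issue",
    "Switch supervisor failure",
]

print("\n".join(five_whys_report("poor performance", ANSWERS)))
print("Candidate root cause:", deepest_cause(ANSWERS))
```

Stopping at the first answer would have produced "High CPU," which is the first order analysis warned about earlier.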

BTW…it’s my experience the most common RCA from a communications carrier is NTF (No Trouble Found.)

Corrective Action Plan/Mitigations – With root cause in hand and clarity around the issues, a corrective action plan can be devised. As with any plan, the task, duration and resource should be identified. Sometimes the corrective action will be completed, other times it will spawn a project (often related to a budget consideration.)

Tasks from Corrective Action Plans need to be managed like any effort.

It’s very important sufficient time be put into developing the RCA and associated corrective action plans. These documents have a way of taking on a life of their own, and often find their way into internal or external auditor hands.

Be fully truthful, and not alarmist or inflammatory, in your analysis.

How an organization reacts to a crisis is very important, and the RCA is a big part of it.

Gary L Kelley

When to Declare an Incident

Monday, December 5, 2011 at 8:00AM

I was sitting with a client recently having a project discussion when a problem with email was called in.

No big deal. Large companies have problems every day.

A few minutes later another interruption, this time for phones in a remote office. OK, this is why we have staff.

And then another issue, this time with remote access.

The client was calmly processing these facts, and continuing the conversation. Perhaps they felt obligated with me being there. I interrupted, and said, “All these problems. Something larger must be going on. Should you declare an incident?”

Large IT shops are very familiar with large scale incidents, and have well-honed incident management processes. When there are major issues (like a total processing failure, or a major business impacting outage) the incident processes are automatically implemented.

It’s with incidents in the grey that companies are sometimes hesitant to invoke the incident management process, which often involves many people and full notifications to the business.

While there are books written around how to do incident management, when to declare an incident isn’t uniformly understood. Why? Because in IT every day we have a multitude of issues dealt with in the normal course of business. Declaring an incident is often viewed as a “big deal.”

It is a big deal. The best and brightest drop what they are doing and focus on the issue.

The catastrophic failures are easy…instant incident.

On others, tickets are often being opened on help desk(s) and routed for resolution. If you’ve ever been around a help desk, you know that if one minute they are busy and the next they are slammed, there is something going on.

When considering declaring an incident, POTENTIAL BUSINESS IMPACT is my metric.

So, one desktop down is important; 100 desktops down in a call center is a big deal.

A one person sales office having a phone issue is bad, and a 250 person HQ being without phones is really bad.
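
To make the comparison concrete, here is a minimal Python sketch that uses the number of people affected as a stand-in for potential business impact. The threshold of 50 is purely an assumption for illustration, as is the function name; every organization would set its own line, and headcount is only one proxy for impact.

```python
# Hypothetical threshold, chosen only for illustration.
DECLARE_THRESHOLD = 50

def should_declare(people_affected, threshold=DECLARE_THRESHOLD):
    """Suggest declaring an incident once enough people are potentially affected."""
    return people_affected >= threshold

# The four situations described above.
situations = {
    "one desktop down": 1,
    "call center desktops down": 100,
    "one-person sales office phone issue": 1,
    "HQ without phones": 250,
}
for name, people in situations.items():
    verdict = "declare an incident" if should_declare(people) else "handle as a normal ticket"
    print(f"{name}: {verdict}")
```

Agreeing on the threshold ahead of time takes some of the hesitation out of the grey-area calls.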

The way to minimize business impacts is through contingency planning. If one facility is down, the business/traffic is routed somewhere else.

Large companies do this as a matter of course; smaller companies often don’t feel they have the mass to successfully pull it off.

Declaring an incident should be celebrated as a way to get others to help quickly mitigate business impacts.

When have you “declared an incident?”

Gary L Kelley