Why do you need to carry out major problem reviews?

There was a discussion of “major problem reviews” on the Facebook back2ITSM group last week. Lots of questions were asked, and the list that follows is my summary of the issues raised:

  • How to identify a major problem and what distinguishes it from other problems
  • Whether there are any business benefits to doing a major problem review and if so what they are
  • How a major problem review differs from Continual Service Improvement (CSI) or root cause analysis of a major problem
  • What the first steps of a major problem review should be, and what practical guidance there is for undertaking one

So many inter-related problem management issues are being raised here that I thought it would be better to assemble my thoughts into a blog. I can then post a link to this blog into the Facebook thread.

Separating problems from incidents

Many people confuse incidents and problems, so let’s start by making sure the difference is clear.

An incident is an interruption to an IT service, or a reduction in quality of a service. Incidents have an impact on users, and on the business, and the purpose of incident management is to restore normal service. There is no need to understand the underlying cause, and definitely no need to rectify any technology faults, in order to resolve an incident. For example, if a user's PC is faulty, the incident could be resolved by swapping the PC for a working one, provided that the replacement has everything they need to continue working.

A problem is the underlying cause of one or more incidents. The purpose of problem management is to manage the problem, eliminating future incidents where possible and reducing the impact of incidents that can’t be prevented. Typical activities of problem management are documenting workarounds, investigating root causes, and submitting change requests to fix infrastructure and applications.

How do you identify major problems?


When you want to distinguish major problems from other problems you should start by understanding how to prioritise your problems. Things to think about when deciding the priority of a problem might include:

  • How many related incidents have there been?
  • What business impact have the incidents had?
  • Is the incident frequency increasing or reducing?
  • Is the business impact of the incidents increasing or reducing?
  • How recent were the incidents?
  • Is there an expectation that there will be future incidents?
  • How effective is any workaround that has been developed?
  • What impact have the incidents had on customer satisfaction?
  • What impact have the incidents had on the ability of the IT organization to meet agreed targets (SLAs)?
  • What financial or other impact have the incidents had on the IT organization (for example what IT resources have been required to resolve these incidents)?

For me a major problem is one where:

  • There have been one or more incidents that had a significant business impact
  • You (or your customer) expect that there may be future repeats of the same incidents
  • You need to ensure that future related incidents have minimal business impact

Your definition may be different, but this might make a good starting point for your discussions.
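
If you record problems in a tool, the three criteria above can be written down as a simple check. This is only a minimal sketch of my own definition, not a standard: the `ProblemRecord` class and its field names are hypothetical, and you would adapt them to whatever your own definition and tooling look like.

```python
from dataclasses import dataclass


@dataclass
class ProblemRecord:
    """Hypothetical problem record; the field names are illustrative assumptions."""
    significant_impact_incidents: int   # incidents so far with a significant business impact
    repeat_expected: bool               # you (or your customer) expect the incidents to recur
    must_minimise_future_impact: bool   # future related incidents must have minimal impact


def is_major_problem(problem: ProblemRecord) -> bool:
    """Return True only when all three criteria from the list above hold."""
    return (
        problem.significant_impact_incidents >= 1
        and problem.repeat_expected
        and problem.must_minimise_future_impact
    )


# Illustrative usage with made-up values.
candidate = ProblemRecord(
    significant_impact_incidents=2,
    repeat_expected=True,
    must_minimise_future_impact=True,
)
print(is_major_problem(candidate))
```

Requiring all three criteria keeps the check strict: a problem with serious past incidents that nobody expects to recur is left to your normal problem management process.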

How should you manage a major problem?

I don’t intend to provide detail on management of problems in this blog, as that would take me way past an acceptable length for a blog. I just want to share some of the things that I expect to happen during the management of a major problem.

I take quite a controversial position here. Many people think that the purpose of problem management is to understand the root cause of incidents. I think that root cause analysis is a useful technique, but I don’t actually care about root causes – what I care about is whether or not I have prevented the problem from happening again, or at least done something to reduce the impact next time.

So I believe that the first thing to do during major problem management is to devise a workaround. It may not be perfect, but you need to decide quite quickly what you’ll do if the problem happens again. If your best technical people think about this first, then they can give the service desk and the level 2 support teams some actions to take to mitigate the problem next time it happens. This will directly address the impact on the business, and may even be enough to reduce the priority of the problem, in which case you may no longer have a major problem to manage and you can revert to your normal problem management process.

Once you have a workaround in place you can move on to root cause analysis, and problem rectification, where necessary. But you will always need to monitor your workaround to make sure it is effective and you may need to improve it.

There are lots of great techniques for investigating problems, and I will leave you to research these yourself.

Major problem reviews

And that gets me to the point where I can write about major problem reviews. Why should you do them? What value do they have?

Often when people are working on problems they focus on a single “root cause”, but any real problem has a large number of contributory causes. In resolving the problem you may have just fixed one specific cause, but there may be lots of other things that also need to be addressed. A major problem review is an opportunity to identify some of these other causes, and do something about them.

For example, if the root cause of the incidents was a software bug, then the fix may have been a software patch. Other possible causes that you might need to address include:

  • How was the bug introduced? Do you have the right software development skills, tools and standards?
  • Why was the bug not discovered during testing? Do you need to consider making improvements to testing, or to test environments?
  • Why did it take so long to identify, diagnose and rectify the bug? Do you have the right application instrumentation? Are the right skills and tools in place to enable support personnel to detect and debug problems like this?

Depending on the problem, there may be a large number of underlying issues. One useful way to identify these is to use the 5 Whys technique. Ask why the incident happened, and whatever answer you get, ask why that happened. Keep asking why until you have fully understood what happened (this may take more than 5 questions, or fewer). Once you have identified these contributory causes, add them to a risk register, or to a continual improvement register, so that they can be seen and prioritized.

Before you can ask these questions, you need to make sure you have all the information you're going to need. Get a detailed timeline of what happened. I'm not talking about just an incident timeline (although you might need that as well), but a timeline that runs from when the incident started to when you closed the problem, including the problem analysis. This will involve interviews with lots of people, and you need to make sure they all understand that the purpose is not to blame anyone, but to identify improvement opportunities. An organization that blames people after a major problem rarely finds out the truth about what really happened.
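
To make the 5 Whys step a little more concrete, here is a minimal sketch of recording a chain of "why" answers and turning each one into an entry for a risk register or continual improvement register. The `WhyChain` class, its method names and the sample answers are all hypothetical illustrations (the answers loosely follow the software bug example earlier), not output from any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """Hypothetical record of a 5 Whys session for one incident."""
    incident: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next 'why?' in the chain."""
        self.answers.append(answer)

    def register_entries(self) -> list[str]:
        """Turn every answer into a numbered candidate entry for a register."""
        return [
            f"Contributory cause {number}: {answer}"
            for number, answer in enumerate(self.answers, start=1)
        ]


# Illustrative usage; the chain can be shorter or longer than five answers.
chain = WhyChain(incident="Order entry service unavailable")
chain.ask_why("A software bug crashed the application")
chain.ask_why("The bug was not discovered during testing")
chain.ask_why("The test environment does not match production")

for entry in chain.register_entries():
    print(entry)
```

Every answer in the chain becomes an entry, not just the last one, because each contributory cause is a separate improvement opportunity that needs to be seen and prioritized.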

Another reason for carrying out a major problem review is to learn from the problem investigation, to try to reduce the impact of other future problems. Every major problem should be seen as an opportunity to learn from what you did, and improve for next time. Questions you could ask yourself include:

  • How quickly did you realise this was a major problem? How could you have done this faster?
  • Did you engage the right people to work on the problem? Did you get the people you needed quickly enough?
  • Did you communicate well with customers and users? Did they have all the information they needed, when and where they needed it?
  • Did the technical and problem management personnel communicate well with each other? Was there wasted effort or was all the investigative work well focussed?
  • Did you get an effective workaround in place, or did you make the customer and users wait while you investigated the root cause?
  • Did you have the right tools and resources to investigate the problem?

You can probably think of many more questions that should be asked, and when you have answered these questions you will almost certainly have identified lots more things to be added to your risk register or continual improvement register, and that will help you to get better at managing major problems in the future.

 


 
