Incident and Problem Management

Don’t let persistent problems govern your department.

  • IT infrastructure managers have conflicting accountabilities. It can be difficult to fight fires as they appear while engaging in systematic fire prevention.
  • Repetitive interruptions erode faith in IT. If incidents recur consistently, why should the business trust IT to resolve them?

Our Advice

Critical Insight

  • Don’t risk muddling the chain of command during a crisis. Streamline the process. When senior technical staff are working on incidents, they report to the service desk manager.
  • Incidents defy planning, but problem management is schedulable. Schedule problem management; reduce unplanned work.
  • Just because a problem has not caused an incident doesn’t mean it never will. Get out in front of problems. Maximize uptime.

Impact and Result

  • Define the roles and responsibilities of the incident manager and the problem manager.
  • Develop a critical incident management workflow that will save money by streamlining escalation.
  • Create a problem management standard operating procedure that will reduce incident volume, save money, and allow upper tier support staff to engage in planned work as opposed to firefighting.

Incident and Problem Management Research & Tools

Start here – read the Executive Brief

Read our concise Executive Brief to find out why sub-optimal incident management could be costing you money, and how to implement incident, problem, and proactive problem management best practices.

1. Identify and manage major/critical incidents

Develop a plan to identify critical incidents and to resolve them with as little friction as possible.

2. Develop problem management procedures

Systematically identify problems, and develop a procedure for opening, resolving, and closing problem tickets.

3. Engage in proactive problem management

Develop thresholds for event management, predict problems, and communicate the importance of proactivity to stakeholders.


Member Testimonials

After each Info-Tech experience, we ask our members to quantify the real-time savings, monetary impact, and project improvements our research helped them achieve. See our top member experiences for this blueprint and what our clients have to say.

Overall Impact: 9.5/10
Average $ Saved: $13,949
Average Days Saved: 20

Client | Experience | Impact | $ Saved | Days Saved
--- | --- | --- | --- | ---
General Conference of Seventh-day Adventists | Workshop | 9/10 | $25,419 | 20
Oregon Secretary of State | Guided Implementation | 10/10 | $2,479 | N/A
Akin Gump Strauss Hauer & Feld LLP | Workshop | 9/10 | $30,999 | 5
The University of Texas at San Antonio | Guided Implementation | 8/10 | $68,199 | 120
The World Bank | Guided Implementation | 10/10 | $2,479 | 5
Alexion Pharmaceuticals Inc. | Guided Implementation | 10/10 | $100K | 20
Mott MacDonald LLC | Guided Implementation | 10/10 | $42,750 | 9
Shentel Management Company | Workshop | 10/10 | $46,097 | 5
University of North Texas System | Workshop | 10/10 | $34,099 | 75
Milwaukee Metro Sewerage District | Guided Implementation | 5/10 | $2,419 | 2
Varian Medical Systems, Inc. | Guided Implementation | 10/10 | $35,017 | 5
Lee County Clerk of Courts | Guided Implementation | 10/10 | N/A | N/A
Bermuda Monetary Authority | Workshop | 9/10 | N/A | N/A


Incident and Problem Management

Resolve service issues faster and eliminate recurring incidents.
This course makes up part of the Infrastructure & Operations Certificate.

  • Course Modules: 4
  • Estimated Completion Time: 2-2.5 hours
  • Featured Analysts:
    • John Annand, Senior Manager, Infrastructure Research
    • Fred Chagnon, Research Director, Infrastructure and Operations Research

Onsite Workshop: Incident and Problem Management

Onsite workshops offer an easy way to accelerate your project. If you are unable to do the project yourself, and a Guided Implementation isn't enough, we offer low-cost onsite delivery of our project workshops. We take you through every phase of your project and ensure that you have a roadmap in place to complete your project successfully.

Module 1: Incident Management

The Purpose

  • Develop a framework and process for managing incidents.

Key Benefits Achieved

  • Systematize critical incident identification.
  • Define escalation rules.

Step | Activities | Outputs
--- | --- | ---
1.1 | Roles and responsibilities for service desk and infrastructure | Incident Manager role description
1.2 | Review ticket categorization | Critical Incident Management SOP
1.3 | Prioritization schema |
1.4 | Define escalation rules, document and critique workflows |
1.5 | Establish SLAs | Internal SLA to service desk
1.6 | Develop the knowledgebase | Knowledgebase Article Template
1.7 | Develop KPIs | Incident management KPIs

Module 2: Problem Identification and Categorization

The Purpose

  • Identify and categorize incoming problems.

Key Benefits Achieved

  • Systematic problem intake process.
  • Clear problem categorization.
  • Identification of problem management roles and responsibilities.

Step | Activities | Outputs
--- | --- | ---
2.1 | Problem intake criteria and incident review process | Problem Management SOP
2.2 | Problem ticket categorization | Problem Ticket Template; Categorization and impact schema
2.3 | Role development for problem management | Problem Manager role description
2.4 | Problem management KPIs | Problem management KPIs

Module 3: Root Cause Analysis

The Purpose

  • Develop root cause analysis techniques and track problem status.

Key Benefits Achieved

  • Understanding of how to conduct root cause analysis.
  • Method for tracking problem status.

Step | Activities | Outputs
--- | --- | ---
3.1 | Root cause analysis exercises | Root cause analysis procedure
3.2 | Quantification of risk | Problem risk assessment
3.3 | Problem reporting requirements and audience | Problem status dashboard
3.4 | Update and maintain the knowledgebase |

Module 4: Proactive Problem Management

The Purpose

  • Develop proactive problem management procedures.

Key Benefits Achieved

  • Proactive problem management SOP, which will enable the prevention of interruptions.

Step | Activities | Outputs
--- | --- | ---
4.1 | Information sources for problem identification | Monitoring and alert thresholds
4.2 | Data quality participants and agenda review for active regular data analysis |
4.3 | Complete comprehensive visual SOP | Visual SOP

ANALYST PERSPECTIVE

Incident and problem management is something you are doing whether or not you have a formal process. Why not do it right?

“Incident response is the ER in the delivery of IT medical care. It is immediate, it can be chaotic, it is disruptive, and it is highly expensive. This is no time to be short on resources, confused about priorities, escalation, lines of communication, or who has responsibility to initiate extraordinary measures. Only through extensive and proper preparatory work can you maximize your chances of saving the patient.

With the occurrence of every incident there exists an opportunity to identify a problem. Understanding and correcting these underlying conditions is a parallel but distinctly separate activity. One that has different priorities, needs different information, and is methodical work. But one that is no less vital to the health of the business.

Our research helps you prepare for inevitable occurrence of incidents but, more importantly, it allows you to develop an active problem management process that in turn will decrease the number, severity, and time to resolution of those incidents. Let us help you turn disruption and chaos into planned work, optimizing the utilization of some of your highest paid resources.” (John Annand, Senior Manager, Infrastructure Practice, Info-Tech Research Group)

Go beyond reactive incident management; get to the root of the problems and proactively reduce incident volumes

This Research Is Designed For:

  • IT Infrastructure Managers
  • IT Service Desk Managers
  • IT Operations Managers

This Research Will Also Assist:

  • CIOs
  • CISOs

This Research Will Help You:

  • Develop a process for identifying, escalating, and managing IT incidents and problems.
  • Use root cause analysis to solve problems and reduce incident volume.
  • Communicate the business case for investing in proactive problem management.

This Research Will Help Them:

  • Improve IT’s efficiency
  • Manage security risks

Executive summary

Situation

  • IT is spending too much time and money on incidents. Users are losing too much productivity to routine avoidable incidents.
  • IT incidents are highly visible. The impact of poor incident and problem management procedures will ripple through the organization.

Complication

  • IT infrastructure managers have conflicting accountabilities. It can be difficult to fight fires as they appear while engaging in systematic fire prevention.
  • Repetitive interruptions erode faith in IT. If incidents recur consistently, why should the business trust IT to resolve them?

Resolution

  • Use the process described in this storyboard: a formal, systematic approach to incident and problem management that is incorporated into the service desk process.
  • Clearly define the roles of “incident manager” and “problem manager” and outline how incident tickets should make their way through the service desk process.
  • Make the case for proactive problem management, and successfully reduce the volume of unplanned work by integrating proactive problem management into regular IT activity.
  • Develop a comprehensive, visual standard operating procedure to guide responders to incidents and problems along each step of the way.

Info-Tech Insight

  1. Don’t risk muddling the chain of command during a crisis. Streamline the process. When senior technical staff are working on incidents, they report to the service desk manager.
  2. Incidents defy planning, but problem management is schedulable. Schedule problem management; reduce unplanned work.
  3. Just because a problem has not caused an incident doesn’t mean it never will. Get out in front of problems. Maximize uptime.

Incidents are the symptoms; problems are the disease

Incident

  • An IT incident occurs when something is not working the way it is supposed to – a fault or an error.
  • Common IT incidents:
    • Error messages
    • Network down
    • Printer or scanner not working
    • Physical issues with hardware (dropped laptop, lost mobile phone, liquid on keyboard)
  • Incident management involves receiving incident reports, categorizing incidents, and eventually resolving them in order to minimize downtime.

Problem

  • According to ITIL, a problem can be the cause of multiple incidents, an incident that impacts a large number of end users, or the discovery of a system’s deviation from expectations.
  • Common IT problems:
    • Frequent system crashes
    • Improperly installed drivers complicating use of accessories like printers
    • A constant stream of network outages
  • Problem management confronts root causes. It can be proactive or reactive.

Info-Tech Insight

Root cause analysis is not only for incidents. A high volume of routine service requests can be resolved by doing root cause analysis and changing processes to make them more efficient, saving time and money.

Poorly managed incident response can drown even the most buoyant organizations

  • Incident management is the most visible function of the IT department for many end users.
  • An incident management process that is integrated with the service desk will:
    • Shorten incident resolution times.
    • Minimize the overall impact of incidents on the business, in terms of both volume and severity.
    • Create a knowledgebase to optimize future incident resolution.

Without incident management:

  • IT incidents might go unresolved or forgotten, since there is no centralized way to track or prioritize incoming incident tickets – and no way to report or improve on the process.
  • Unnecessary escalation distracts tier two and three specialists who should be focused on project work.

With incident management:

  • Service desk technicians can leverage the documented institutional knowledge of the senior staff without disturbing them.
  • Documentation in a known error database (KEDB) provides a rich source of knowledge that can help the problem manager identify problems.

Info-Tech Insight

Incident management is focused narrowly on resolving specific service interruptions as they occur. While it is crucially important, it is only part of a complete service management plan.

Problem management will improve service quality and decrease support costs

A rigorous problem management process will decrease the number of incidents, reduce their time to resolution, and improve IT efficiency.

With Problem Management

  • Fewer total incidents.
  • Fewer recurring incidents.
  • Fewer critical/major incidents.
  • Faster determination of root cause (and less time searching for causes).
  • More accurate determination of root cause.
  • Faster problem resolution.
  • Less time/effort spent on resolving recurring incidents.
  • Less time/effort spent on problem resolution.

Without Problem Management

  • Time and effort wasted fixing the same thing over and over again.
  • High percentage of the IT budget is dedicated to firefighting.
  • Lack of end-user/executive confidence in IT platforms (hardware/software).
  • Service disruptions causing slow or halted productivity, impacting customers.
  • Frustrated end users.
  • Negative perception of the IT department by the business.

Info-Tech Best Practice

It is impossible to engage in systematic incident and problem management without a functioning service desk. If your service desk is not yet up to par, refer to Info-Tech’s Standardize the Service Desk blueprint for guidance.

Save up to $500,000 using Info-Tech’s incident management tools and templates

When IT does its job poorly, everyone’s ability to work is impacted

Incident management saves money

Assume that each incident takes an extra hour to resolve as a result of an inefficient process including improper prioritization and escalation.

Consider productivity costs:

Assume there are 10,000 incidents in a year that result in an extra hour of downtime. For a large organization (40,000+ employees), this could represent about 10% of incidents. In the United States, employees are paid an average of about $27 per hour. This would result in an additional annual expense of $270,000 in lost productivity.

Consider labor costs of support:

On average, an IT support technician makes about $20 per hour. The extra hour it takes them to solve an incident would result in $200,000 per year in extra costs.

Subpar incident management can cost nearly $500,000 per year ($270,000 in lost productivity plus $200,000 in support labor)!
(Sources: Info-Tech interview, Social Security Administration, OECD, and Glassdoor. This analysis is meant to illustrate incident management’s potential.)
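
The arithmetic behind this estimate can be reproduced in a few lines. The sketch below uses only the illustrative figures quoted above (10,000 incidents, one extra hour each, $27 and $20 hourly wages); the function and variable names are invented for illustration, and the inputs should be replaced with your own organization's numbers.

```python
# Illustrative inputs taken from the text above; substitute your own data.
AFFECTED_INCIDENTS_PER_YEAR = 10_000   # incidents that take an extra hour
EXTRA_HOURS_PER_INCIDENT = 1
EMPLOYEE_HOURLY_WAGE = 27              # average US employee, per the text
TECHNICIAN_HOURLY_WAGE = 20            # average IT support technician, per the text


def annual_cost(incidents: int, extra_hours: int, hourly_wage: int) -> int:
    """Cost of the extra time spent on incidents over one year."""
    return incidents * extra_hours * hourly_wage


# Lost productivity of the affected end users.
productivity_cost = annual_cost(
    AFFECTED_INCIDENTS_PER_YEAR, EXTRA_HOURS_PER_INCIDENT, EMPLOYEE_HOURLY_WAGE
)
# Extra labor of the technicians resolving the incidents.
support_cost = annual_cost(
    AFFECTED_INCIDENTS_PER_YEAR, EXTRA_HOURS_PER_INCIDENT, TECHNICIAN_HOURLY_WAGE
)
total_cost = productivity_cost + support_cost

print(f"Lost productivity: ${productivity_cost:,}")
print(f"Support labor:     ${support_cost:,}")
print(f"Total:             ${total_cost:,}")
```

Because both terms scale linearly with the extra time per incident, halving the delay halves the total under these assumptions.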

Info-Tech Best Practice

When the service desk or incident management team resolves an incident by providing a workaround and restoring service, it is essential to close the incident ticket and flag it for inclusion in the problem register so that the problem management team can prevent recurrence.

All incidents are not created equal: understand what makes an incident major

Example: Incident

Bill from the service desk receives an incident report from Pat in accounting, who is a service desk regular. Pat is locked out of the workstation and cannot complete any assigned tasks until the issue is resolved. Bill receives the incident ticket, reminds Pat that the triangle with the exclamation point in it means that caps lock is on, resets the password, and closes the incident ticket.

Takeaway: A single user was the only person directly impacted by the incident. Bill correctly surmised from the initial incident report that a single user forgetting their password does not represent a real threat to the business, and that finding the root cause is unlikely to be worth the effort. Sometimes people just forget their passwords, Bill thought. That’s why the helpdesk exists!

Example: Major Incident

The entire Payroll Department is locked out of their workstations one morning. Bill is overwhelmed with a flurry of calls, and he quickly realizes the extent of the incident. He notifies the service desk manager, and the major incident management procedure is initiated. All 36 passwords are reset, and the problem manager, who has also been notified because of the severity of the incident, begins root cause analysis, which reveals that the email notifying users of a password reset bypassed Payroll entirely.

Takeaway: Because a large number of people were unable to do their jobs, the incident management team automatically escalated the incident, created a problem ticket, and, after resolving the incident, found the cause and updated the knowledgebase, preventing the issue from recurring.

Info-Tech Insight

Every incident has a root cause. Not every incident requires root cause analysis, however. Simple incidents that are one-offs or otherwise minor can be solved with simple workarounds, or, if that is not possible, are unlikely to impact business processes in a meaningful way.

Incidents can become problems that require root cause analysis

A single critical incident can become a problem on its own, and so can a series of smaller incidents.

Info-Tech’s process aligns to the COBIT best-practice framework

Info-Tech aligns with COBIT, which is accepted industry-wide as a best practice framework. Adoption of these standards provides stakeholders with the confidence that the changes in infrastructure management are based on best practices.

DSS02: Manage Service Requests and Incidents | Info-Tech: Incident and Problem Management
--- | ---
DSS02.01: Define incident and service request classification schemes | Step 1.1: Identify critical incidents
DSS02.02: Record, classify, and prioritize requests and incidents |
DSS02.03: Verify, approve, and fulfil service requests |
DSS02.04: Investigate, diagnose, and allocate incidents |
DSS02.05: Resolve and recover from incidents | Step 1.2: Create a critical incident workflow
DSS02.06: Close service requests and incidents |
DSS02.07: Track status and produce reports |

DSS03: Manage Problems | Info-Tech: Incident and Problem Management
--- | ---
DSS03.01: Identify and classify problems | Step 2.1: Identify problems
DSS03.02: Investigate and diagnose problems |
DSS03.03: Raise known errors | Step 2.2: Develop a problem management workflow
DSS03.04: Resolve and close problems |
DSS03.05: Perform proactive problem management | Step 3.1: Predict problems

Info-Tech Insight

Critical incident and problem management need to be addressed separately from, but in conjunction with, the service desk. Critical incident tickets need to be closed in the same way as problem tickets.

Follow our service management methodology and get to your target state in stages

Info-Tech's service management methodology in stages with 'Incident Management' and 'Problem Management' as the 2nd and 3rd step in the process. Highlighted points on the line are 'Stabilize: Deliver stable, reliable IT services to the business, Respond to user requests quickly and efficiently, Resolve user issues in a timely manner, Deploy changes smoothly and successfully'; 'Service Provider: Create a service catalog that documents services from the user perspective, Measure service performance, based on business-oriented metrics'; 'Proactive: Understand your environment and take action to: Avoid/prevent service disruptions, Improve quality of service (performance, availability, reliability)'; and 'Strategic Partner'.

The resolution procedure for a critical incident is different from a typical incident

Critical, or major, incidents require root cause analysis

  • The nature of the business, the sensitivity of stored information, and the attitude of the executive will help you define what constitutes a critical incident for your organization.
  • Unlike regular incidents, critical incidents require the opening of a problem ticket; root cause analysis is necessary to determine incident source, and prevent it from reoccurring. Develop an explicit handoff process between tier 1 and tier 2 incident management and problem management. Reduce the number of critical incidents. Save money.
  • Develop a Critical Incident Workflow to systematize your critical incident handling procedure. Determine when incidents should be escalated to critical, and trace the path of the ticket through the process.
A bar chart titled 'Factors Determining Incident Severity' measuring the 'Percentage of Respondents' for various 'Factors'. 'Length of Outage' had ~80%, 'Customer Impact' had ~75%, and 'Employee Impact' had ~60%. (Source: xMatters)

Other potential variables impacting an incident’s criticality can include impact on specific services, the involvement of the executive, number of incoming calls, etc.
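
One way to picture the escalation decision is as a simple threshold check on the three factors named in the chart. The sketch below is only an illustration: the threshold values, the `Incident` fields, and the function names are hypothetical, since the text leaves the definition of "critical" to each organization.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds: every organization must set its own.
MAX_OUTAGE_MINUTES = 60
MAX_CUSTOMERS_AFFECTED = 0
MAX_EMPLOYEES_AFFECTED = 10


@dataclass
class Incident:
    summary: str
    outage_minutes: int
    customers_affected: int
    employees_affected: int


def is_critical(incident: Incident) -> bool:
    """Treat an incident as critical if any one factor exceeds its threshold."""
    return (
        incident.outage_minutes > MAX_OUTAGE_MINUTES
        or incident.customers_affected > MAX_CUSTOMERS_AFFECTED
        or incident.employees_affected > MAX_EMPLOYEES_AFFECTED
    )


def next_actions(incident: Incident) -> List[str]:
    """Regular incidents stay at the service desk; critical ones also open a problem ticket."""
    if not is_critical(incident):
        return ["resolve at the service desk", "close incident ticket"]
    return [
        "notify service desk manager",
        "invoke critical incident workflow",
        "open problem ticket for root cause analysis",
    ]


# The two lockout examples from earlier in this research.
lockout_one_user = Incident("One user locked out", 15, 0, 1)
lockout_payroll = Incident("Payroll department locked out", 15, 0, 36)

for incident in (lockout_one_user, lockout_payroll):
    print(incident.summary, "->", next_actions(incident))
```

Additional variables, such as executive involvement or call volume, could be added as further conditions in `is_critical`.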

Build a problem management workflow to reduce problem incidence and severity – and save money

Respond to problems as they occur

  • Use the following tools and templates to improve your problem management process through systematization:
  • Each completed template is a component of the overall incident and problem management standard operating procedure.

Problem management saves money

Assume that a large organization logs approximately 100,000 incidents annually. (An Info-Tech interviewee at a company with 40,000+ employees reported 12,000–15,000 incidents per month.) Incidents cost between $22 and $471 to resolve (figures supplied by MetricNet), depending on the tier of support they are escalated to. If problem management reduces incident volume by 10% (10,000 incidents), and each avoided incident would have cost only the tier one minimum of $22, the savings are $220,000 annually – a conservative estimate.

Actual savings will depend on a variety of factors, including incident severity, initial frequency of incidents, staff salaries, average time to resolution, and other similar qualities. But the point still stands: effective problem management will reduce costs for the company, either by preventing incidents, reducing their severity, or hastening their time to resolution. The business case for problem management is clear.
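
To see how sensitive the estimate is to incident volume, the calculation can be rerun with the interviewee's reported monthly range. This is a rough sketch under the same assumptions as the text (a 10% reduction, every avoided incident priced at the $22 minimum); the function name is invented for illustration.

```python
COST_PER_TIER_ONE_INCIDENT = 22   # low end of the MetricNet range quoted above
REDUCTION_PERCENT = 10            # assumed effect of problem management


def annual_savings(annual_incidents: int) -> int:
    """Savings if REDUCTION_PERCENT of incidents are prevented at tier one cost."""
    avoided_incidents = annual_incidents * REDUCTION_PERCENT // 100
    return avoided_incidents * COST_PER_TIER_ONE_INCIDENT


baseline = annual_savings(100_000)        # the round figure used in the text
low_volume = annual_savings(12_000 * 12)  # interviewee's low monthly figure
high_volume = annual_savings(15_000 * 12) # interviewee's high monthly figure

print(f"Baseline (100,000 incidents/year): ${baseline:,}")
print(f"At 12,000 incidents/month:         ${low_volume:,}")
print(f"At 15,000 incidents/month:         ${high_volume:,}")
```

Both volume-based figures come out above the baseline, which is consistent with the text calling $220,000 a conservative estimate.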

Info-Tech Insight

It is crucial to separate the roles of incident manager and problem manager. The incident manager should focus on finding a workaround. The problem manager is more concerned with root cause analysis. Sometimes these roles conflict!

Develop proactive problem management procedures and reduce unplanned work

Plan problem management

  • On average, IT departments report that nearly 70% of their time and resources are spent maintaining and fixing current infrastructure/applications (5.6 hours per day, or 30 weeks out of the year). This leaves little for project work like development and innovation.
  • IT innovation leadership explains 75% of variation in satisfaction with IT. Time spent firefighting reduces IT’s prestige and can damage your career prospects.
  (Source: Info-Tech’s CIO Business Vision, N=305)

“[The job of IT operations executives] is to ensure the fast, predictable, and uninterrupted flow of planned work that delivers value to the business while minimizing the impact and disruption of unplanned work so [they] can provide stable, predictable, and secure IT service.” (Gene Kim et al., The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win)

“Most companies don’t have people just waiting for incidents to happen. So when an issue occurs they have other stuff they’ve been delayed in working on because they’ve been engaged in restoring service from an outage.” (Hardy Baker, Problem Manager, Waste Management)

Info-Tech Insight

The purpose of incident management is to find an immediate solution or workaround when a service is interrupted. It is necessarily reactive. Problem management, however, is reflective and proactive. Leverage data gathered from incident reports to prevent service interruption using a systematic process.

Waste Management developed a mature incident and problem management process and has continually improved

CASE STUDY

Industry: Environmental Services
Source: Interview

Waste Management

Waste Management, headquartered in Houston, Texas, is a world-leading provider of environmental services, including waste collection, landfill operations, and recycling. It has more than 20 million municipal, commercial, and industrial customers across North America.

Problem Management Dashboard

Waste Management is a huge corporation. The service desk handles 12,000–15,000 incident tickets per month, but they had a problem getting executive support for their problem management initiative. Their problem manager recognized the threat posed by misalignment between the executive and the engineers on the ground, and created a problem dashboard to record all incidents in real time.

Results

The problem management dashboard lets executives see which problem tickets are currently open, the status of those tickets, and year-to-date information about problems and incidents. Armed with hard data, the problem manager found it easy to convince the executive to engage in problem management and systematic improvement.

“Part of my job as a problem manager is making the work of our support teams visible. It provides us great help to say, if so far this year the network has had 300% more problems than any other product, maybe we need to be focusing in that area to make improvements in general…I can come and say ‘here’s some real data about what has actually happened.’” (Hardy Baker, Problem Manager, Waste Management)

Collaborate across service desk tiers to maximize efficiency: a single process eliminates cross-purposes

When incidents are escalated above tier one, the service desk manager is in charge of their resolution to streamline the process.

A diagram with support tiers 1 to 3, 'Service Desk Manager' above them, and a graphic of an escalator as the background. 'Tier One: Service Desk (1)' is escalated to both the next tier and the SDM. 'Tier Two: Desktop Support (2)' is escalated to both the next tier and the SDM. 'Tier Three: Apps, NOC (3)' is escalated only to the SDM.

Info-Tech Insight

When tier two and three technicians are working on incidents, they report to the service desk manager. Streamline the process. Don’t risk muddling the chain of command during a crisis or losing track of support tickets by employing a separate process for some incidents.
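
The escalation rule in the diagram can be sketched as a small routing function: work moves up one tier at a time, but accountability always sits with the service desk manager. The tier names come from the diagram; the `Tier` enum, the `escalate` function, and the returned dictionary shape are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Optional


class Tier(IntEnum):
    SERVICE_DESK = 1      # Tier One
    DESKTOP_SUPPORT = 2   # Tier Two
    APPS_NOC = 3          # Tier Three


SERVICE_DESK_MANAGER = "Service Desk Manager"


def escalate(current: Tier) -> Dict[str, Optional[object]]:
    """Return who works the incident next and who stays accountable for it.

    Tiers one and two hand off to the next tier; tier three has no higher
    tier. In every case the service desk manager remains accountable.
    """
    next_tier = Tier(current + 1) if current < Tier.APPS_NOC else None
    return {"assigned_tier": next_tier, "accountable": SERVICE_DESK_MANAGER}


for tier in Tier:
    result = escalate(tier)
    print(f"{tier.name}: next tier = {result['assigned_tier']}, "
          f"accountable = {result['accountable']}")
```

Keeping a single accountable party in the return value mirrors the point of the diagram: the technician changes, the chain of command does not.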

Problem management is risk management

Risk management is your best job security strategy

  • According to ITIL, problems are the underlying cause of multiple incidents. Incident resolution is a stopgap: the problem is often still there, bubbling under the surface.
  • Problems may or may not cause incidents at a given time: a problem is a risk. Will the server go down during peak hours?

“Problem management is just a form of risk management. A problem is a known IT risk.” (Rob England, Managing Director, Two Hills Ltd, Blogger at ITskeptic.org)

The top four reasons CIOs lose their jobs:

  1. Project failures
  2. Security breaches
  3. Disaster recovery failures
  4. System failures
  (Source: Silverton Consulting)

Effective risk management and communication protects your position in the company – each of the reasons CIOs lose their jobs is tied to ineffective risk management. Hoping for the best is not a risk management strategy. Quantify the risk associated with IT problems and protect your job.

Understand your vulnerability and threat levels with the risk severity matrix

  • Likelihood: A straightforward way to rate likelihood is by determining what constitutes high to low likelihood:
    • E.g. Level 1 or very low likelihood could mean the event (threat or vulnerability) occurs once a year.
    • E.g. Level 4 or very high likelihood could mean 5 or more occurrences a year.
  • Impact: Impact can be broken down into how an event affects cost/revenue, program/projects, operations/services.
    • E.g. A Level 1 (low) impact could mean a $1,000 cost or revenue loss, project delays of up to 7 days, and 10 minutes of downtime during office hours.

A risk severity matrix with x-axis 'Likelihood' and y-axis 'Impact'. Low likelihood and impact is 'Low' risk, high likelihood and impact is 'High' risk, medium-low combos are 'Low' risk, high-low and high-medium combos are 'Medium' risk.
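
The matrix can be expressed as a lookup table and used to order a problem backlog. The sketch below follows the figure description, with two assumptions that are not stated in the source: the matrix is treated as symmetric in likelihood and impact, and the medium/medium cell is rated 'Medium'. The backlog entries are invented examples.

```python
LEVELS = ("low", "medium", "high")

# Keys are (lower level, higher level) pairs, so each combination appears once.
RISK_MATRIX = {
    ("low", "low"): "Low",
    ("low", "medium"): "Low",
    ("low", "high"): "Medium",
    ("medium", "medium"): "Medium",  # ASSUMPTION: not specified in the figure
    ("medium", "high"): "Medium",
    ("high", "high"): "High",
}


def risk_severity(likelihood: str, impact: str) -> str:
    """Look up the risk rating for a likelihood/impact pair."""
    key = tuple(sorted((likelihood, impact), key=LEVELS.index))
    return RISK_MATRIX[key]


# Hypothetical problem backlog: (description, likelihood, impact).
backlog = [
    ("Aging print server", "high", "low"),
    ("Single point of failure in core switch", "high", "high"),
    ("Outdated driver package", "low", "medium"),
]

# Work the highest-risk problems first.
PRIORITY = {"High": 0, "Medium": 1, "Low": 2}
prioritized = sorted(backlog, key=lambda p: PRIORITY[risk_severity(p[1], p[2])])

for description, likelihood, impact in prioritized:
    print(f"{risk_severity(likelihood, impact):<6} {description}")
```

An unknown level name raises a `ValueError`, which is a deliberate choice here: a mistyped rating should fail loudly rather than be silently scored.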

Info-Tech Insight

Risk is impossible to eliminate. Some known IT problems are simply not worth solving. Direct your resources to the head of the queue. Mitigate the most impactful and likely problems before moving down the list. Remember: it’s okay to have a backlog as long as prioritization is done based on evidence.

Make the business case for proactive problem management

Stakeholder buy-in is critically important

  • According to the Project Management Institute, actively engaged executive sponsors are the most important factor in project success.
  • Get your organization’s stakeholders on board with your proactive problem management plan.
Without a plan:

“We should implement a proactive problem management plan because it might help us prevent incidents in the future.”

With a plan:

“Proactive problem management has the potential to reduce incidents by 30%, saving us $17,000 per month.”

Benefits of stakeholder management

  • Better buy-in, understanding, and communication
  • Enhanced understanding of business needs
  • Improved stakeholder satisfaction
  • Better quality decision making
  • Improved transparency, trust, and credibility
  • Less waste and rework
  • Fewer surprise issues or challenges to the CIO’s agenda
  • Greater ability to secure support and execute the agenda
  • More effective cooperation on activities to get things done
  • Fewer failed initiatives
  • Improved risk management and mitigation
  • Better quality information and greater value from stakeholder input
  • Better understanding of IT performance and contribution

Follow Info-Tech’s methodology to manage incidents and problems

Phases | Phase 1: Identify and manage major/critical incidents | Phase 2: Develop problem management procedures | Phase 3: Engage in proactive problem management
--- | --- | --- | ---
Steps | 1.1 Identify Critical Incidents | 2.1 Identify Problems | 3.1 Predict Problems
 | 1.2 Develop a Critical Incident Workflow | 2.2 Develop a Problem Management Workflow | 3.2 Communicate the Importance of Proactivity
Tools and Templates | Incident Severity Assessment Tool | Risk Assessment Tool | Problem Management Standard Operating Procedure
 | Incident Manager Job Description | Problem Ticket Template | Incident and Problem Mgmt Communication Deck
 | Critical Incident Management Standard Operating Procedure | Problem Manager Job Description |
 | | Problem Management Standard Operating Procedure |

Info-Tech delivers: Use our tools and templates to accelerate your project to completion

  1. Incident Severity Assessment Tool
    Separate critical incidents from the pack. Determine when to invoke your critical incident workflow.
  2. Critical Incident Management SOP
    Develop a plan to escalate, resolve, and close incidents, making the most efficient use of your resources.
  3. Risk Assessment Tool
    Problems can be conceptualized as risk – evaluate risk by notionally measuring impact and frequency.
  4. Problem Management SOP
    Develop a plan to match incidents, point out problems – reactively and proactively – and conduct root cause analysis.
  5. Incident and Problem Management Communication Deck
    Communicate the importance of the project to stakeholders outside of IT using real data from your organization.

Insight breakdown

Phase 1 Insight

Don’t risk muddling the chain of command during a crisis.
Streamline your process. Upper tier incident response staff typically report to managers outside of the service desk hierarchy (infrastructure, applications). When a serious incident occurs, they need to drop everything to restore service. Make sure that there is no confusion of loyalties: when a critical incident forces their presence, upper tier staff report to the service desk manager.

Phase 2 Insight

Incidents defy planning, but problem management is schedulable.
Unplanned work is a constant threat to the productivity of project workers. Constant task-switching and the perception that one’s role is fighting fires, even in roles otherwise unrelated to service provision, are draining. Make problem management a part of your organization’s project work and avoid the headache that comes with spontaneity. Incident management cannot be planned, but IT problems are known risks; they can be identified, prioritized, and resolved on a schedule.

Phase 3 Insight

Just because a problem has not caused an incident doesn’t mean it never will.
ITIL defines problems as the cause or potential cause of one or more incidents. A problem is still a problem, even if it has yet to cause an incident. Engage in proactive problem management to prevent interruptions from occurring at all. For many users, service interruptions are the primary driver of satisfaction (or dissatisfaction) with IT. Leverage event management to keep the business satisfied.

Use these icons to help direct you as you navigate this research

Use these icons to help guide you through each step of the blueprint and direct you to content related to the recommended activities.

A small monochrome icon of a wrench and screwdriver creating an X.

This icon denotes a slide where a supporting Info-Tech tool or template will help you perform the activity or step associated with the slide. Refer to the supporting tool or template to get the best results and proceed to the next step of the project.

A small monochrome icon depicting a person in front of a blank slide.

This icon denotes a slide with an associated activity. The activity can be performed either as part of your project or with the support of Info-Tech team members, who will come onsite to facilitate a workshop for your organization.

Incident and Problem Management – project overview

1. Identify and manage major/critical incidents 2. Develop problem management procedures 3. Engage in proactive problem management
Supporting Tool icon

Best-Practice Toolkit

1.1 Identify Critical Incidents
1.2 Develop a Critical Incident Workflow
2.1 Identify Problems
2.2 Develop a Problem Management Workflow
3.1 Predict Problems
3.2 Communicate the Importance of Proactivity

Guided Implementations

  • Outline the potential benefits of a critical incident management procedure.
  • Review the results of the voting exercise and the list of exceptions.
  • Outline the benefits of a problem management regimen and the required resources.
  • Review the separated lists of incidents and problems.
  • Review the incident matching procedure.
  • Outline the required inputs for proactive problem management.
  • Review proactive problem management techniques.
  • Collate and present the visual SOPs.

Onsite Workshop

Module 1:
Incident Management
Module 2:
Problem Identification and Categorization
Root Cause Analysis
Module 3:
Proactive Problem Management
Phase 1 Outcome:
  • Incident manager job description
  • Critical incident management standard operating procedure
  • A list of the benefits of critical incident management
  • A list of critical applications
  • List of incidents sorted by severity
Phase 2 Outcome:
  • Problem manager job description
  • Separated list of incidents and problems
  • Incident matching process
  • Problem ticket template
  • An understanding of root cause analysis techniques
  • A framework for evaluating risk, and a prioritized list of IT problems
  • A list of key performance indicators
  • A meeting schedule for the problem management team
Phase 3 Outcome:
  • Completed stakeholder communication deck
  • Completed problem management standard operating procedure (proactive section)

Info-Tech offers various levels of support to best suit your needs

DIY Toolkit
"Our team has already made this critical project a priority, and we have the time and capability, but some guidance along the way would be helpful."

Guided Implementation
"Our team knows that we need to fix a process, but we need assistance to determine where to focus. Some check-ins along the way would help keep us on track."

Workshop
"We need to hit the ground running and get this project kicked off immediately. Our team has the ability to take this over once we get a framework and strategy in place."

Consulting
"Our team does not have the time or the knowledge to take this project on. We need assistance through the entirety of this project."

Diagnostics and consistent frameworks used throughout all four options

Workshop Overview

Workshop Day 1 (Onsite): Incident Management
Activities
  1. Roles and responsibilities for Service Desk and Infrastructure
  2. Review ticket categorization
  3. Prioritization schema
  4. Define escalation rules, document and critique workflows
  5. Establish SLAs
  6. Develop the knowledgebase
  7. Develop KPIs
Deliverables
  1. Critical Incident Management SOP
  2. Incident Management KPIs
  3. Internal SLA to Service Desk
  4. Knowledgebase Article Template
  5. Incident Manager Role Description

Workshop Day 2 (Onsite): Problem Identification & Categorization
Activities
  1. Problem intake criteria and incident review process
  2. Problem ticket categorization
  3. Role development for Problem Management
  4. Develop Problem Management KPIs
Deliverables
  1. Problem Management SOP
  2. Ticket Template
  3. Categorization and Impact Schema
  4. Problem Manager Role Description
  5. Problem Management KPIs

Workshop Day 3 (Onsite): Root Cause Analysis
Activities
  1. Root Cause Analysis exercises
  2. Quantification of risk
  3. Problem reporting requirements and audience
  4. Update and maintain the knowledgebase
Deliverables
  1. Problem Status Dashboard
  2. Problem Risk Assessment
  3. Root Cause Analysis Procedure

Workshop Day 4 (Onsite): Proactive Problem Management
Activities
  1. Information sources for problem identification
  2. Data quality
  3. Participants and agenda review for active regular data analysis
  4. Complete comprehensive visual SOP
Deliverables
  1. Visual SOP
  2. Monitoring and Alert Thresholds

Workshop Day 5 (Offsite): Document and Package Deliverables
Activities
  1. Finalize deliverables
  2. Support communication efforts
  3. Identify resources in support of priority initiatives
Deliverables
  1. Executive Presentation / Business Case for Problem Management

Incident and Problem Management

PHASE 1

Identify and Manage Major/Critical Incidents

Step 1.1: Identify Critical Incidents

PHASE 1
PHASE 2
PHASE 3
1.1 1.2 2.1 2.2 3.1 3.2
Identify Critical Incidents Develop a Critical Incident Workflow Identify Problems Develop a Problem Management Workflow Predict Problems Communicate the Importance of Proactivity

This step will walk you through the following activities:

  • Define the benefits of a critical incident management procedure.
  • Brainstorm critical applications.
  • Assign severity levels to incidents through a voting activity.
  • Present the results of the voting exercise to all of the participants and discuss them.
  • Develop a list of exceptions to the critical incident classification scheme.

This step involves the following participants:

  • Incident manager
  • Service desk manager
  • Service desk technicians

Outcomes of this step

  • A list of the benefits of critical incident management
  • A list of critical applications
  • List of incidents sorted by severity

Identify incidents

ITIL’s definition of an incident:

“…any event which is not part of the standard operation of a service which causes, or may cause the interruption to, or a reduction in, the quality of that service.” (Source: ITIL Best Practice for Service Delivery, 314)

Common IT Incidents:

  • Account lockouts
  • Email issues
  • Hardware issues (phone, laptop not working)
  • Printer issues
  • Slow internet

Major/critical incidents:

What constitutes a major incident will vary based on each organization’s particular characteristics. In broad terms, an incident is major if it impacts a large number of users, causes a significant disruption, and/or lasts a long time, while the IT infrastructure as a whole is still working (that is, the incident is not a disaster). According to ITIL, all involved must agree that the incident is major, there must be a separate procedure, and there must be clear responsibility and a review process.

Major/critical IT incidents:

  • Enterprise-wide network outage
  • Server down

Info-Tech Best Practice

Ideally, organizations should institute a separate procedure for responding to major incidents that involves a separate incident management team and a shorter timescale.

Understand the incident lifecycle

Incidents have a seven-step lifecycle.

  1. Incident detection and recording
  2. Initial classification and support
  3. Escalation to a major incident process where needed
  4. Invocation of the service request process if not an incident
  5. Investigation and diagnosis
  6. Resolution and recovery
  7. Incident closure

Outline the case for an effective critical incident management process

Critical incident management is an essential component of IT service management.

Benefits for IT

  • Critical incidents draw significant attention. An effective management process will demonstrate IT competence where the most important stakeholders are going to see and take notice.
  • A clear escalation path streamlines the resolution of incidents within IT and prevents incident tickets from bouncing unnecessarily between teams.
  • Effective crisis communications can significantly reduce the number of calls to the service desk about ongoing IT issues, reducing operating costs and saving time.

Benefits for the business

  • Critical incidents are expensive: an effective incident management procedure ensures a right-size response that can save the organization money.
  • A good critical incident management process involves metrics agreed upon by both the IT department and representatives from the business. Cooperation ensures that business needs are the primary concern when the critical incident management process is invoked.
  • In sum: the business stands to gain a lot from its involvement in the critical incident management process.

Info-Tech Insight

It’s not always obvious. IT and the business might have entirely different ideas about what constitutes a critical incident. Come to an agreement with representatives from the business about what constitutes a critical incident before such an incident strikes.

Define the IT benefits and business benefits of critical incident management for your organization

Associated Activity icon 1.1.a 30 minutes

INPUT: Brainstorming session, Section 1.1

OUTPUT: List of critical incident management benefits

Materials: Whiteboard, markers

Participants: Incident manager, Service desk manager, Service desk technicians

Critical incident management benefits the organization as a whole.

  • Gather employees familiar with the incident management process into a conference room for a short whiteboard activity.
  • Brainstorm some of the benefits of critical incident management. Go around the room and have everyone share an example of a benefit or a cost of critical incident management.
  • Draw a chart like the one pictured below. Insert a point form summary of each cost and benefit that participants in the workshop identify. Also identify if that cost/benefit applies more to IT or to the business in general.

Cost/Benefit | IT | Bus.
A more regimented process means that service technicians have to spend less time sweating about how to categorize particular incidents. | X |
A dedicated major incident manager would need to be hired, or, barring that, the responsibility would fall on someone already in IT, reducing the amount of time they can devote to project work. | X | X

List and challenge critical IT services

Associated Activity icon 1.1.b 30 minutes

INPUT: List of applications

OUTPUT: List of critical applications

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager, Business representatives

When a critical service goes down, the normal incident management process is not enough.

  • Incident criticality is strongly dependent on the nature of the work that an organization does.
  • Have each participant outline an application that is critical to the organization’s business function.
  • Questions to ask:
    • If this application goes down for an extended period, can the business survive?
    • How many people would be affected by an outage?
    • How much will this particular application’s outage cost the company if it is not controlled quickly?
  • Record the list of critical applications in section 3.1 of the Critical Incident Management SOP.

Info-Tech Insight

The worst time to have a debate about whether an application is critical or not is right after it has failed. Disagreement about an incident’s severity can eat up precious time during a crisis. Consensus beforehand will help you to implement a systematic process when it is needed most.

Sort tickets into buckets based on their response resources, response levels, and platforms

Associated Activity icon 1.1.c 1 hour

As part of your triage process, sort incident tickets based on who will be responsible for resolving them above tier 1.

  1. The broadest classification should be by type.
    Example: software, hardware, applications, infrastructure.
  2. The next tier of classification should be by category.
    Example: hardware › fax machine, printer, video conferencing equipment.
  3. The final tier of classification should be by sub-category, where applicable.
    Example: hardware › printer › laser, inkjet, multi-function
    • Note: the lowest tier of classification is not always necessary. If you have only one type of fax machine, for example, there is no need for additional details.
  4. Input the results of the exercise into section 3.2 of the Critical Incident Management SOP.
Type | Category | Sub-category
Hardware | Printer | Laser
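
The three-tier classification above can be sketched in code. The following is a minimal illustration, not part of the Critical Incident Management SOP: the `TicketClass` structure, the `RESPONDERS` routing table, and the team names in it are all hypothetical, and the "most specific match wins" rule is an assumption about how a classification might map to a responsible team above tier 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TicketClass:
    """Type > category > sub-category, where the lowest tier is optional."""
    ticket_type: str
    category: str
    sub_category: Optional[str] = None

    def parts(self) -> list:
        items = [self.ticket_type, self.category]
        if self.sub_category:
            items.append(self.sub_category)
        return items

    def path(self) -> str:
        return " > ".join(self.parts())


# Hypothetical routing table: team names are placeholders, not from the SOP.
RESPONDERS = {
    ("Hardware",): "Infrastructure team",
    ("Hardware", "Printer"): "Desktop support",
    ("Hardware", "Printer", "Laser"): "Print vendor liaison",
}


def responder_for(ticket: TicketClass) -> str:
    """Return the team for the most specific classification that has an entry."""
    key = ticket.parts()
    while key:
        team = RESPONDERS.get(tuple(key))
        if team:
            return team
        key.pop()  # fall back to the next broader tier
    return "Service desk lead"  # assumed default when nothing matches
```

A fax machine ticket with no sub-category, for example, would fall back to the entry for the "Hardware" type, which mirrors the note that the lowest tier is not always necessary.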

Sample of Info-Tech's 'Critical Incident Management SOP'.

Incident severity is a function of its impact and urgency

All incidents are not created equal. Incidents that are both high impact and high urgency are critical/major.

1: System Down
  Description:
  • Critical system is down
  • Many users affected
  • Critical applications are unavailable
  Example: The company’s internet goes down. Communication with the wider world becomes but a dream.
2: System Degraded
  Description:
  • Performance is very slow
  • Certain functionality is not available
  Example: The network between two company locations is down.
3: Urgent
  Description:
  • Several users affected
  • No workaround
  Example: The wireless connection is down in the graphic design department. Nobody has Ethernet ports in their MacBooks.
4: Normal
  Description:
  • One or more users affected
  • Basic functionality available with some restriction
  • Workaround available
  Example: A printer is down. The service desk technician can reroute affected users to another printer on a different floor.
5: Service Request
  Description:
  • Functionality unaffected
  • Service requests
  • Work orders
  Example: A user puts in a ticket for a new, ergonomic keyboard that they read about on the Internet.

*Note: Severity levels 1 and 2 are considered critical incidents. Severity level 5 (service requests) is out of scope for incident management. Severity levels 3 and 4 do fall under incident management, but because they are routine incidents, the service desk generally handles them.

Assign measures of severity to incidents

Severity is a function of the incident’s urgency and its impact on business operations.

  • Incidents that are categorized as severity 1 and 2 present an immediate threat to the business and need to be escalated immediately.
  • Incidents sorted into severity 3, 4, or 5 are handled directly by tier 1 service desk support following the typical incident management/service desk process.
A chart titled 'Incident Severity' with severity levels 1 to 5 (as seen previously) mapped onto a graph. The x-axis is 'Urgency' with values 'Low', 'Medium', 'High', and 'Critical'. The y-axis is 'Impact' with values 'Localized', 'Moderate', 'Significant', and 'Extensive'. A localized, low urgency incident is a 5; An extensive, critical urgency incident is a 1; A localized, critical urgency incident is a 3; An extensive, low urgency incident is a 3.

“It’s better to call a false positive than a false negative when sorting incidents. But make sure to iterate on your process – too many false positives and people will stop taking severity 1 seriously.” (Steven Ingram, Data Engineer, Wave HQ)

Use appropriate measures of impact and urgency

Urgency is a measure of how quickly a resolution for the incident is required.

Impact is a measure of how extensive the potential damages are to the business.

Impact

Localized: The incident affects a single person or is confined to a small area.

Moderate: The incident affects a small portion of the organization.

Significant: The incident affects much of the organization, such that it is widely noticed. Client service is disrupted.

Extensive: The incident affects most or all of the organization; it is unavoidable. Client service is dramatically interrupted.

Urgency

Low: Incident resolution is not required immediately. A workaround exists, and the outage is insignificant for business functions.

Medium: Incident resolution should be completed as soon as possible, but a workaround does exist. Some business disruption.

High: The incident needs to be resolved as soon as possible. Delay could cost the business substantially.

Critical: If the incident is not resolved as soon as possible, business viability is directly threatened. An all-hands effort is necessary to resolve the incident as quickly as possible.
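
To make "severity is a function of impact and urgency" concrete, here is a minimal sketch of one possible mapping. It reproduces the four corner values described for the Incident Severity chart (localized/low is 5, extensive/critical is 1, localized/critical is 3, extensive/low is 3). Every cell in between is an assumption made for illustration: the sketch simply adds the two scale positions and looks up a severity, which is not necessarily how the Incident Severity Assessment Tool fills in the grid.

```python
# Scales from the definitions above, ordered from least to most severe.
IMPACT = ["Localized", "Moderate", "Significant", "Extensive"]
URGENCY = ["Low", "Medium", "High", "Critical"]

# Combined score (impact position + urgency position) -> severity level.
# Scores 0, 3, and 6 match the chart's corner values; the entries for
# scores 1, 2, 4, and 5 are assumptions chosen to interpolate between them.
SEVERITY_BY_SCORE = {0: 5, 1: 4, 2: 4, 3: 3, 4: 2, 5: 2, 6: 1}


def severity(impact: str, urgency: str) -> int:
    """Return a severity from 1 (system down) to 5 for an impact/urgency pair."""
    score = IMPACT.index(impact) + URGENCY.index(urgency)
    return SEVERITY_BY_SCORE[score]


def is_critical(level: int) -> bool:
    """Severity levels 1 and 2 are treated as critical incidents."""
    return level in (1, 2)
```

Under these assumptions, a significant, high-urgency incident scores 4 and lands at severity 2, so it would invoke the critical incident workflow. Adjust the lookup table to match whatever grid your group agrees on.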

Info-Tech Insight

As an IT professional, you are exposed to the inner workings of the IT department in a way that most of your colleagues in different departments are not. Understand that what seems routine to you might threaten the job security of others.

Assign severity levels to incidents through a voting exercise

Associated Activity icon 1.1.d 1 hour

INPUT: Recent incident tickets

OUTPUT: List of critical systems

Materials: Whiteboard, Markers, Paper, Pen

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager

Leverage the organization’s IT service management (ITSM) software to compile a list of recent incident tickets.

  1. Use records of tickets submitted to IT to compile a list of incidents that the organization has been recently confronted with.
  2. The facilitator of the activity should go over a list of tickets with the group, writing each unique ticket on a whiteboard (and eliminating duplicates in the process).
  3. Each participant should write the final list on a piece of paper and take a few minutes to privately evaluate the impact and urgency of each incident (separately), recording each value next to the incident on the sheet of paper.
Incident | Impact | Urgency
Printer failure | Localized | Low

Interpret the results of the secret voting exercise

Supporting Tool icon 1.1.e Incident Severity Assessment Tool

Record the results of the voting activity.

  • Input the list of incidents and the names of each participant in the Incident Severity Assessment Tool’s second tab.
  • Collect the cards each person filled out during the exercise and input the results for each participant in the tool.
  • The result is an average assignment, ranging from critical (1) to minor (5). This facilitates general understanding of the group’s sentiment on each incident.
  • The fifth and final tab of the tool produces a breakdown of the results by participant, allowing the facilitator to discuss the presence of outliers that might skew the average one way or another.
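
The averaging and outlier review described above can be approximated outside the spreadsheet. The sketch below is not the Incident Severity Assessment Tool's logic: it assumes each participant's vote has already been reduced to a single severity from 1 to 5, and the `spread` threshold used to flag outliers is a hypothetical parameter chosen for illustration.

```python
from statistics import mean


def summarize_votes(votes: dict, spread: float = 1.5) -> dict:
    """Average the severity votes per incident and flag dissenting participants.

    votes maps incident name -> {participant name: severity from 1 to 5}.
    A participant is flagged when their vote differs from the group
    average by at least `spread` (an assumed threshold).
    """
    summary = {}
    for incident, by_participant in votes.items():
        average = mean(by_participant.values())
        outliers = sorted(
            name
            for name, level in by_participant.items()
            if abs(level - average) >= spread
        )
        summary[incident] = {"average": round(average, 2), "outliers": outliers}
    return summary


# Hypothetical votes from three participants.
example = summarize_votes({
    "Enterprise-wide network outage": {"Ana": 1, "Ben": 1, "Cy": 2},
    "Printer failure": {"Ana": 4, "Ben": 4, "Cy": 1},
})
```

In this made-up data, the printer failure averages to severity 3 only because one participant voted 1. Surfacing that outlier gives the facilitator a starting point for the discussion in the next activity.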

Sample of Info-Tech's 'Incident Severity Assessment Tool'.

Present and discuss the results of the Incident Severity Assessment Tool

Associated Activity icon 1.1.f 1 hour

INPUT: Results of the Incident Severity Assessment Tool

OUTPUT: List of critical incidents

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager

Critical incidents need to be managed with a separate process. Define those incidents here.

  1. On a whiteboard, list every incident that at least one participant labelled severity 1 or 2 (critical).
  2. Where there is disagreement, have proponents of each different severity classification articulate the reason they hold the views they do.
  3. Refer back to the list of critical applications compiled in step 1.1.b if it is difficult to come to an agreement about what constitutes a critical incident. Ensure that measures of criticality align to the applications identified there.
    • Ultimately, this exercise should produce a list of incidents that the group is satisfied fall into the critical category.
  4. Sort the incidents into two groups: critical/non-critical.

Ensure that lines of authority are clear in the event there is a prioritization conflict.

Set incident management service level agreements

Associated Activity icon 1.1.g 20 minutes

INPUT: Incident severity assessment

OUTPUT: Incident management service level agreements

Materials: Whiteboard, Critical Incident Management SOP

Participants: Incident manager, Problem manager, Business representatives

The ITSM team walks a fine line – avoiding escalation unless necessary, but, when it is required, escalating quickly.

  • Once you have outlined the prioritization scheme for incidents, set reasonable targets for response time.
  • All incident reports should come in through the service desk, and all staff – including tier 1 – should have the opportunity to resolve the incident, even if it is critical.
  • Go around the room, and discuss appropriate response and escalation times with participants, paying special attention to the experienced members in the room. Input the results in a table like the one pictured below. Include the results in section 5.2 of the Critical Incident Management SOP.
Priority | Time to Respond | Time to Escalate
Severity 1 | 15 minutes | 30 minutes
Severity 2 | 4 hours | 5 hours
Severity 3 | 8 hours | 10 hours
Severity 4 | 24 hours | 36 hours
Severity 5 | 48 hours | 54 hours
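
Once targets like these are agreed, they can be checked mechanically. The following is a minimal sketch that encodes the sample table above and reports where an open ticket stands. The function name, its status strings, and the rule that an unresolved ticket should be escalated once the escalation time passes are assumptions for illustration; substitute the targets your group records in the SOP.

```python
from datetime import datetime, timedelta

# (time to respond, time to escalate) per severity, from the sample table.
SLA_TARGETS = {
    1: (timedelta(minutes=15), timedelta(minutes=30)),
    2: (timedelta(hours=4), timedelta(hours=5)),
    3: (timedelta(hours=8), timedelta(hours=10)),
    4: (timedelta(hours=24), timedelta(hours=36)),
    5: (timedelta(hours=48), timedelta(hours=54)),
}


def sla_status(severity: int, opened_at: datetime, now: datetime,
               responded: bool = False) -> str:
    """Report the SLA state of a ticket that is still open at `now`."""
    respond_target, escalate_target = SLA_TARGETS[severity]
    if now >= opened_at + escalate_target:
        return "escalate"  # still unresolved past the escalation target
    if not responded and now >= opened_at + respond_target:
        return "response overdue"
    return "within target"


# Example: a severity 1 ticket opened at 09:00 with no response by 09:20.
opened = datetime(2024, 1, 1, 9, 0)
status = sla_status(1, opened, opened + timedelta(minutes=20))
```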

Define exceptions

Associated Activity icon 1.1.h 1 hour

INPUT: Results of the Incident Severity Assessment Tool

OUTPUT: Amended list of critical incidents

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager

Certain roles, departments, or equipment might require more urgency

  1. Brainstorm potential exceptions to the critical/non-critical classification procedure identified in step 1.1.f.
  2. Include the results of this exercise in section 3.5 of the Critical Incident Management SOP.
Questions Examples
Which roles/individuals will require faster response times due to the nature of the work they do? —› Executives, sales representatives
Which departments do you consider to be highest priority within the organization? —› Sales, classroom technology
Is there any hardware that could impact the business, necessitating priority? —› Shipping printer, fax machine
Are there any business functions that will change priority based on timing? —› Payroll software near payday, accounting at the end of the fiscal year

Step 1.2: Develop a Critical Incident Workflow


This step will walk you through the following activities:

  • Adapt a knowledgebase article template.
  • Define the role of the major/critical incident manager.
  • Outline who will need to be contacted in the event of a critical incident.
  • Compile a list of relevant vendors.
  • Develop key performance indicators to track the effectiveness of your major/critical incident management procedures.

This step involves the following participants:

  • Incident manager
  • Service desk manager
  • Service desk technicians

Outcomes of this step

  • Incident manager job description
  • Critical incident management standard operating procedure

Align your critical incident management process within your IT service management processes

Critical Incident and Problem Management is part of Info-Tech’s series on IT service management processes.

This storyboard will help you integrate your major-incident-management procedure into your existing incident response workflow. For help with managing normal incidents and service requests, see Standardize the Service Desk. For help with responding to catastrophic incidents, see Create a Right-Sized Disaster Recovery Plan.

Because of their severity, major/critical incidents will follow a different workflow

A sample workflow that would not support a major/critical incident, with a branch off of 'Determine severity & categorize' that ends at 'Follow Critical Incident Process'.

Major/critical incidents will follow a different workflow (cont.)

Once a critical incident has been established, engage the critical incident process. Ensure that it is optimized for speedy resolution.

A workflow specifically for major/critical incidents that starts at 'Critical Incident Reported'. If the issue is resolved, 'Initiate Post-Mortem', 'Close ticket', and 'Update Documentation'. If it is not resolved, 'Execute disaster recovery plan' before the 'Close ticket' step.

Info-Tech Insight

The more quickly critical incidents are resolved, the less money incidents will cost the organization. Develop a knowledgebase and move towards first call resolution.

Evaluate incoming incident tickets by their level of severity, and consult the knowledgebase

Ensure that your service desk – the primary point of contact for end users – is equipped to prioritize and escalate critical incidents.

  • Tier 1 is at the front line of your organization’s support operation. Ensure that your technicians recognize the significance of critical incidents, and assign each incoming incident report the appropriate level of severity.
  • If the incident affects one of the critical IT services identified in step 1.1 and recorded in the Critical Incident Management SOP, initiate the major/critical incident management process.
  • Technicians should also consult the knowledgebase to determine whether a solution to the incident has already been documented. If it has, the technician should escalate to the appropriate technician. If not, the technician should escalate to the service desk lead, who will initiate the communication plan.

A snippet of a flow diagram beginning with 'Critical incident reported', then 'Cause known?', if No 'Escalate to service desk lead', if Yes 'Escalate to technician'.
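
The "Cause known?" branch in the diagram can be expressed as a small routing function. This is a sketch only: it assumes the knowledgebase can be searched by keyword sets, which is a simplification of whatever search your ITSM software provides, and the function names and sample entry are hypothetical.

```python
from typing import Optional


def find_documented_solution(summary: str, knowledgebase: list) -> Optional[str]:
    """Return the first solution whose keywords all appear in the summary.

    knowledgebase is a list of (keyword set, solution text) pairs.
    """
    words = set(summary.lower().split())
    for keywords, solution in knowledgebase:
        if keywords <= words:
            return solution
    return None


def route_critical_incident(summary: str, knowledgebase: list) -> tuple:
    """Return (escalation target, documented solution or None)."""
    solution = find_documented_solution(summary, knowledgebase)
    if solution is not None:
        # Cause known: hand off to the technician who can apply the fix.
        return "technician", solution
    # Cause unknown: the service desk lead initiates the communication plan.
    return "service desk lead", None


# Hypothetical knowledgebase entry for illustration.
kb = [({"vpn", "down"}, "Restart the VPN concentrator")]
known = route_critical_incident("VPN down for all remote staff", kb)
unknown = route_critical_incident("Email server unreachable", kb)
```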

Info-Tech Insight

Tier 1 service desk technicians might be at the bottom of the organizational hierarchy, but they are crucial to efficient major/critical incident management. Empower these technicians to systematically declare incidents critical/non-critical.

Adapt a knowledgebase article template

Associated Activity icon 1.2.a 1 hour

The purpose of incident management is the timely resolution of incidents and the restoration of service.

  • To speed up incident resolution time, refer to the knowledgebase. To do this, however, it is necessary to construct an effective knowledgebase. Use Info-Tech’s Knowledgebase Article Template in concert with your ITSM software to document all relevant information about each incident, from reporting to resolution.
  • One of the tragedies of ineffective incident management is the failure to use existing knowledge to resolve incidents. Reduce the mean time to resolution by documenting root causes and providing accessible summaries of the solutions to problems.

Info-Tech Insight

Tier 1 and tier 2 technicians are held back by a lack of institutional knowledge. Codify that knowledge in the knowledgebase and increase their productivity.

Sample of Info-Tech's 'Knowledgebase Article Template'.

Info-Tech Insight

When developers roll out new projects, they should seed the knowledgebase with 5-10 likely incidents to streamline incident management when those incidents come up.

Adapt a knowledgebase article template (cont.)

Associated Activity icon 1.2.a 1 hour

Make sure that your knowledgebase articles have all of the following information:

  • Description of the incident
  • Known causes
  • Solution
  • Document owner
  • Prerequisites for resolution
  • Keywords and tags
  • Number of page views
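
If your ITSM software lets you export or script against articles, a completeness check along these lines can catch gaps. The sketch below is not Info-Tech's Knowledgebase Article Template: the class and field names are hypothetical, chosen to mirror the list above, and treating an empty prerequisites list as "missing" is an assumption you may want to relax.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgebaseArticle:
    description: str
    known_causes: list
    solution: str
    document_owner: str
    prerequisites: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    page_views: int = 0  # a counter, so it is not part of the completeness check

    def missing_fields(self) -> list:
        """List the fields from the checklist that are still empty."""
        checked = ["description", "known_causes", "solution",
                   "document_owner", "prerequisites", "keywords"]
        return [name for name in checked if not getattr(self, name)]

    def record_view(self) -> None:
        """Count one access, supporting the page-view metric."""
        self.page_views += 1


# A hypothetical draft article that has no solution documented yet.
draft = KnowledgebaseArticle(
    description="Shipping printer offline",
    known_causes=["Paper jam"],
    solution="",
    document_owner="J. Doe",
)
gaps = draft.missing_fields()
```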

Sample of Info-Tech's 'Knowledgebase Article Template' with notes. Note on the 'Short Description of the Issue' box is 'Describe the issue in plain language'. Note on the table of details is 'ITSM software should automatically record a variety of relevant details'.

Info-Tech Best Practice

Provide attribution to the authors of knowledgebase articles. You should also keep track of a variety of metrics including how helpful the article was to technicians, and how frequently it is accessed, potentially with some sort of incentive tied to performance.

Create a communication plan

The incident manager is responsible for three relationships:

  1. The relationship among the network of IT professionals responsible for resolving the incident and bringing critical applications back online.
  2. The relationship between the IT department and the business, which has an interest in the incident being resolved as quickly as possible in order to minimize the potential losses associated with it.
  3. The relationship between the end users who might be experiencing an interruption and the help desk, which, during the interruption, is likely to experience a higher-than-normal volume of calls.

Establish an incident management communication plan

  1. Incidents are solved when technical support staff collaborate in a war-room-type environment. It is the incident manager’s job to facilitate this communication virtually or otherwise. This could be done through a conference bridge, but other potential options include a dedicated instant messaging platform, or a physical conference. Record the details of the collaboration platform (conference bridge number, etc.) in the Incident Management SOP. This conference bridge is reserved for technical support workers and vendors, if the incident is escalated externally.
  2. When an incident is deemed critical, the affected business units need to be kept aware of how the resolution is progressing. Set up a separate conference bridge where the incident manager can keep non-technical executives abreast of the technical team’s progress.
  3. During a severity 1 or 2 incident, expect the service desk to be flooded with calls. The incident manager should prepare a broadcast to indicate to end users that no technicians are currently available, since all hands are on deck to resolve the major incident. He or she should communicate that the support team has received information about the incident and is working to remedy it.

Info-Tech Insight

Social media platforms like Twitter can be used to provide candid one-way communication to a large number of users. However, be careful about who outside of the company has access to information about outages!

Info-Tech Insight

The incident manager is a switchboard operator. The non-tech staff don’t want access to the technical deliberations, they just want to be kept in the loop. The incident manager is responsible for that communication so that the tech team can focus.

Understand how incident response escalated to tiers 2 and 3 fits into your organization’s hierarchy

During the course of their typical duties, workers responsible for handling escalated incidents do not generally report to the service desk manager.

A hierarchy tree with 'CIO' at the top, then 'Service Desk Manager', 'Infrastructure Manager', and 'Applications Manager'. Tier 3 and Tier 2 (Specialists) work under the Infrastructure and Applications Managers, but Service Desk has its own separate hierarchy with Tier 2 and Tier 1 Technicians.

Tiers of technical incident management response

Tier Duties Example
Tier 1 Frontline service, including answering phones, addressing walk-ins, assigning tickets and escalating.
  • Service desk technician
  • Help desk support staff
Tier 2 More senior incident response, though not specialists. Tier 2 provides desk-side support, all of the capabilities of tier 1, with a focus on technology issues that require deeper knowledge and narrower expertise.
  • Senior service desk technician
  • Help desk supervisor
Tier 2 (Specialist) Reports to the infrastructure manager or the applications manager. Tier 2 specialists are called in when resolving an incident requires permissions or expertise that the help desk does not have.
  • Systems administrator
  • App developer
Tier 3 Handles the most challenging requests. Like tier 2 specialists, tier 3 staff are not dedicated completely to support functions.
  • Senior app developer
  • Network engineer

Define the role of the major/critical incident manager

Supporting Tool icon 1.2.b Use Info-Tech’s Incident Manager job description template

An incident manager is necessary to coordinate the incident resolution process.

  • Use Info-Tech’s Incident Manager job description template as a starting point to collect the roles, responsibilities, and qualifications of an incident manager in a simple, easily accessible format.
  • Depending on the size of the organization, fill in the blank entries including required education and experience, and amend the responsibilities as necessary.
  • For larger organizations, it might be appropriate to split out the role of the major incident manager from the incident manager. This would involve a specific focus on incidents at severity levels one and two.

Sample of Info-Tech's 'Incident Manager' job description.

Info-Tech Best Practice

The incident manager does not have to be a full-time role. Especially in smaller organizations, the incident manager may be an additional role of the service desk manager or another IT professional. It is essential, however, that the incident manager be separate from the problem manager.

Cox Enterprises reduces critical incident resolution time by 75% by streamlining their incident management process

CASE STUDY

Industry: Communications
Source: xMatters, Cox Enterprises

Challenge

Cox Enterprises, headquartered in Georgia, has operations in communications, media, and the automotive sector. One of the largest privately held companies in the United States, Cox has approximately 60,000 employees and some $18 billion in annual revenue. 1,300 of those employees are in IT, and when IT infrastructure suffered a major incident, all 1,300 were notified via email. A flurry of unnecessary emails reduced the effective response rate, and the IT department often had to manually track down people capable of resolving the incident, a process that took more than 30 minutes.

Solution

Cox improved their incident management process, streamlining the escalation process. At the same time, they implemented a software solution that allowed customers, executives, and business partners to subscribe to specific events that affected them.

Results

The response rate for critical incidents improved by almost 100%, and the mean time to resolution for serious and critical incidents was reduced by more than 75%. IT operations no longer has to broadcast unnecessary information to uninterested and irrelevant parties.

Map out who will need to be contacted in the event of a critical incident

Associated Activity icon 1.2.c 10 minutes per incident

INPUT: List of critical incidents, List of stakeholders

OUTPUT: List of necessary contacts by incident

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager

  1. Divide a whiteboard into two even sections with a vertical line. On the left side, write the list of critical incidents identified in section 1.1.e.
  2. For each identified incident, have participants list individuals, by name and title, who would need to be contacted in the event that incident occurred.
  3. Some questions that will assist the facilitator:
    • Who has the skills or authority necessary to resolve the incident?
    • Who are the managers whose work will be interrupted, either by their services going down, or by their workers having to drop everything to solve the incident?
    • What would happen if we didn’t notify this person?
  4. Once you have arrived at a list for each incident, sort the necessary contacts into one of two buckets: technical or business.
Incident Contact
Email is down Network engineer (technical)
CMO (business)

Supporting Tool icon Insert the list of contacts into section 5.2 of the Critical Incident Management SOP

Compile a list of relevant vendors

Associated Activity icon 1.2.d 30 minutes

INPUT: List of critical applications

OUTPUT: List of relevant vendors

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager

If your organization’s internal tech support cannot resolve the incident, it may be necessary to involve vendors.

  1. Using the list of critical applications developed in step 1.1.b, develop a list of crucial vendors.
  2. Compile a record of the basic service agreements with each vendor, including costs, time to respond, etc.
  3. Go around the room and ask participants if there are any details about interactions with specific vendors that need to be recorded in the SOP for later use.
  4. Finally, record the name and title of the contract owner (named users, lead account users) and backup contract owner (“if Bob is on vacation, Jill can talk to Cisco”).
  5. Record this information in the Critical Incident Management SOP.

Info-Tech Insight

Escalating to the vendor level is among the most expensive support options. It can cost more than four times as much as internal tier 3 support. Where possible, avoid involving the vendor. But where vendor involvement is necessary, initiate it as soon as possible.

Engage in a post-mortem analysis once the incident is resolved and update the knowledgebase

The goal of major incident management is to restore service as quickly as possible.

  • Identify the full extent of the service disruption.
  • Document the solution that allowed you to close the incident ticket.
  • Engage in root cause analysis to determine the origin of the critical incident. Without root cause analysis, you may find yourself confronting a stream of avoidable incidents.
  • Include the problem in the problem register.
  • Initiate the appropriate changes to reduce the likelihood that the incident will reoccur, and document those changes for the knowledgebase.
  • For details on how to conduct root cause analysis, see phase 2 of this deck.

An example of a major incident post-mortem. A note points to the final 'Why?': 'This should be your selected root cause. It may not always take Five Whys - it could be more or less depending on the problem. The logic flow should always take you from the problem to the root cause.'

Info-Tech Insight

If, even with escalation to tier 3 and to vendors, the incident cannot be resolved, engage your organization’s business continuity plan.

Develop key performance indicators to track the effectiveness of your major/critical incident management procedures

Associated Activity icon 1.2.e 1 hour

INPUT: Current accounting of metrics

OUTPUT: List of incident management KPIs

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager, Business representatives

Demonstrate the value of your incident management process by tracking key performance indicators (KPIs)

  1. KPIs are going to vary by organization.
  2. Brainstorm a list of KPIs that reflect the broader purpose of incident management: to maximize uptime and minimize the cost of interruptions.
  3. Questions to ask participants when developing a list of KPIs:
    • What are the results of good incident management?
    • How can we improve our incident management process in a realistic, achievable way?
    • How well are we accomplishing our goals?
    • How can we quantify this?
  4. Record the results of the brainstorming activity on a whiteboard in a table with the following headings: “Key Performance Indicator,” “Current State,” and “Goal.”
  5. Include the list of KPIs in the Critical Incident Management SOP.

Develop key performance indicators

Example

Key Performance Indicator Current State Goal State
Percentage of incidents resolved at tier 1. 62% 80%
Mean time to resolution 23 minutes 17 minutes
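
As a rough illustration of how the two example KPIs above could be calculated from ticket data, the Python sketch below assumes a hypothetical export of resolved incidents, each recording the tier that resolved it and the minutes to resolution. The field names and sample values are assumptions for illustration only and do not come from any particular ITSM tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ResolvedIncident:
    # Hypothetical fields; map them to whatever your ITSM export provides.
    ticket_id: str
    resolved_at_tier: int
    minutes_to_resolve: float


def tier1_resolution_rate(incidents):
    """Percentage of incidents resolved at tier 1."""
    if not incidents:
        return 0.0
    tier1 = sum(1 for i in incidents if i.resolved_at_tier == 1)
    return 100.0 * tier1 / len(incidents)


def mean_time_to_resolution(incidents):
    """Mean minutes from ticket open to resolution."""
    if not incidents:
        return 0.0
    return mean(i.minutes_to_resolve for i in incidents)


# Illustrative sample data only.
sample = [
    ResolvedIncident("INC-1", 1, 10),
    ResolvedIncident("INC-2", 1, 20),
    ResolvedIncident("INC-3", 2, 30),
    ResolvedIncident("INC-4", 1, 40),
]

print(f"Resolved at tier 1: {tier1_resolution_rate(sample):.0f}%")
print(f"Mean time to resolution: {mean_time_to_resolution(sample):.0f} minutes")
```

Recomputing the same figures on a regular schedule gives the "Current State" column a consistent basis for comparison against the goal.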

Info-Tech Insight

Make sure that key performance indicators reflect the underlying concept you are trying to measure. Ask yourself if the KPIs you have come up with can be reduced. Example: the number of people involved in each incident resolution is notable because it reflects the cost of incident management.

If you want additional support, have our analysts guide you through this phase as part of an Info-Tech workshop

Associated Activity icon Book a workshop with our Info-Tech analysts:

Photo of an Info-Tech analyst.
  • To accelerate this project, engage your IT team in an Info-Tech workshop with an Info-Tech analyst team.
  • An Info-Tech analyst will join you and your team onsite at your location or welcome you to Info-Tech's historic Toronto office to participate in an innovative onsite workshop.
  • Contact your account manager (www.infotech.com/account), or email Workshops@InfoTech.com for more information.

The following are sample activities that will be conducted by Info-Tech analysts with your team:

1.1

Assign severity levels to incidents through a voting exercise

Sample of activity 1.1 'Assign severity levels to incidents through a voting exercise'.

The analyst will guide workshop participants through a secret voting exercise, designed to sort incidents by their impact and urgency. This exercise will result in a list of incidents sorted by their severity.

Phase 1 Guided Implementation

Associated Activity icon Call 1-888-670-8889 or email GuidedImplementations@InfoTech.com for more information.

Complete these steps on your own, or call us to complete a guided implementation. A guided implementation is a series of 2-3 advisory calls that help you execute each phase of a project. They are included in most advisory memberships.

Guided Implementation 1: Identify and manage major/critical incidents

Proposed Time to Completion: 3 weeks
Step 1.1: Identify critical incidents
Start with an analyst kick-off call:
  • Outline the potential benefits of a critical incident management procedure.
Then complete these activities…
  • Define the benefits of a critical incident management procedure.
  • Brainstorm critical applications.
  • Assign severity levels to incidents through a voting activity.
  • Present and discuss the results of the voting exercise to all of the participants.
  • Develop a list of exceptions to the critical incident classification scheme.
With these tools & templates:
  • Critical Incident Management SOP
  • Incident Severity Assessment Tool
  • Knowledgebase Article Template

Step 1.2: Create a critical incident workflow
Review findings with analyst:
  • Review the results of the voting exercise and the list of exceptions.
Then complete these activities…
  • Adapt a knowledgebase article template.
  • Define the role of the major/critical incident manager.
  • Outline who will need to be contacted in the event of a critical incident.
  • Compile a list of relevant vendors.
  • Develop key performance indicators to track the effectiveness of your major/critical incident management procedures.
With these tools & templates:
  • Critical Incident Management SOP
  • Incident Manager job description template

Phase 1 Results & Insights

  • Don’t risk muddling the chain of command during a crisis. Streamline the process. When senior technical staff are working on incidents, they report to the service desk manager.

PHASE 2

Develop Problem Management Procedures

Step 2.1: Identify Problems

PHASE 1: 1.1 Identify Critical Incidents, 1.2 Develop a Critical Incident Workflow
PHASE 2: 2.1 Identify Problems, 2.2 Develop a Problem Management Workflow
PHASE 3: 3.1 Predict Problems, 3.2 Communicate the Importance of Proactivity

This step will walk you through the following activities:

  • Define the problem manager’s role
  • Separate incidents from problems
  • Develop a process to match related incidents
  • Create a problem ticket

This step involves the following participants:

  • Problem manager
  • Service desk manager
  • Tier 2 service desk support
  • Tier 3 service desk support

Outcomes of this step

  • Problem manager job description
  • Separated list of incidents and problems
  • Incident matching process
  • Problem ticket template

Understand the problem lifecycle

  1. Problem Identified
  2. Problem Investigation Begins
  3. Workaround Identified
  4. Root Cause Identified
  5. Solution Identified
  6. Change Management Initiated
  7. Solution Implemented

Problem management workflow

An example problem management workflow, beginning at 'Incident Reported'. If it is not critical and non-recurring there is a shorter path to 'Close'.

Problem management cannot be separated from incident management

Firefighters will always be necessary, but wouldn’t it be nice if they had more time off?

Incident Management

Incidents are like fires, and the incident manager is like a firefighter. He or she is responsible for immediately putting out the blaze, protecting surrounding homes, and protecting the lives of residents.

Incident management and problem management are inseparable, but it is important to remember that incidents and problems are different: their management requires a different set of skills.

Problem Management

Problem managers are like fire chiefs, looking for the underlying causes of fires. They are responsible for investigating root causes, tracking incident metrics, noticing deviations, and championing the changes necessary to reduce incident occurrence and severity. The problem manager role is less like that of a firefighter, and more like that of a legislator who mandates fireproof insulation, or the fire marshal who limits occupancy.

Info-Tech Insight

Incident management and problem management require a different set of skills. Unlike the incident manager, the problem manager’s first goal is not necessarily bringing service back up immediately. This can lead to conflict where the relationship is not properly managed.

Understand the importance of reactive problem management

Incidents can be indicators of underlying problems. Incident resolution is, at best, a temporary fix.

  • In the wake of a particularly serious IT incident, or a rash of less serious IT incidents, problem management becomes imperative.
  • Without a rigorous problem management process – without a root cause analysis procedure – your organization will confront the same incidents over and over again in a game of ITSM Whack-a-Mole.
  • For dedicated support workers, this represents an increase in their daily workload and a lengthening of their incident backlog. For non-dedicated support – those devoted to project work in the applications or infrastructure silos – unnecessary incident management work means putting project work off to the side.
  • Stop whacking moles and unplug the machine. Engage in problem management to improve your organization’s efficiency.

Ensure that everyone involved in the problem management process has the authority to complete his/her tasks

It is not the problem manager’s job to find workarounds. He or she needs the resources to conduct exhaustive investigations.

  • Problem managers and the problem management team cannot do their jobs unless they have access to all of the data they need.
  • The problem manager is accountable for the discovery and resolution of problems, and he or she needs the resources to be able to carry this duty out.
  • To this end, the problem manager – which may not be a dedicated role, and may not even be a part of the dedicated support team – should have the authority to access the ITSM tool’s incident record, the ability to collect data during incidents (server logs, for example), and the authority to convene regular meetings where proactive problem management is undertaken.

Problem manager — Find a permanent solution to persistent problems

Incident manager — Find an immediate workaround, restore service as quickly as possible

Info-Tech Best Practice

It is unfair to hold a worker accountable for something that he or she cannot control. If the problem manager is going to be accountable for the problem management process, they need the authority to be able to confront IT problems.

Define the problem manager’s role

Supporting Tool icon 2.1.a Use Info-Tech’s Problem Manager job description

The problem manager need not be a dedicated role, especially in smaller organizations.

  • Because problem managers are not necessarily as concerned with the immediate restoration of service, the incident manager’s primary motivation, the two roles are frequently in conflict. For example, the incident manager wants to reboot a server and the problem manager wants to pull logs.
  • Some of the problem manager’s typical responsibilities include:
    • Updating the knowledgebase to keep track of known problems.
    • Facilitating root cause analysis sessions and reporting their results.
    • Analyzing incident trends to notice potential problems and mitigate their effects.
  • Info-Tech’s Problem Manager job description offers a simple template outlining the role and responsibilities of a problem manager. Customize it with your organization’s details.

Sample of Info-Tech's 'Problem Manager' job description.

Understand what constitutes a problem

Recurring incidents have the potential to imperil IT operations.

  • ITIL defines a problem as “the unknown cause of one or more incidents.”
  • The following are indicators of problems:
    • A set of recurring incidents.
    • A critical incident that impacts a large number of end users or has a highly disruptive and negative impact on the business.
    • System reports that identify abnormal operations or functionality.

A visualization of a large circle labelled 'Problem' connected to multiple 'Incidents' of a similar color.

“A problem is just an incident without a root cause that you know of.” (Steven Ingram, Data Engineer, Wave HQ)

Separate incidents from problems

Associated Activity icon 2.1.b 45 minutes

INPUT: List of common IT concerns

OUTPUT: Separated list of problems and incidents

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Tier 3 and tier 2 representatives, Service desk manager

Understanding how incidents and problems differ is fundamental to effective ITSM.

  1. Go around the room and encourage participants to list common issues that have been reported to them or that they have experienced personally. Record this list on a whiteboard.
  2. Once you have exhausted the list of IT issues, have all participants sort the issues into incidents and problems (define what justifies inclusion in the problem register).
  3. Questions to ask:
    • Should problem management be invoked for any critical system failure?
    • Should it be invoked in the event of a prolonged outage?
    • How impactful and frequent must the problem be to be included in the register?
  4. Every issue should be sorted into the problem category or the incident category. If there is significant disagreement, encourage the proponents of the different views to make their cases to the group. Include the results in section 4.1 of the Problem Management SOP.

Note: The purpose of this activity is to establish an understanding of the distinction between problems and incidents. Disagreement will help participants articulate why they classify issues as they do.

Develop a process to match related incidents

Associated Activity icon 2.1.c 1 hour

INPUT: Incident records from ITSM tool

OUTPUT: List of potential problems

Materials: Whiteboard, Markers

Participants: Infrastructure manager, Problem manager, Members of the problem management team

Conduct a facilitated discussion about the best methods for incident matching in the organization.

  1. Access incident reports for a manageable period before the meeting (depending on the number of incident reports your organization receives, this might be one day, several days, or a week or more), and distribute printouts detailing them to everyone in the room.
  2. Match incidents based on the following criteria:
    • Similar symptoms
    • Similar categories
    • User groups impacted
    • Common application
    • Common hardware
  3. Have each participant go through the incidents and try to sort them based on the categories identified above, or based on any categories that might be specific to your organization.
  4. Go around the room and look for patterns. Who grouped which incidents and why? What were some disagreements? Have participants defend their incident matching.
  5. Record the list of potential problems you have identified in section 4.2 of the Problem Management SOP.
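
The matching criteria in this activity can also be applied to an incident export before the meeting, as a starting point for the discussion. The sketch below is one possible approach, assuming hypothetical incident records with `category`, `application`, and `user_group` fields; the field names, sample records, and the two-incident threshold are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical incident records; the keys mirror some of the matching criteria.
incidents = [
    {"id": "INC-101", "category": "email", "application": "webmail", "user_group": "Sales"},
    {"id": "INC-102", "category": "email", "application": "webmail", "user_group": "Finance"},
    {"id": "INC-103", "category": "printing", "application": "print spooler", "user_group": "Sales"},
    {"id": "INC-104", "category": "network", "application": "vpn client", "user_group": "Remote staff"},
]

MATCH_FIELDS = ("category", "application", "user_group")


def match_incidents(records, fields=MATCH_FIELDS, min_size=2):
    """Group incident IDs by each shared attribute value.

    Only groups containing at least `min_size` incidents are returned,
    since a single incident cannot be "matched" to anything.
    """
    groups = defaultdict(list)
    for record in records:
        for field in fields:
            value = record.get(field)
            if value:
                groups[(field, value)].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_size}


candidates = match_incidents(incidents)
for (field, value), ids in sorted(candidates.items()):
    print(f"{field}={value}: {', '.join(ids)}")
```

Each group is only a candidate problem. Participants still need to judge whether the shared attribute is meaningful or coincidental.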

Covariation is necessary but insufficient to establish causation

Correlation might not equal causation, but establishing a correlation is the first step.

  • The idea that correlation does not imply causation is useful to remember: there can be covariation between two variables, but no clear causation.
  • Example: Divorce rates in Maine over the period from 2000 to 2009 correlate with the per capita consumption of margarine over the same period. Barring the discovery of some as yet unknown property of margarine, the likelihood that a causal relationship exists here is low.
  • A graph showing the 'Divorce rate in Maine correlated with Per capita consumption of margarine'. (Source: tylervigen.com)

  • With that said, the constant reminders that “correlation does not equal causation” should not obscure the fact that correlation is often the first step in establishing a causal relationship between two variables.
  • Example: Fire and smoke correlate. In order to establish that fire causes smoke, we must first ascertain that they co-vary – where there’s smoke, there’s fire.

“Empirically observed correlation is a necessary but not sufficient condition for causality or, colloquially: Correlation is not causation – but it sure is a hint.” (Edward Tufte, Emeritus Professor of Political Science, Statistics, and Computer Science, Yale University)
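
To make the idea of covariation concrete, the sketch below computes a Pearson correlation coefficient between two series. The weekly counts of change deployments and login incidents are invented for illustration. A coefficient near 1 or -1 is a hint that the pairing deserves investigation, not evidence of a cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly counts, for illustration only.
change_deployments = [0, 1, 0, 2, 1, 3]
login_incidents = [2, 5, 1, 9, 4, 12]

r = pearson(change_deployments, login_incidents)
print(f"Correlation between deployments and login incidents: {r:.2f}")
```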

Create a problem ticket

Supporting Tool icon 2.1.c Use Info-Tech’s Problem Ticket Template

When a problem is detected, a ticket should be opened, similar to an incident ticket.

  • Once the problem management team has identified a problem (or ITSM software has flagged a suspicious trend), it is essential that you create a problem ticket to track that problem through to its resolution.
  • The problem ticket will contain details about the interruption pulled from the incident ticket that initially inspired the problem, or input manually if the ticket is being opened proactively (see phase 3 of this storyboard).
  • The ticket’s status should start as “open.” Once you have identified the root cause, alter the status to “pending change.” Once the change management procedure has been completed, change the status to “resolved” and close the ticket.
  • Populate the ticket with information gleaned through the processes explored in the next step.
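
The status progression described above (open, then pending change, then resolved) can be pictured as a small state machine. The following sketch is one way to model it, not a description of how any ITSM product implements tickets; the class, method, and field names are hypothetical.

```python
from enum import Enum


class ProblemStatus(Enum):
    OPEN = "open"
    PENDING_CHANGE = "pending change"
    RESOLVED = "resolved"


# Each status may only advance to the next one in the lifecycle.
ALLOWED_TRANSITIONS = {
    ProblemStatus.OPEN: {ProblemStatus.PENDING_CHANGE},
    ProblemStatus.PENDING_CHANGE: {ProblemStatus.RESOLVED},
    ProblemStatus.RESOLVED: set(),
}


class ProblemTicket:
    def __init__(self, summary, related_incidents=()):
        self.summary = summary
        self.related_incidents = list(related_incidents)
        self.status = ProblemStatus.OPEN  # tickets start as "open"
        self.root_cause = None
        self.closed = False

    def record_root_cause(self, root_cause):
        """Root cause identified: move to 'pending change'."""
        self._advance(ProblemStatus.PENDING_CHANGE)
        self.root_cause = root_cause

    def complete_change(self):
        """Change management finished: mark 'resolved' and close."""
        self._advance(ProblemStatus.RESOLVED)
        self.closed = True

    def _advance(self, new_status):
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot move from '{self.status.value}' to '{new_status.value}'"
            )
        self.status = new_status


ticket = ProblemTicket("Recurring email outage", ["INC-101", "INC-102"])
ticket.record_root_cause("Mail server disk fills up during nightly backup")
ticket.complete_change()
print(ticket.status.value, ticket.closed)
```

Rejecting out-of-order transitions keeps a ticket from being marked resolved before a root cause has been recorded.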

Sample of Info-Tech's 'Problem Ticket Template'.

Leverage your ITSM tool to track incidents and open problem tickets

All widely available ITSM tools can be used to track incident data. Use these data to track potential problems.

  • ITSM tools have ITIL standards baked right in, including best practice reports and KPIs. Take advantage of these reports and move away from an ad hoc approach to incident management.
  • Every incident has an underlying cause, but not all of these causes are worth confronting. Use your ITSM software to track incident occurrence and determine where root cause analysis is required.
  • Questions to ask:
    • Are particular incidents being reported by the same people over and over again?
    • Is the same sort of incident reoccurring at a similar time over and over again? (Is the server crashing every Tuesday at 10am?)
    • What direction are incident reports trending?
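
Where the ITSM tool's built-in reports do not answer these questions directly, an exported incident log can be checked with a few lines of code. The sketch below assumes a hypothetical log of (reporter, opened-at) pairs and flags repeat reporters and recurring weekday/hour slots; the sample data and thresholds are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical (reporter, opened_at) pairs exported from an ITSM tool.
incident_log = [
    ("user_a", datetime(2024, 1, 2, 10, 5)),    # Tuesday
    ("user_b", datetime(2024, 1, 4, 14, 30)),   # Thursday
    ("user_a", datetime(2024, 1, 9, 10, 12)),   # Tuesday
    ("user_c", datetime(2024, 1, 16, 10, 1)),   # Tuesday
]


def repeat_reporters(log, min_count=2):
    """Reporters who appear at least `min_count` times."""
    counts = Counter(reporter for reporter, _ in log)
    return {r: n for r, n in counts.items() if n >= min_count}


def recurring_slots(log, min_count=3):
    """(weekday, hour) slots with at least `min_count` incidents."""
    counts = Counter((WEEKDAYS[ts.weekday()], ts.hour) for _, ts in log)
    return {slot: n for slot, n in counts.items() if n >= min_count}


print(repeat_reporters(incident_log))
print(recurring_slots(incident_log))
```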

List of vendors: 'A wide variety of vendors offer enterprise ITSM software that has problem management capabilities.'

Step 2.2: Develop a Problem Management Workflow

This step will walk you through the following activities:

  • Conduct a sample root cause analysis
  • Hone the problem management team’s skills
  • Evaluate your organization's tolerance for different types of risk
  • Play priority poker to determine problem priority
  • Develop key performance indicators to track the success of your problem management process
  • Create a meeting schedule for the problem management team

This step involves the following participants:

  • Problem manager
  • Service desk manager
  • Tier 2 service desk support
  • Tier 3 service desk support

Outcomes of this step

  • An understanding of root cause analysis techniques
  • A framework for evaluating risk, and a prioritized list of IT problems
  • A list of key performance indicators
  • A meeting schedule for the problem management team

Identify the root cause of the problem

Once you have identified that a problem is occurring using incident matching techniques, find the root cause of the problem to prevent its recurrence.

  • Root cause analysis is generally employed to determine the origin of an IT problem in order to prevent its recurrence.
  • It involves identifying what has caused the problem specifically, producing factual data to back this up, and demonstrating the logical relationship between cause and effect, generally through some sort of visualization like a diagram.
  • Ideally, once the problem management team has produced a root cause analysis, anyone interested in the problem – whether they have an IT background or not – should be able to understand the conclusions.

Sample of techniques for finding the root cause of the problem.

Info-Tech Insight

Incidents with the same root cause might have different symptoms. Use retroactive matching as part of the problem management process to uncover these incidents and prevent them from recurring.

Root cause analysis is a group effort

The problem management process is an involved process requiring the allocation of resources from a variety of different business areas.

  • The problem manager oversees the root cause analysis team, which is convened to investigate problems, and comprises subject matter experts (specialized tier 2 and tier 3 staff who work in infrastructure or applications), along with any relevant vendors and service providers, who are required to collect and analyze diagnostic data.
  • Use the information gathered in this storyboard’s first phase on incident management to determine who would form the root cause analysis team that will be responsible for resolving incidents of different types. Note: if a member of staff is necessary to find a workaround for a particular issue, that person is more than likely to be involved in the root cause analysis process.
  • The team should convene remotely or in person to conduct their activities.

Example RCA team

  • Problem Manager
  • Senior Applications Developer
  • Business Analyst
  • Network Engineer

A basic formula for who should be a part of the 'Root Cause Analysis' team: 'Problem Manager', 'Upper-Level Support', 'Vendor Support'.

Root cause analysis can be conducted in a variety of ways

Brainstorming/ Process of Elimination — After brainstorming, identify which possible causes are not the service issue’s root cause by removing unlikely causes.

The Five Whys — Use reverse engineering to delve deeper into a service issue to identify a problem’s root cause.

Ishikawa/Fishbone Diagram — Use an Ishikawa/fishbone diagram to identify and narrow down possible causes by categories.

Leverage root cause analysis techniques (process of elimination)

Using the process of elimination might seem obvious, but it can be a powerful tool to determine root causes.

  • To use the process of elimination to determine root cause, gather the members of the RCA team together once the dust has settled to brainstorm a list of potential causes.
  • Like all brainstorming exercises, remember that the purpose is to gather the widest possible variety of perspectives, so be sure not to eliminate any suggested causes out of hand.
  • Once you have an exhaustive list of potential causes, you can begin the process of eliminating unlikely causes in order to arrive at a list of likely potential causes.

Example

Problem: The microwave isn’t working; everyone’s fish is cold.

Potential Causes (Brainstormed)

  • The fish is un-heatable
  • Power has gone out
  • Users are improperly using the microwave
  • The microwave is unplugged
  • The microwave is broken

The strikethroughs represent unlikely causes or causes that have been eliminated empirically by investigation.

Leverage root cause analysis techniques (the five whys)

Repeatedly asking “why” might seem like an overly simplistic approach to uncovering root cause, but it has the potential to be useful.

  • It can be useful, when confronting a problem, to start with the end result and work backwards.
  • According to Olivier Serrat, a knowledge management specialist at the Asian Development Bank, there are three key components that define successful use of the five whys: “(i) accurate and complete statements of problems, (ii) complete honesty in answering the questions, (iii) and the determination to get to the bottom of problems and resolve them.”
  • As a group, develop a consensus around the problem statement. Go around the room and have each person suggest a potential reason for its occurrence. Repeat the process for each potential reason (ask “why?”) until there are no more potential causes to explore.
  • Note: The total number of “whys” might be more or less than five.

Example

Problem: the microwave in the office isn’t working – everyone’s fish is cold. Why?

  • The microwave isn’t plugged in.
    Why?
    • Employee A unplugged it because it kept tripping the breaker.
      Why?
      • There are too many devices plugged in, overloading the circuit.
        Why?
        • Everyone on the second floor brought in space heaters.
          Why?
          • A poorly insulated window means the office is cold.
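
For teams that want to record a five whys session in a problem ticket, the chain above can be captured as a simple ordered list, with the final answer treated as the candidate root cause. This is only a sketch of one possible representation; the function names are hypothetical.

```python
# The problem statement followed by the answer to each successive "why?".
problem = "The microwave in the office isn't working; everyone's fish is cold."
answers = [
    "The microwave isn't plugged in.",
    "Employee A unplugged it because it kept tripping the breaker.",
    "There are too many devices plugged in, overloading the circuit.",
    "Everyone on the second floor brought in space heaters.",
    "A poorly insulated window means the office is cold.",
]


def candidate_root_cause(chain):
    """The last answer in the chain is the selected root cause."""
    if not chain:
        raise ValueError("no answers recorded yet")
    return chain[-1]


def render_chain(statement, chain):
    """Return the chain as indented text, one level per 'why?'."""
    lines = [f"Problem: {statement}"]
    for depth, answer in enumerate(chain, start=1):
        lines.append(f"{'  ' * depth}Why? {answer}")
    return "\n".join(lines)


print(render_chain(problem, answers))
print("Root cause:", candidate_root_cause(answers))
```

The list can hold more or fewer than five entries, in keeping with the note that the number of whys varies by problem.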

Leverage root cause analysis techniques (Ishikawa/fishbone diagram)

Use an Ishikawa/fishbone diagram to sort potential causes by category and match them to the problem.

  • The first step in creating a fishbone diagram is agreeing on a problem statement and writing it in a box on the right-hand side of a whiteboard or a piece of chart paper.
  • Draw a horizontal line left from the box and draw several ribs on either side that will represent the categories of causes you will explore.
  • Label each rib with relevant categories. In the IT context, consider cause categories like: operating system, applications, network, security, server, hardware, and storage. Go around the room and ask, “What causes this problem to happen?” Every result produced should fit into one of the identified categories. Place it there, and continue to brainstorm sub-causes.

An example of an Ishikawa/fishbone diagram, used for root cause analysis. The main problem is 'The microwave isn't working; everyone's fish is cold', and it branches off into categories 'Equipment', 'People', 'Materials', and 'Environment', listing potential causes in each category.

Info-Tech Best Practice

Avoid naming individuals in the fishbone diagram or during any of the other RCA exercises you might undertake. Where possible use titles. The goal of the root cause analysis team is not to lay blame for disciplinary reasons or zero in on a guilty party.

Conduct a sample root cause analysis

Associated Activity icon 2.2.a 1 hour

INPUT: List of problems, RCA techniques

OUTPUT: Sample list of root causes for non-IT issues

Materials: Whiteboard, pens and paper

Participants: Problem manager, Problem management team

Root cause analysis is the single most important duty of the problem manager and his or her team.

  1. Gather the group in a room with an available whiteboard and select a problem from Info-Tech’s List of Sample Problems.
  2. Once you have decided which sample problem you would like to analyze (this can be done individually by the facilitator or by the group), divide the room into several small groups (depending on how many people there are), and distribute pieces of paper.
  3. Have the group brainstorm potential causes for their assigned problem and document them on the diagram based on categories. Note: at this stage, members of the group should feel free to use any of the identified techniques.
  4. Have each group share their identified options and discuss how these options can be narrowed down and used to determine a single root cause through further investigation.

Info-Tech Insight

The nature of the appropriate root cause analysis technique is going to vary with organizational contexts. For some larger organizations with more mature IT operations and a dedicated problem management staff, it might be advisable to engage in more sophisticated root cause analysis.

Validate the root cause that the problem management team has identified

A root cause that has not been validated is simply a hypothesis: the problem management team must work quickly to verify the RCA’s results.

  • Validate the root cause using hard evidence: do not proceed to change management without validating the root cause using data.
  • If possible, the problem management team should work to recreate the incident in order to verify the root cause. Recreating the disruption is the clearest evidence of a root cause, but it is not always practical.
  • Note: this likely goes without saying, but the incident should only be replicated in a controlled setting, not in a production environment. If the root cause of a server outage is found to be water damage, do not take a water bottle to your remaining servers!

Info-Tech Insight

Forgoing a root cause validation step creates the risk of wasting all the hard work of the previous steps by preparing fixes that will not successfully resolve service problems.

Info-Tech Best Practice

When replicating service disruptions, it is not necessary to replicate the outage’s scope. If an entire department’s desktop computers go down, it may only be necessary to replicate the incident on a single computer, for example.

NASA lost its $125 million Mars Climate Orbiter and used root cause analysis to figure out why

CASE STUDY

Industry: Space Exploration
Source: NASA, ThinkReliability

Challenge

The National Aeronautics and Space Administration (NASA) is a US federal agency tasked with exploring space. Part of that mandate involves exploring celestial bodies in the Solar System. To that end, NASA launched the Mars Climate Orbiter in 1998, intended to enter Martian orbit and gather data about the Red Planet. The trip took about nine months, but when the Orbiter arrived, disaster struck. It made contact with the Martian atmosphere and encountered unforeseen atmospheric stress, losing contact with NASA.

Solution

The Orbiter cost about $125 million and had traveled almost 700 million km from home to reach its destination. To discover what went wrong, NASA’s Jet Propulsion Laboratory (JPL) commissioned a report from a Mishap Investigation Board (MIB), which received briefings on the spacecraft, its software, and its navigation, and conducted a root cause analysis that involved individual brainstorming and group discussion.

Results

The MIB and others tasked with investigating the root cause of the Orbiter’s unfortunate Martian encounter determined that the root cause was the failure to convert Imperial units to metric units. The JPL’s director summarized the findings: “Our inability to recognize and correct this simple error has had major implications.” In response to the MCO’s loss, NASA created new workplans, engaged in preventative fault-tree analyses, and began employing independent peer review of all of its missions.

Identify solutions based on the validated root cause

Leverage your understanding of root causes to develop effective and lasting solutions to IT problems.

  • Once the problem management team has identified and validated a root cause, it becomes the team’s responsibility to identify a permanent solution to the problem that will prevent it from recurring in any form.
  • The nature of the solution is, predictably, going to vary with the problem at hand, and the permissions and expertise required to implement that solution are going to vary as well.
  • Ensure that the problem management team understands who in the organization they can forward their findings to.
  • Solutions can be simple: if the root cause of the problem is a change made by a person who did not have the appropriate knowledge or experience to be making it, an effective and permanent solution might be locking that person and others like him/her out of the system using administrator controls.
  • Solutions can be more involved: if the cause of a persistent problem, like the repeated crashing of an application, is determined to be the result of a piece of bad code, tier 3 applications developers may have to patch it.

Info-Tech Insight

The best solution to the problem is one that requires the fewest resources. If the solution can be implemented in a timely manner by a lower-level tech, this is the ideal outcome. This is true even where the problem is serious.

Root cause analysis may not always be fruitful

Even the most well-equipped problem management team is going to confront problems that defy simple resolution.

The problem management team has several options if a root cause analysis doesn’t produce a useful result in its first iteration.

Option | Description
Seek out greater expertise | Escalate the investigation to a higher level of the support structure.
  • Contact a subject matter expert to leverage their knowledge of the technology or service involved with the problem.
  • Level 2 or level 3 support may have some insight into the problem based on past experiences.
Engage vendors | Make enquiries with the vendor to determine if they are aware of the problem – it may be a known error with a course of action already outlined.
Apply multiple iterations (try again) | If root cause cannot be found during the first attempt, go back to the brainstorming phase and investigate alternative causes.
Live with it | Not an ideal option, but a possible reality if a viable workaround exists and the impact of the problem is manageable. If the problem is low risk, this is more advisable.

Hone the problem management team’s skills

Associated Activity icon 2.2.b 1 hour

INPUT: List of problems, List of causes

OUTPUT: List of potential solutions to the sample problems

Materials: Whiteboard, pens and paper

Participants: Problem manager, Problem management team

Using Info-Tech’s list of sample problems as a starting point, validate the root causes of sample problems.

  1. Write the list of potential root causes developed in 2.2.a on a whiteboard.
  2. Brainstorm potential ways that the group could validate the root cause (the sorts of evidence that would be required and available, whether or not it would be feasible to replicate the problem).
  3. Assuming, then, that the root causes identified can be considered validated, go around the room and brainstorm potential solutions. What changes can be implemented that would eliminate the root cause and prevent the incidents associated with the problem from recurring?
  4. Discuss the solutions as a group, and decide on an appropriate solution for each problem.

Info-Tech Insight

Not all solutions are created equal. Just because a particular change will eliminate the root cause of a problem, it does not mean that it is ideal (e.g. firing all staff would reduce the number of password lockouts, but it would come with other problems).

Initiate change management where necessary

Following the successful identification and validation of a root cause, and the identification of a solution, the change management process begins.

  • The role of the problem manager is to identify where changes need to be made, and what those changes need to be. The problem manager is not responsible for implementing that change, though they are responsible for everything leading up to that point.
  • The problem cannot be considered “closed” until change management has successfully implemented the solution. The problem management team needs to keep track of problem tickets as they (the tickets) worm their way through the change management process, but their role beyond that is limited.
  • Using ITSM tools, the problem management team can update the status of each problem and this should be broadcast to all stakeholders using the problem management dashboard.
  1. Problem Investigation Begins
  2. Workaround Identified
  3. Root Cause Identified
  4. Solution Identified
  5. Change Management Initiated
  6. Solution Implemented

The Four Types of Work (from The Phoenix Project)

  1. Business Projects: IT works on a project for the business, usually overseen by a central project management office.
  2. IT Projects: IT engages in operations projects that are tied to the business, along with internal improvement projects.
  3. Changes: IT implements process improvements and other alterations to existing processes.
  4. Unplanned Work: IT often finds itself pulled into the resolution of operational incidents, taking time away from project work.

Reduce unplanned work through problem management; reap significant business benefits

Four gears of different sizes, largest being 'Business Projects', then 'IT projects', then 'Changes', and the smallest is also highlighted 'Unplanned work'.
  • Incidents involve service outages or service degradation, and to the extent that these are intolerable to the organization, they need to be addressed by the service desk and by upper tiers of support within the SLA when they occur.
  • Problem management, though, begins as soon as a problem ticket is opened, and, since the incident management team has already restored service and discovered and documented a workaround, the problem management team has a certain level of temporal discretion when it comes to identifying the root cause and forwarding a solution to change management.
  • Problems, in short, are schedulable: reduce the volume of unplanned work by identifying problems and placing them in the change management queue, where they become project work, and where the benefits of effective project management practices can be applied.
  • In sum: reduce unplanned work through problem management; reap significant business benefits.

Problem management is risk management; how much of a threat does an unsolved problem pose?

Understand the risk tolerance curve

  • Some problems are more important than others. Make sure that the majority of your problem management resources are devoted to the problems that are at the top of your queue.
  • Every problem has the potential to cause service interruption or degradation at some point in the future. The problem manager needs to evaluate the problem’s potential impact against the likelihood of its occurrence.
  • Problems that are unlikely to occur, and would have a minimal impact if they did, belong at the bottom of the queue. Problems with the inverse characteristics belong at the top.

A graph of the risk tolerance curve with x-axis 'Frequency' and y-axis 'Impact'. The line is labelled 'Risk tolerance' and is a concave curve starting at high impact, low frequency and ending at low impact, high frequency with a bend toward the origin point. Below the line is 'Tolerable risk' and above it is 'Unacceptable risk'.

Info-Tech Insight

What constitutes acceptable risk will vary by organization, but the framework is consistent. Systematize your risk tolerance framework using the activity in step 2.2.c.

Risk is impact and frequency

Impact and frequency vary with your organization’s priority. Begin to understand how risky problems are by understanding impact and frequency.

Impact

High: High-impact problems are those that have the potential to cause significant disruptions (i.e. they are likely to cause one or more incidents of severity 1 or severity 2).

Medium: Medium-impact problems are likely to cause noticeable disruptions that need to be addressed, but not incidents as critical as those caused by high-impact problems. This category is roughly analogous to severity 3 incidents.

Low: Low-impact problems are unlikely to be the cause of significant disruption, but they will cause disruption. Consider these analogous to severity 4 and 5 incidents.

Frequency

High: High-frequency problems are likely to produce incidents regularly. Whatever the extent of the disruption, that disruption is frequent.

Medium: Medium frequency problems sit in the middle – while they are not a regular occurrence, they happen frequently enough that anyone with experience in the organization’s service management will be familiar with them. The exact frequency will vary by organization, but they will never be all-consuming or unheard of.

Low: Low-frequency problems occur rarely. More experienced service desk management staff might have encountered this problem over the course of their careers, but newer staff will likely have to wait a while before they encounter these problems (though this depends on the size of the organization).

Evaluate your organization's tolerance for different types of risk

Associated Activity icon 2.2.c 1 hour

INPUT: List of problems identified in 2.1.b

OUTPUT: List of up to 10 top problems

Materials: Whiteboard, pens and paper, Risk Assessment Tool

Participants: Problem manager, Problem management team

Problem management is risk management. Using the list of problems developed in 2.1.b, sort problems by their risk.

  1. Begin the exercise by writing the list of problems developed in section 2.1.b on a whiteboard. Have the members of the group add any problems that might have been missed in this initial exercise that they think are likely to occur (or that have occurred in their experience).
  2. On a sheet of paper, have each person draw a chart with three headings: problem, impact, and frequency, and have each person evaluate the impact and frequency of each of the problems by ranking them high, medium, or low.
  3. The facilitator should input the results of the exercise into the Risk Assessment Tool, which will identify the top problems out of the general list (it will identify the top 10 out of a potential 30 problems).
Problem | Impact | Frequency
Server crashes once per week | Medium | Medium
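
The scoring step can be sketched in Python. This is an illustration only: the 1-3 numeric scale, the impact × frequency product, and the function names `risk_score` and `top_problems` are assumptions made for the example, and the Risk Assessment Tool itself may weigh answers differently.

```python
from statistics import mean

# Assumed numeric scale for the high/medium/low ratings (not taken from the tool).
SCALE = {"low": 1, "medium": 2, "high": 3}


def risk_score(ratings):
    """Average each participant's impact x frequency product for one problem."""
    return mean(SCALE[impact] * SCALE[frequency] for impact, frequency in ratings)


def top_problems(responses, limit=10):
    """Rank problems by average risk score, highest first, and keep `limit` of them."""
    ranked = sorted(responses, key=lambda name: risk_score(responses[name]), reverse=True)
    return ranked[:limit]


# Hypothetical responses: problem -> one (impact, frequency) pair per participant.
responses = {
    "Server crashes once per week": [("medium", "medium"), ("high", "medium")],
    "Printer driver conflict": [("low", "low"), ("low", "medium")],
    "VPN drops during backups": [("high", "high"), ("high", "medium")],
}

print(top_problems(responses, limit=2))
```

Averaging across participants keeps one outlying rating from dominating the ranking; an organization could just as reasonably use the median or a weighted scale.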

Play priority poker to determine problem priority

Associated Activity icon 2.2.d 1 hour

INPUT: List of up to 10 risky problems

OUTPUT: Prioritized list of problems

Materials: Index cards, pens or markers

Participants: Problem manager, Problem management team

Prepare a set of index cards. Inscribe each one with a problem from the list of top problems identified in 2.2.c. Create a full set for each participant.

Steps:
  1. Seat everyone randomly around the table.
  2. Have each participant sort their card stack in order of perceived priority, highest on top.
  3. Have each participant take their five lowest-priority cards, put a tick mark in the upper-right corner of each, and pass these cards to the person on their left, who should incorporate them into their own pile.
  4. Repeat steps 2 and 3 four more times, passing one fewer card to the left in each iteration. Duplicates in a participant's hand must be assigned the same priority.
  5. For the final pass, each participant will write the priority in the upper-left corner.
  6. Collect all the cards for discussion.

Discussion:

Total the number of passes for each problem. A large number indicates a notionally low priority. No passes indicates a high priority. You can also look at averages to establish group priorities (total number of cards in a set minus the number in the upper-left corner, averaged over all the participants' sets).
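
The tally described above can be sketched in a few lines of Python. The card representation (a problem name paired with the number of tick marks on that card) and the function names are assumptions made for this example.

```python
from collections import Counter


def total_passes(cards):
    """Sum the tick marks recorded on every card, grouped by problem."""
    totals = Counter()
    for problem, ticks in cards:
        totals[problem] += ticks
    return totals


def rank_by_passes(cards):
    """Order problems from fewest passes (highest priority) to most passes."""
    totals = total_passes(cards)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(totals, key=lambda problem: (totals[problem], problem))


# Hypothetical collected cards: (problem, tick marks on that card).
cards = [
    ("Email outage", 0), ("Email outage", 0),
    ("Slow file share", 2), ("Slow file share", 1),
    ("Password lockouts", 4), ("Password lockouts", 5),
]

print(rank_by_passes(cards))
```

The resulting order is a starting point for the group discussion, not a replacement for it.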

Use SMART success metrics to define your objectives

Specific

Make sure the objective is clear and detailed.

Measurable

Objectives are measurable if there are specific metrics assigned to measure success. Metrics should be objective.

Actionable

Objectives become actionable when specific initiatives designed to achieve the objective are identified.

Realistic

Objectives must be achievable given your current resources or known available resources.

Time Bound

An objective without a timeline can be put off indefinitely. Furthermore, measuring success is challenging without a timeline.

Guiding questions:

  • Specific: Who, what, where, why?
  • Measurable: How will you measure the extent to which the goal is met?
  • Actionable: What is the action-oriented verb?
  • Realistic: Is this within my capabilities?
  • Time Bound: By when: deadline, frequency?

Info-Tech Best Practice

Problem management without proper metrics will be unsuccessful. Use SMART metrics to contribute to the business case for continued investment in problem management that will keep executives on board.

Develop key performance indicators to track the success of your problem management process

Associated Activity icon 2.2.e 30 minutes

INPUT: Business goals, SMART Metrics

OUTPUT: List of key performance indicators

Materials: Whiteboard, pens and paper

Participants: Problem manager, Problem management team

Key performance indicators for problem management are going to be fairly constant across different organizations, but it is still valuable to go around the group and have a discussion about the merits of the obvious KPIs.

  1. Go around the group and have participants pitch categories that potential KPIs would fit into.
    • These buckets might include change management, root cause analysis, service restoration, problem frequency/severity, and incident resolution, though this list is not exhaustive.
  2. Write the broad categories on the board and begin to brainstorm KPIs. The facilitator should record each KPI in the category it most closely aligns with.
  3. Once the brainstorming is complete, go through the KPIs and eliminate duplicates and any that do not align with the SMART metrics (e.g. if they are unrealistic or unspecific).
  4. Once you have arrived at a list of key performance indicators, record them in section 7.1 of the Problem Management SOP.

List of key performance indicators

Key performance indicators might vary by organization, but at their core they should be relatively constant. (Sources: ISACA, ITSM Review)

Key performance indicator | Description
Number of incidents per problem | When conducting reactive problem management, how many incidents are linked to each problem ticket?
Mean time to root cause (MTRC) | How long does it take the problem management team to find the root cause of the problem?
Problem resolution effort | How much does it cost to resolve the problem? (To measure this, use mean time to root cause and FTE metrics.)
Provision of root cause analysis documentation | How often is the problem management team meeting its RCA report SLA?
Average problem severity | How many problems are at the higher end of the risk scale? The fewer the better.
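
Two of these indicators are simple enough to compute directly from ticket data. The sketch below is a minimal example; the ticket fields (`opened_at`, `root_cause_at`, `incident_ids`) are hypothetical and would need to be mapped to whatever your ITSM tool actually exports.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_root_cause(tickets):
    """Average hours from ticket opening to root cause, skipping open investigations."""
    hours = [
        (t["root_cause_at"] - t["opened_at"]).total_seconds() / 3600
        for t in tickets
        if t["root_cause_at"] is not None
    ]
    return mean(hours) if hours else None


def incidents_per_problem(tickets):
    """Average number of incidents linked to each problem ticket."""
    return mean(len(t["incident_ids"]) for t in tickets)


# Hypothetical problem tickets.
tickets = [
    {"opened_at": datetime(2024, 1, 1, 9), "root_cause_at": datetime(2024, 1, 2, 9),
     "incident_ids": ["I1", "I2", "I3"]},
    {"opened_at": datetime(2024, 1, 3, 9), "root_cause_at": datetime(2024, 1, 5, 9),
     "incident_ids": ["I4"]},
    {"opened_at": datetime(2024, 1, 6, 9), "root_cause_at": None,
     "incident_ids": ["I5", "I6"]},
]

print(mean_time_to_root_cause(tickets), incidents_per_problem(tickets))
```

Tickets without a root cause are excluded from MTRC here; a team could instead count them up to the current date if it wants stalled investigations to show up in the metric.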

Track the status of ongoing problems using a problem management dashboard

Tracking problems by their origins and status is an effective way to ensure accountability and resolution.

  • The dashboard can be created using an existing service management tool or, at the very least, the inputs from a service management tool.
  • The problem dashboard should have the following information, ideally presented in an aesthetically pleasing way:
    • The number of problem tickets opened for a given period (generally monthly, but this could vary by organization).
    • Number of problem tickets opened by their assigned priorities.
    • Length of time each problem has been open.
    • Problems’ status (open, resolved, new, assigned, cause found, risk accepted etc.) sorted by priority.
    • Problems’ status sorted by department.
    • Mean time to resolution.
    • Mean time to diagnosis.
  • The information in the problem dashboard should be available to members of the problem management team, as well as to the business – the former to streamline the problem management process by keeping a central repository of problem knowledge, and the latter to improve transparency and to justify the problem management squad’s effectiveness.

Info-Tech Insight

If impact and frequency are low, a problem is likely to be near the bottom of the priority list. It may not be worth it to devote resources to resolving the problem. This is the “risk accepted” category. The problem is in the register and should be added to the KEDB, but it has gone gold.

Create a meeting schedule for the problem management team

Associated Activity icon 2.2.f 20 minutes

Optimal problem management involves holding meetings that are regular (as opposed to ad hoc), consistent in terms of membership, focused, and retrospective.

  1. Regular: depending on the size of your organization and the amount of data that needs to be parsed, the problem management group should meet once or twice per month.
  2. Consistent: the working group needs to comprise the problem manager, the service desk manager, and staff in tiers 2 and 3 who have expertise in particular applications and processes. The meeting should be scheduled so that its attendees remain consistent over time.
  3. Focused: the meeting should consist of the group going over the problem management dashboard, and integrating new problems into the queue based on their assigned priority. They should also take this time to move problems that have been in the queue for a long time into the “risk accepted” category.
  4. Retrospective: the group should take some time at the end of each meeting to discuss the status of the problems they have identified and prioritized in the last meeting. Any hiccups in the resolution process should be discussed here.

Info-Tech Insight

It bears repeating: this meeting can be scheduled; it is not unplanned work. Having regular proactive problem management meetings has the double effect of reducing the volume of unplanned work through scheduling problem management and reducing incidents.

If you want additional support, have our analysts guide you through this phase as part of an Info-Tech workshop

Book a workshop with our Info-Tech analysts:

  • To accelerate this project, engage your IT team in an Info-Tech workshop with an Info-Tech analyst team.
  • Info-Tech analysts will join you and your team onsite at your location or welcome you to Info-Tech's historic Toronto office to participate in an innovative onsite workshop.
  • Contact your account manager (www.infotech.com/account), or email Workshops@InfoTech.com for more information.

The following are sample activities that will be conducted by Info-Tech analysts with your team:

2.2

Sample of activity 2.2.c 'Evaluate your organization's tolerance for different types of risk'.

Evaluate your organization’s tolerance for different types of risk

The analyst will guide workshop participants through an exercise that will help them evaluate how risky different IT problems are. Each participant will score problems by their impact and frequency, and the distribution of answers will inform the analysis going forward.

Phase 2 Guided Implementation

Call 1-888-670-8889 or email GuidedImplementations@InfoTech.com for more information.

Complete these steps on your own, or call us to complete a guided implementation. A guided implementation is a series of 2-3 advisory calls that help you execute each phase of a project. They are included in most advisory memberships.

Guided Implementation 2: Develop problem management procedures

Proposed Time to Completion: 3 weeks

Step 2.1: Identify problems
Start with an analyst kick-off call:
  • Outline the benefits of a problem management regimen and the required resources.
Then complete these activities…
  • Define the problem manager’s role.
  • Separate incidents from problems.
  • Develop a process to match related incidents.
  • Create a problem ticket.
With these tools & templates:
  • Problem Management SOP
  • Problem Ticket Template

Step 2.2: Develop a problem management workflow
Review findings with analyst:
  • Review the separated lists of incidents and problems.
  • Review the incident matching procedure.
Then complete these activities…
  • Conduct a sample root cause analysis.
  • Hone the problem management team’s skills.
  • Evaluate your organization's tolerance for different types of risk.
  • Play priority poker to determine problem priority.
  • Develop key performance indicators to track the success of your problem management process.
  • Create a meeting schedule for the problem management team.
With these tools & templates:
  • Risk Assessment Tool
  • Problem Management SOP

Phase 2 Results & Insights

  • Incidents defy planning, but problem management is schedulable. Schedule problem management; reduce unplanned work.
  • Not all problems are worth solving. If the risk a problem poses does not outweigh the cost of resolving it, it might not be worth solving.


PHASE 3

Engage in Proactive Problem Management

Step 3.1: Predict Problems

PHASE 1: 1.1 Identify Critical Incidents; 1.2 Develop a Critical Incident Workflow
PHASE 2: 2.1 Identify Problems; 2.2 Develop a Problem Management Workflow
PHASE 3: 3.1 Predict Problems; 3.2 Communicate the Importance of Proactivity

This step will walk you through the following activities:

  • Identify how to respond to alerts generated by event management
  • Set thresholds for event management

This step involves the following participants:

  • Problem manager
  • Service desk manager
  • Tier 2 service desk support
  • Tier 3 service desk support

Outcomes of this step

  • Completed problem management standard operating procedure (proactive section)

Proactive problem management workflow

The proactive problem management workflow is simple: monitor data sources – specifically event management – and, if certain thresholds are passed, open a problem ticket and initiate the problem management process.

A workflow for proactive problem management starts with 'Event Management System', 'Event Alert', 'Cause for Concern', then either 'Open Problem Ticket' or 'Ignore'.

Info-Tech Insight

Proactive problem management is not complicated. It is, however, important to get the thresholds right. Too many false positives, and the problem management team might ignore the source. Too many false negatives and problems might slip through.
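
The threshold decision in the workflow can be sketched as follows. This is a minimal illustration, not a prescribed method: the per-period alert count, the function names, and the simple "nudge up or down by one" tuning rule are all assumptions made for the example.

```python
def triage_alert(alert_count, threshold):
    """Return the workflow branch for one monitored source in one period."""
    return "open problem ticket" if alert_count >= threshold else "ignore"


def tune_threshold(threshold, false_positives, false_negatives, step=1):
    """Raise the threshold after excess false positives, lower it after excess misses."""
    if false_positives > false_negatives:
        return threshold + step
    if false_negatives > false_positives:
        return max(1, threshold - step)  # never drop below one alert
    return threshold


# Hypothetical review: 12 disk-space alerts this week against a threshold of 10.
print(triage_alert(12, threshold=10))
# Last month produced 3 false positives and no missed problems, so loosen slightly.
print(tune_threshold(10, false_positives=3, false_negatives=0))
```

Reviewing false positives and false negatives on a regular schedule, for example at the problem management meeting, gives the team a natural point at which to adjust thresholds.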

Proactive problem management is about noticing and addressing problems before they cause an interruption

Odds are your organization collects the data you need for proactive problem management. Take advantage of it!

  • According to ITIL, a problem is a cause of one or more incidents, or the revelation that some system is not working the way that it is expected to, even if service has not yet been degraded or interrupted. In other words, nothing needs to have gone amiss – no incidents need to be reported – for a problem to enter the register.
  • That does not mean, however, that these problems are undetectable. ITSM tools collect significant amounts of data about incidents, service requests, and other relevant fields that can be used to inform the problem management process.
  • Instead of waiting for incidents and problems to come to you, seek out problems and initiate change management before end users experience any disruption or degradation.

“A Problem is a problem, whether it has caused an Incident yet or not.” (Rob England, Managing Director, Two Hills Ltd, Blogger at ITskeptic.org)

Proactive problem management = Fewer recurring incidents and prevented critical incidents = Improved service quality = Enhanced business productivity

Incident management, (reactive) problem management, and proactive problem management are similar

Understand these different processes using the structural integrity of built structures as a metaphor.

Incident management: The building has collapsed, and rescuers respond to shut off broken gas lines, ensure that everyone has emerged safely, and control traffic on the street around the site of the collapse.

(Reactive) problem management: This building collapse was very serious. It cost millions of dollars, and anyone inside could have been hurt. It is in the interest of city inspectors to make preventing another, similar collapse a high priority, so they comb the site for clues and discover that the rebar holding the concrete up had been exposed to the elements and rusted, causing a critical structural failure.

Proactive problem management: Rewind to before the collapse. Mindful of the fact that buildings do occasionally collapse, city workers conduct routine annual inspections, looking for structural flaws that could be potentially problematic. Despite the fact that the building is still functioning normally (i.e. nobody inside notices anything wrong), they notice some exposed rebar, and they order the building condemned until the property managers can get a contractor in to repair the damage and guarantee employee safety.

Info-Tech Insight

Just because a problem has not caused an incident does not mean that it never will. The most effective way to maximize uptime is to prevent interruptions from occurring at all. The business will thank you.

Proactive problem management is the result of data inputs from a variety of sources

Leverage your ITSM tool, vendor/industry information, and personal experience to identify problems.

  • ITSM tools provide records of previous incidents, service requests, and other interactions between end users and the service desk, but this is not the sole source of data that can be used for proactive problem management.
  • Keep your ear to the ground for industry information, as well as specific information from vendors. If other organizations are having problems with a tool you are using, it might be worthwhile to take their reports seriously and have a look at your own circumstances. Similarly, keep your eye on updates offered by vendors. When vendors update their software, sometimes they alter it in ways that can impact the end user experience enough to cause problems. Read patch notes, and don’t sever your relationship with the vendor after purchase.
  • Finally, leverage your IT experience and your personal understanding of how things should work. It’s difficult to quantify, but if something is not adding up (if something seems amiss based on your experience or that of senior staff), an incident may well be on the way. Look into it further, and open a problem ticket if the investigation turns up anything.
  1. Data Input
  2. Proactive Problem Management
  3. Problem Identified
  4. Problem Resolved

(Adapted from Neven Zitek)

An event management script helped one company get in front of support calls

CASE STUDY

Industry: Research/Advisory
Source: Anonymous

Challenge

One staff member’s workstation had been infected with a virus that was probing the network with a wide variety of user names and passwords, trying to find an entry point. Along with the obvious security threat, there existed the more mundane concern that workers occasionally found themselves locked out of their machine and needed to contact the service desk to regain access.

Solution

The systems administrator wrote a script that runs hourly to see if there is a problem with an individual’s workstation. The script records the computer's name, the user involved, the reason for the password lockout, and the number of bad login attempts. If the IT technician on duty notices a greater than normal volume of bad password attempts coming from a single account, they will reach out to the account holder and inquire about potential issues.
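
A minimal sketch of what such a check might look like, in Python. The record fields mirror what the case study says the script captured. The `LockoutEvent` type, the sample records, the `flag_unusual_accounts` function, and its fixed `threshold` are all assumptions made for illustration; this is not the administrator's actual script, and the collection of events (for example, from directory service logs) is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LockoutEvent:
    """One lockout record, with the fields the case study describes."""
    computer: str
    user: str
    reason: str
    bad_attempts: int


def flag_unusual_accounts(events, threshold=10):
    """Return {user: total bad attempts} for accounts above the threshold.

    The threshold is a hypothetical stand-in for "greater than normal
    volume"; each organization would tune it to its own baseline.
    """
    totals = Counter()
    for event in events:
        totals[event.user] += event.bad_attempts
    return {user: count for user, count in totals.items() if count > threshold}


# Hypothetical sample of one hour's records.
sample = [
    LockoutEvent("WS-014", "asmith", "bad password", 3),
    LockoutEvent("WS-022", "jdoe", "bad password", 9),
    LockoutEvent("WS-031", "jdoe", "bad password", 8),
]
flagged = flag_unusual_accounts(sample)
print(flagged)
```

In this made-up sample, only the account with attempts spread across two workstations crosses the threshold, which is the kind of pattern the technician on duty would follow up on.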

Results

The IT department has proactively managed two distinct but related problems. First, it has prevented a number of instances of unplanned work by reaching out to users at risk of lockout before they file an incident report. Second, it has leveraged event management to probe for indicators of a security threat before there is a breach.

Gather the necessary data from your ITSM tool

Effective proactive problem management requires a well-developed incident and problem management process.

  • Proactive problem management can save your organization money, but it requires effective data. Remember: when it comes to problem management, the maxim “garbage in, garbage out” applies.
  • Before engaging in a systematic proactive problem management process, ensure that you are gathering comprehensive and accurate data about your IT operations through incident management and reactive problem management systems that not only resolve incidents and problems, but record all relevant information about those problems and incidents.

[Figure: An illustration of the problem management maxim “garbage in, garbage out”: a garbage can -> “Proactive problem management” -> another garbage can.]

“…in order to accurately predict the future in complex trending, the currency that is required is data. Good quality, believable, accurate data.” (Steve White and Robert Kolaczynski, IT process improvement consultants, Kepner-Tregoe)

Leverage event management as an input to proactive problem management

  • Event management provides critical information in the form of alert messages, which are reviewed and acted upon as required to proactively fix a problem and prevent an outage from occurring.
  • Its objectives include:
    • Detect, interpret, and initiate the necessary actions related to alerts.
    • Serve as the foundation for operational monitoring and control.
    • Supply critical operational notifications such as warnings and exceptions.
    • Improve service quality and reporting practices to enable continuous improvement within service management.
  • Understand the event management process:
    1. Event occurs – Deviation in status of component
    2. Event detection – Monitoring identifies deviation
    3. Event notification – Alert generated and sent
    4. Event evaluation – Review of alert by support staff
    5. Event correlation – Identify necessary actions
    6. Event escalation – Recorded as a problem record and investigated through root cause analysis

Event management will involve a significant number of alerts; separate the serious from the trivial

Events are occurring constantly, but only a portion are serious warnings or errors that need to be addressed.

Event categories:
  • Exceptions: alarms indicate failure
    • Application failure
    • Operating system error
    • Disk error
  • Alerts: indicate exceeded thresholds
    • Network usage
    • Disk utilization
    • CPU utilization
  • Normal operation
    • Download completes
    • VPN session spins up

Event alerts are identified as:
  • Informational
    • Logged with no action required
  • Exceptional
    • Forwarded to problem management for root cause analysis
  • Warning
    • Event is evaluated, determined to be problematic, and has action determined
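
The categories above can be sketched as a small classification routine. This is an illustration only: the event dictionaries, the `THRESHOLDS` values, and the `classify` function are hypothetical, and a real monitoring tool would supply its own event format and rules.

```python
from enum import Enum


class Severity(Enum):
    """The three alert types, each paired with its handling."""
    INFORMATIONAL = "log only, no action required"
    WARNING = "evaluate and determine an action"
    EXCEPTION = "forward to problem management for root cause analysis"


# Hypothetical thresholds, expressed as fractions of capacity.
THRESHOLDS = {"cpu": 0.90, "disk": 0.85, "network": 0.80}


def classify(event):
    """Classify an event dict into one of the three severities."""
    if event.get("failed"):
        return Severity.EXCEPTION
    limit = THRESHOLDS.get(event.get("metric"))
    if limit is not None and event.get("value", 0.0) > limit:
        return Severity.WARNING
    return Severity.INFORMATIONAL


# Hypothetical events: a failure, an exceeded threshold, normal operation.
events = [
    {"source": "app01", "failed": True},
    {"source": "db02", "metric": "disk", "value": 0.91},
    {"source": "vpn", "metric": None},
]
for e in events:
    severity = classify(e)
    print(e["source"], severity.name, "->", severity.value)
```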

Info-Tech Best Practice

An organization must set its own thresholds and event monitoring criteria based on its operational needs. Events triggering an action should be reviewed via an assessment of the potential impact and the associated risks.

Identify how to respond to alerts generated by event management

Activity 3.1.a (30 minutes)

INPUT: List of events generated by event management

OUTPUT: Action plan for various events as they occur

Materials: Whiteboard, pens and paper

Participants: Problem manager, Problem management team

Alerts generated by event management are among the most useful data sources for the problem manager, provided the response is appropriate.

  1. Divide the participants into groups (two to three individuals each) and distribute the list of alerts the organization has received. Have each group analyze the alerts and determine if they require action or can be ignored.
  2. Have each group present their findings to the full group. Via discussion, have the group identify the value of event management as a means to monitor and proactively fix issues with their services before users experience them.
  3. Questions to ask:
    • Why does this event warrant a response?
    • How could we improve our event management process?
    • What event alerts would have helped us with root cause analysis in the past?
  4. Record the findings on the whiteboard and keep them in place for the next activity.

Set thresholds for event management

Activity 3.1.a (1 hour)

INPUT: List of problems from 2.1.c.

OUTPUT: List of problem management thresholds.

Materials: Whiteboard, pens and paper

Participants: Problem manager, Problem management team

Monitoring is more complicated than setting and forgetting. Outline thresholds that will highlight potential threats.

If the proactive problem management team sets the threshold for an alert too low (e.g. an alert is generated every time CPU load reaches 60% of capacity), they will generate too many false positives and create far too much work for themselves, leading to alert fatigue. If they go the other direction and set their thresholds too high, there will be too many false negatives: problems will slip through and cause future disruptions.
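
The trade-off can be made concrete with a small calculation. The sketch below assumes a hypothetical history of peak CPU readings, each labeled with whether an incident followed; the data and the `alert_errors` helper are invented for illustration.

```python
# Hypothetical history: (peak CPU load, whether an incident followed).
history = [
    (0.55, False), (0.62, False), (0.65, False), (0.71, False),
    (0.78, False), (0.83, True), (0.88, False), (0.92, True),
    (0.96, True), (0.99, True),
]


def alert_errors(history, threshold):
    """Count false positives and false negatives at a given threshold."""
    false_pos = sum(1 for load, incident in history
                    if load >= threshold and not incident)
    false_neg = sum(1 for load, incident in history
                    if load < threshold and incident)
    return false_pos, false_neg


for threshold in (0.60, 0.80, 0.95):
    fp, fn = alert_errors(history, threshold)
    print(f"threshold {threshold:.0%}: {fp} false positives, {fn} false negatives")
```

With this made-up data, a 60% threshold raises five alerts that lead nowhere, while a 95% threshold misses two of the four incidents. Running the same comparison against real monitoring history is one way to ground the threshold discussion in the activity that follows.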

  1. Write the list of problems identified in 2.1.c and their potential root causes on the whiteboard, and work through it with the group. The goal of the exercise is to come up with ways the problems could have been spotted before incidents created disruptions.
  2. Questions to ask:
    • What are some benign signs of this problem?
    • Is there something we could have monitored that would have alerted us to this issue before an incident occurred?
    • Should anyone have noticed this problem? Who? Why? How?
  3. Go through this for each of the problems identified and discuss thresholds. When complete, include the list in the Problem Management SOP.

Once the proactive problem management process has produced results, integrate them into the workflow

Proactive problem management fits snugly into the problem management workflow.

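Proactive Problem Management:
  Data Input -> Proactive Problem Management -> Problem Identified

Problem Management:
  Problem Investigation Begins -> Workaround Identified -> Root Cause Identified -> Solution Identified -> Change Management Initiated -> Solution Implemented

Link between the two workflows:
  Problem Identified (proactive) -> Root Cause Identified (problem management)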

Proactive problem management differs from reactive problem management in that the problems it identifies do not require an immediate workaround because they have not yet caused incidents. Once a problem has been identified (whether that identification is proactive or reactive), however, it should be placed into the problem register and assigned a priority by the problem management team.

Step 3.2: Communicate the Importance of Proactivity

This step will walk you through the following activities:

  • Outline the benefits of incident management to stakeholders in the business
  • Define the resources and the buy-in necessary to conduct effective critical/major incident management
  • Outline the benefits of problem management
  • Define the resources and the buy-in necessary to conduct effective problem management
  • Outline the benefits of proactive problem management

This step involves the following participants:

  • Problem manager
  • Service desk manager
  • Tier 2 service desk support
  • Tier 3 service desk support

Outcomes of this step

  • Completed stakeholder communication deck

Create an effective communication plan

Stakeholder buy-in is crucial. Show, don’t tell, and get executives and end users on board with your incident and problem management plan.

  • Getting stakeholders to buy in will be among the most difficult tasks you face as you improve the incident and problem management process. Create an effective communications deck that illustrates to stakeholders why it is important to engage in incident management, problem management, and proactive problem management.
  • After the fire is out, it can be difficult to convince stakeholders to invest in fire prevention; some executives might think that incident management is enough. Change their minds with an effective presentation of the benefits of proactive problem management.

Use Info-Tech’s Incident and Problem Management Communication Deck to make the case for incident and problem management.

Projects without executive buy-in are likely to fail

Properly packaged, proactive problem management is easy to sell because of its obvious benefits for the organization.

  • A study produced by the Project Management Institute reports that, while most respondents recognize that communicating organizational objectives to key stakeholders is very important, a smaller proportion actually do it. (Requirements Management: A Core Competency for Project and Program Success)
  • The weak relationship between those carrying out projects and the C-suite executives who nominally oversee them is largely to blame when projects fail. Alignment and buy-in are essential for success.

“[The] lack of alignment of projects to organizational strategy most likely contributes to the surprising result that nearly one half of strategic initiatives (44 percent) are reported as unsuccessful.” (Project Management Institute, The High Cost of Low Performance)

Info-Tech Insight

Instead of bending IT strategy to existing organizational goals, use data to drive those goals. Convince stakeholders of the benefits of incident management, problem management, and proactive problem management, own the initiative, and improve your standing.

Outline the benefits of incident management to stakeholders in the business

Supporting Tool 3.2.a: Populate Info-Tech’s Incident and Problem Management Communication Deck

Use internal data to make the case for incident management by calculating how much money recurring incidents cost your organization.

  • Use the data collected in your ITSM tool to calculate costs.
  • Multiply the mean time to resolution of major/critical incidents by the number of incidents to produce a measure of total time spent resolving critical incidents.
  • Multiply this figure by the average hourly FTE cost of the staff responsible for resolving these incidents (if exact figures are not available, reasonable estimates can be used).
    • This will produce a rough approximation of the cost of incident management.
  • Multiply the mean time to resolution by 0.9, 0.8, and 0.7, and repeat the process.
  • Demonstrate that, with effective incident management, these are the cost savings that result when the mean time to incident resolution is decreased by 10%, 20%, and 30%.
  • Critical incidents will generally be handled by upper tier staff, but if an effective knowledgebase empowers tier 1 workers to resolve these incidents without escalation, the savings can be felt there too. Compare the FTE cost of a tier 1 technician to that of a tier 2 technician or a tier 3 technician to underscore these benefits.
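
The arithmetic above can be sketched in a few lines of Python. All input figures are hypothetical placeholders, and `incident_cost` is an illustrative helper that assumes a single blended hourly cost for the staff working each incident; substitute numbers from your own ITSM tool.

```python
def incident_cost(mttr_hours, incident_count, hourly_cost):
    """Rough cost: mean time to resolution x incidents x hourly staff cost."""
    return mttr_hours * incident_count * hourly_cost


# Hypothetical inputs; replace with figures from your own ITSM tool.
MTTR_HOURS = 4.0          # mean time to resolution per critical incident
INCIDENTS_PER_YEAR = 50   # number of major/critical incidents
HOURLY_COST = 60.0        # blended hourly cost of the responding staff

baseline = incident_cost(MTTR_HOURS, INCIDENTS_PER_YEAR, HOURLY_COST)
print(f"baseline: ${baseline:,.0f}")

# Repeat the calculation with MTTR reduced by 10%, 20%, and 30%.
for factor in (0.9, 0.8, 0.7):
    reduced = incident_cost(MTTR_HOURS * factor, INCIDENTS_PER_YEAR, HOURLY_COST)
    print(f"MTTR x {factor}: ${reduced:,.0f} (saves ${baseline - reduced:,.0f})")
```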

Define the resources and the buy-in necessary to conduct effective critical/major incident management

Supporting Tool 3.2.b: Populate Info-Tech’s Incident and Problem Management Communication Deck

Resources: In this section of the presentation, outline the resources needed for effective incident management. Using the information you recorded in the incident management SOP, outline the following:

  • Identity of the incident manager and the location of the incident management war room
  • List of critical applications
  • Critical incident management organization chart (tier 2 and 3 staff who will report to the incident manager)
  • List of vendor contacts

Buy-in: Projects without buy-in fail. On this slide outline who will need to approve the new critical incident management process, including the executives responsible for IT (CIO), along with the managers whose staff will occasionally engage in incident response. The process is only as effective as the people who use and implement it.

  • Infrastructure manager
  • Applications manager
  • CIO
  • CEO (if applicable)
  • Service desk staff

Outline the benefits of problem management

Supporting Tool 3.2.c: Populate Info-Tech’s Incident and Problem Management Communication Deck

Effective problem management will reduce incident volume. Use your organization’s data to demonstrate how this will reduce costs.

  • Effective problem management will result in an understanding of the root cause of incidents, and, once change management has taken the identified solution and implemented it, that particular variety of incident should cease to cause interruptions.
  • Quantify this using the ITSM tool data you leveraged in 3.2.a.
  • Multiply the number of incidents by 0.9, 0.8, and 0.7, and then multiply that number by the mean time to resolution. Using FTE cost figures, calculate the total value of problem management if it reduces incident volume by 10%, 20%, or 30%.

“We’re so busy putting out fires that we never stop the fires from happening.” (George Jucan, Founder, Organizational Performance Enablers Network)

Define the resources and the buy-in necessary to conduct effective problem management

Supporting Tool 3.2.d: Populate Info-Tech’s Incident and Problem Management Communication Deck

Resources: problem management is a regular, planned activity. Include information on where the team will meet, who it will be composed of, and the tools they will need to do their jobs. In this section, identify:

  • The problem manager and their duties.
  • The rest of the problem management team: who will meet to review and prioritize problems?
  • What data will the problem management team need access to (incident reports, for example)?
  • Where will the team meet?

Buy-in: without buy-in and cooperation up and down the organizational chart, problem management cannot be effective. Outline whose cooperation is required and what this entails.

  • End users must go through the incident management process and formally report every incident.
  • Support services must create a ticket for each incident and record as much data about them as possible. Remember: garbage in, garbage out!
  • Executives must agree to support the initiative, or else it will fail. Remember: stakeholder buy-in is a top predictor of the success of strategic initiatives.

Outline the benefits of proactive problem management

Supporting Tool 3.2.e: Populate Info-Tech’s Incident and Problem Management Communication Deck

Proactive problem management is the last step in the incident and problem management process, but it has the potential to deliver extraordinary value.

Benefits

The benefits of proactive problem management should be easy to convey.

  • Effective incident management will reduce mean time to resolution.
  • Effective problem management will reduce incident volume.
  • Effective proactive problem management will reduce incident volume even further.
  • Multiply mean time to incident resolution by 0.9, 0.8, and 0.7, and multiply those numbers by incident volume, this time reducing incident volume even further (by factors of 0.6, 0.5, and 0.4), to illustrate the effects of proactive problem management.
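
A sketch of the combined calculation follows. The inputs are hypothetical, and presenting every pairing of the two sets of factors as a grid is an assumption about how the scenarios might be laid out on a slide, not something the deck prescribes.

```python
def annual_cost(mttr_hours, incident_count, hourly_cost):
    """Rough cost: mean time to resolution x incidents x hourly staff cost."""
    return mttr_hours * incident_count * hourly_cost


# Hypothetical inputs; replace with figures from your own ITSM tool.
MTTR_HOURS, INCIDENTS, HOURLY_COST = 4.0, 50, 60.0

baseline = annual_cost(MTTR_HOURS, INCIDENTS, HOURLY_COST)

# Combine faster resolution (0.9, 0.8, 0.7) with lower volume (0.6, 0.5, 0.4).
scenarios = {}
for mttr_factor in (0.9, 0.8, 0.7):
    for volume_factor in (0.6, 0.5, 0.4):
        scenarios[(mttr_factor, volume_factor)] = annual_cost(
            MTTR_HOURS * mttr_factor, INCIDENTS * volume_factor, HOURLY_COST
        )

for (m, v), cost in scenarios.items():
    print(f"MTTR x {m}, volume x {v}: ${cost:,.0f} "
          f"({1 - cost / baseline:.0%} below baseline)")
```

Because the two reductions multiply, even the most conservative pairing in this grid cuts the estimated cost by close to half, which is the point the slide is meant to convey.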

Resources and Buy-in

The proactive problem management team will require resources and buy-in.

  • The problem management team will need the authority to open tickets without incident reports attached to them.
  • They will need access to event logs, which means, in terms of buy-in, that they will need to be able to count on a mature event management process.
  • Stakeholders need to be on board. The managers of the silos need to be willing to give up their staff for a defined period, the change management process will need to accommodate incoming requests, and all stakeholders will need to agree on appropriate event thresholds.

ABI: Always be iterating

The problem and incident management process never ends, and it should always be improving.

  • Remember: a problem and incident management process is only as good as its performance. Regularly track the key performance indicators identified in 1.2.d and 2.2.e, and if your organization is facing persistent shortfalls, work to rectify those.
  • Adjust the size of your working group as needed. If you are experiencing a higher than expected volume of problems (and the resultant backlog is intimidating the problem management team and the executive), it might be worth it to expand your problem management operation.
  • If incident resolution times (or other key performance indicators relating to incident management) are not improving, the incident management practice you have in place is not living up to its promise.

Info-Tech Insight

Show, don’t tell. Track key performance indicators so that if one area of the incident or problem management process is lagging behind the others, you’ll be equipped to explain why resolution is necessary, and what you can expect to gain.

If you want additional support, have our analysts guide you through this phase as part of an Info-Tech workshop

Book a workshop with our Info-Tech analysts:

  • To accelerate this project, engage your IT team in an Info-Tech workshop with an Info-Tech analyst team.
  • Info-Tech analysts will join you and your team onsite at your location or welcome you to Info-Tech's historic Toronto office to participate in an innovative onsite workshop.
  • Contact your account manager (www.infotech.com/account), or email Workshops@InfoTech.com for more information.

The following are sample activities that will be conducted by Info-Tech analysts with your team:

3.1 Set thresholds for event management

The analyst will advise workshop participants on how to leverage event management and set thresholds to proactively stamp out problems. The result will be a list of key performance indicators that can be input into the Proactive Problem Management SOP.

Phase 3 Guided Implementation

Call 1-888-670-8889 or email GuidedImplementations@InfoTech.com for more information.

Complete these steps on your own, or call us to complete a guided implementation. A guided implementation is a series of 2-3 advisory calls that help you execute each phase of a project. They are included in most advisory memberships.

Guided Implementation 3: Engage in proactive problem management

Proposed Time to Completion: 3 weeks

Step 3.1: Predict Problems

Start with an analyst kick-off call:
  • Outline the required inputs for proactive problem management.
Then complete these activities…
  • Identify how to respond to alerts generated by event management.
  • Set thresholds for event management.
With these tools & templates:
  • Problem Management SOP

Step 3.2: Communicate the Importance of Proactivity

Review findings with analyst:
  • Review proactive problem management techniques.
  • Collate and present the visual SOPs.
Then complete these activities…
  • Outline the benefits of incident management to stakeholders in the business.
  • Define the resources and the buy-in necessary to conduct effective critical/major incident management.
  • Outline the benefits of problem management.
  • Define the resources and the buy-in necessary to conduct effective problem management.
  • Outline the benefits of proactive problem management.
With these tools & templates:
  • Incident and Problem Management Communication Deck

Phase 3 Results & Insights

  • Just because a problem has not caused an incident does not mean that it never will. The most effective way to maximize uptime is to prevent interruptions from occurring at all. The business will thank you.

Project step summary

Client Project: Incident and Problem Management

  1. Identify critical incidents.
  2. Develop a critical incident workflow.
  3. Identify problems.
  4. Develop a problem management workflow.
  5. Predict problems.
  6. Communicate the importance of proactivity.

Summary of accomplishment

Knowledge Gained

  • How to separate critical incidents from regular incidents
  • How to detect problems through incident matching
  • How to leverage event management to detect problems before they cause incidents

Processes Optimized

  • Incident management
  • Problem management
  • Proactive problem management

Deliverables Completed

  • List of critical IT services
  • List of key escalation contacts
  • Vendor management details
  • Incident Management Standard Operating Procedure
  • Problem Management Standard Operating Procedure

Research contributors and experts

Hardy Baker, Incident and Problem Manager
Waste Management

Hardy Baker is an IT professional with more than 20 years of experience managing incidents and problems for a large corporation. He has been responsible for a variety of initiatives at Waste Management including a problem dashboard and the use of social media for incident communication.

Rishi Bhargava, Co-Founder
Demisto Inc.

Rishi Bhargava is the co-founder of Demisto Inc., a security operations platform that uses a ChatOps interface to combine intelligent automation and collaboration to help security teams respond to threats. Before Demisto, Rishi served as GM & VP, Software Defined Datacenter for Intel Security Solutions.

Rob England, Managing Director
Two Hills Ltd, Blogger at Itskeptic.org

Rob England is a New Zealand-based consultant who specializes in IT management, strategy, governance, and practices. He is the author of numerous books and articles, and his web presence is well established (you may know him better as the “ITSkeptic”).

Steven Ingram, Data Engineer
Wave HQ

Steven Ingram is an IT professional with over 17 years of integrating users and technology. He specializes in integrating people, processes, and tools with the purpose of driving effective analysis that will produce concrete results for the business.

George Jucan, Founder
Organizational Performance Enablers Network

George Jucan is an internationally recognized project management expert, currently leading the Canadian Committee at International Organization of Standardization (ISO) for the establishment of Project, Programme and Portfolio Management family of standards. He is well known as a successful project management consultant, speaker at public events, trainer, and author of high-impact project management articles.

Rick Moroz, Associate Director, Information Systems
University of Guelph

Rick Moroz is an IT professional with significant experience in the non-profit and charitable sector, along with project management and privacy regulation. In his current role, Rick is responsible for staffing and budgeting for technical services for Guelph’s Alumni Affairs and Development Office.

Bibliography

ASQ. “Fishbone (Ishikawa) Diagram.” ASQ. N.d. Web. November 24, 2014.

“Creating Problem Tickets.” Boston University Information Services and Technology. N.d. Web. November 24, 2016.

Draper, Steve. “Correlation and causation.” University of Glasgow. October 21, 2014. Web. November 24, 2016.

England, Rob. “Measuring Problem Management.” The IT Skeptic. February 1, 2014. Web. November 24, 2016.

England, Rob. Owning ITIL. Two Hills. 2009.

England, Rob. “Rob England: Proactive Problem Management.” December 5, 2012. Web. November 24, 2016.

Galley, Mark. “Improving on the Fishbone: Effective Cause-and-Effect Analysis: Cause Mapping.” ThinkReliability. 2007. Web. November 24, 2016.

Hall, Mark. “Root Cause Analysis Concepts and Best Practices for IT Problem Managers.” Sologic. April 2010. Web. November 24, 2016.

Higginson, Simon. “Four Problem Management SLAs you really can’t live without.” The ITSM Review. February 28, 2013. Web. November 24, 2016.

“How to use the Fishbone Tool for Root Cause Analysis.” Centers for Medicare and Medicaid Services. N.d. Web. November 24, 2016.

“Incident and Problem Management Dashboard.” IBM Knowledge Center. 2009. Web. November 24, 2016.

Isbell, Douglas, and Don Savage. “Mars Climate Orbiter Failure Board Releases Report, Numerous NASA Actions Underway in Response.” National Aeronautics and Space Administration. November 10, 1999. Web. November 24, 2016.

“ITIL—A guide to change management.” UCISA. N.d. Web. November 24, 2016.

“ITIL Incident Management.” BMC. March 21, 2016. Web. November 24, 2016.

IT Infrastructure Library. Best Practice for Service Delivery. Office of Government Commerce. 2001.

“ITSM Roles.” University of Chicago. N.d. Web. November 24, 2016.

Kendrick, Stuart. “Problem Management Dashboard.” Skendric. N.d. Web. November 24, 2016.

Kim, Gene. “Unplanned Work is Silently Killing IT Departments.” Computerworld. April 10, 2006. Web. November 24, 2016.

Kim, Gene et al. The Phoenix Project: A novel about IT, DevOps, and Helping Your Business Win. IT Revolution Press. 2013.

Kloppenborg, Timothy J. and Debbie Tesch. “How Executive Sponsors Influence Project Success.” MIT Sloan Management Review. March 16, 2015. Web. November 24, 2016.

Mars Climate Orbiter Mishap Investigation Board. “Mars Climate Orbiter Mishap Investigation Board: Phase I Report.” National Aeronautics and Space Administration. November 10, 1999. Web. November 24, 2016.

Organization for Economic Co-operation and Development. “Average annual hours actually worked per worker.” OECD.Stat, 2016.

“Problem Management.” Cisco Systems Inc. 2007. Web. November 24, 2016.

“Problem Management.” ISACA. N.d. Web. November 24, 2016.

“Requirements Management: A Core Competency for Project and Program Success.” Project Management Institute. August 2014. Web. November 24, 2016.

Ritchie, George. “Problem Management—Why and How?” Serio Ltd. 2005. Web. November 24, 2015.

“Root Cause Analysis—The Loss of the Mars Climate Orbiter.” ThinkReliability. June 2016. Web. November 24, 2016.

Sawyer, Kathy. “Mystery of Orbiter Crash Solved.” Washington Post, October 1, 1999. Web. November 24, 2016.

Serrat, Olivier. “The Five Whys Technique.” Asian Development Bank. February 2009. Web. November 24, 2016.

SysAid. “ITSM Basics: A Simple Introduction to Incident Management.” SysAid Blog. January 27, 2015. Web. September 19, 2015.

“The High Cost of Low Performance.” Project Management Institute. February 2014. Web. November 24, 2016.

Topalovic, Drago. “Major Incident Management—when the going gets tough…” Advisera. July 2015. Web. November 24, 2016.

Vigen, Tyler. “Spurious Correlations.” N.d. Web. November 24, 2016.

White, Steve, and Robert Kolaczynski. “Proactive problem management – business need or a necessity.” ServiceTalk. Spring 2012. Web. November 24, 2016.

xMatters. “Major Incident Management Trends.” xMatters. 2015. Web. November 24, 2016.

xMatters. “User Profile: Cox Enterprises.” xMatters. September 14, 2015. Web. November 24, 2016.

xMatters. “Best Practices in Major Incident Management Featuring NBN.” October 27, 2015. Web. November 24, 2016.

Zitek, Neven. “ITIL Reactive and Proactive Problem Management: Two sides of the same coin.” Advisera. N.d. Web. November 24, 2016.

About Info-Tech

Info-Tech Research Group is the world’s fastest-growing information technology research and advisory company, proudly serving over 30,000 IT professionals.

We produce unbiased and highly relevant research to help CIOs and IT leaders make strategic, timely, and well-informed decisions. We partner closely with IT teams to provide everything they need, from actionable tools to analyst guidance, ensuring they deliver measurable results for their organizations.

What Is a Blueprint?

A blueprint is designed to be a roadmap, containing a methodology and the tools and templates you need to solve your IT problems.

Each blueprint can be accompanied by a Guided Implementation that provides you access to our world-class analysts to help you get through the project.

Need Extra Help?
Speak With An Analyst

Get the help you need in this 3-phase advisory process. You'll receive 8 touchpoints with our researchers, all included in your membership.

Guided Implementation #1 - Identify and manage major/critical incidents
  • Call #1 - Outline the potential benefits of a critical incident management procedure.
  • Call #2 - Review the results of the voting exercise and the list of exceptions.

Guided Implementation #2 - Develop problem management procedures
  • Call #1 - Outline the benefits of a problem management regimen and the required resources.
  • Call #2 - Review the separated lists of incidents and problems.
  • Call #3 - Review the incident matching procedure.

Guided Implementation #3 - Engage in proactive problem management
  • Call #1 - Outline the required inputs for proactive problem management.
  • Call #2 - Review proactive problem management techniques.
  • Call #3 - Collate and present the visual SOPs.

Authors

John Annand

Fred Chagnon

Jeremy Roberts

Contributors

  • Hardy Baker, Incident and Problem Manager, Waste Management
  • Rishi Bhargava, Co-Founder, Demisto Inc.
  • Rob England, Managing Director, Two Hills Ltd.
  • Steven Ingram, Data Engineer, Wave HQ
  • George Jucan, Founder, Organizational Performance Enablers Network
  • Rick Moroz, Associate Director, Information Systems, University of Guelph