DoIT Operational Framework - Section 7.0 - Problem Management

DoIT's ITIL-based process for Problem Management.

7.1 The Value of Problem Management

DoIT’s Operational Framework is an ITIL-based set of policies and procedures that aims to improve the operational processes used to support DoIT managed services. This chapter addresses Problem Management.

7.1.0 Goal

A Problem is the cause of one or more Incidents. Problem Management is the process by which DoIT manages the lifecycle of Problems.

 

The goal of Problem Management is to improve services by preventing Incidents and Problems from occurring and promptly responding to Problems as they are discovered.

 

It is often the case that the root cause of Problems is unknown at the time of discovery. The Problem Management process includes investigation and analysis in order to discover root cause.

 

DoIT uses a Problem Management process led by the IT Service Management group to identify, investigate, resolve, and prevent the recurrence of events and conditions that negatively impact customers and users.

7.1.2 Scope

Problem Management applies to any events or incidents that impact or are likely to impact services that DoIT hosts, contracts, or is otherwise obligated to support.

 

DoIT may still offer support for Problems outside of this scope, although support in such cases may be limited and/or on a case-by-case basis.

7.1.3 Benefits

Effective Problem Management provides many benefits to the organization. Some of these benefits include:

·         Incident reduction- Proactively responding to previous Problems will prevent similar incidents in the future.

·         Increased uptime for services- Understanding the root cause of Problems allows Service Owners to proactively prevent unplanned downtime for their services.

·         Risk mitigation- Effectively managing Problems throughout their lifecycle has the added benefit of reducing risk and enhancing the reliability of DoIT supported services.

·         Reduction in effort- For instance, by automating a process in response to a previous Problem.

·         Faster resolution- Managing Problems during and after their discovery allows the Help Desk to be aware of incoming Incidents that may be correlated to an open Problem, thus shortening response and resolution times.

 

Problem Management also provides metrics for measuring service levels, which can in turn be used to improve those services. For example:

·         Number of Problems solved proactively, measured in increased uptime for services.

·         Percentage of Incidents resolved using a Knowledge Base article, measured in decreased resolution time.
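As an illustration only, the second metric could be computed from a list of resolved Incident records as sketched below. The `ResolvedIncident` record and its `resolved_with_kb` field are hypothetical and do not represent WiscIT fields.

```python
from dataclasses import dataclass


@dataclass
class ResolvedIncident:
    """Hypothetical record of a resolved Incident (not a WiscIT schema)."""
    ticket_id: str
    resolved_with_kb: bool  # True if a Knowledge Base article was used


def kb_resolution_percentage(incidents: list[ResolvedIncident]) -> float:
    """Return the percentage of Incidents resolved using a KB article."""
    if not incidents:
        return 0.0  # avoid division by zero when nothing was resolved
    with_kb = sum(1 for incident in incidents if incident.resolved_with_kb)
    return 100.0 * with_kb / len(incidents)
```

Tracking this value over time, alongside resolution times, is one way to see whether Knowledge Base coverage is improving.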

 

7.2 Problem Management Process

7.2.0 Overview

The Problem Management process described in this section addresses problems that relate to DoIT supported services as well as services provided by external entities that DoIT has an obligation to support.

Problem handling via the SNCC is available 24x7, 365 days a year.

Scalable Problem Response

DoIT’s Problem Response process is fundamentally scalable with a hierarchical escalation path for all situations. The severity of any ongoing situation is constantly assessed and can be escalated vertically at any point in the Problem’s lifecycle.

 

This scalable response allows DoIT to quickly and efficiently respond to situations as they are first discovered and then to scale that response as the severity of the situation grows. It is not necessary to first categorize the situation into predefined categories. Rather, DoIT’s response to any situation is immediate and then ongoing response efforts are scaled to the severity of the Problem.

 

See Section 7.3 for more information on Problem Escalation.

7.2.1 Problem Discovery and Reporting

Many Problems begin their lifecycle as Incidents (see OF- Chapter 4 for more information on Incident Management). In addition to the Incident discovery and reporting channels, Problems are often communicated directly to SNCC operators.

 

Monitoring events are also a primary method of discovering Problems. See OF- Chapter 6 for more information on Event Monitoring & Management.

Problem Discovery and Reporting Methods

·         Incident escalation (from Help Desk or support teams)

·         Reported directly to SNCC staff by a technologist

·         Detected by monitoring event

·         Novel methods of discovery such as Artificial Intelligence and/or trend analysis.

 

7.2.2 Problem Logging

DoIT strives to acknowledge Problems as soon as they are discovered. Problems are to be logged within 10 minutes of discovery.

 

Upon Problem discovery, a Problem Ticket is created in WiscIT and assigned to the SNCC-Sysops team. It is the responsibility of the Problem Ticket creator to:

1.      Verify accuracy of Short and Detailed Description.

2.      Use known information to determine which Technical Service is the most appropriate and link that Technical Service to the Problem Ticket.

3.      Perform impact analysis to determine affected Technical Services, if any.

4.      Once the Problem is assigned to the SNCC-Sysops Team, SNCC will create Outage(s) for any and all affected Technical Services.

 

Note: Because Problems often entail an unknown root cause, initial information captured in this step is a “best guess”. Classification is subject to change as more information becomes known.

Any existing Incident Tickets related to the Problem are to be linked to the Problem Ticket at the time of creation.
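The logging requirements above can be pictured with a minimal sketch. The `ProblemTicket` class below is a hypothetical in-memory stand-in, not the WiscIT data model; only the 10-minute target and the SNCC-Sysops assignment come from this section.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

LOGGING_TARGET = timedelta(minutes=10)  # logging target stated in Section 7.2.2


@dataclass
class ProblemTicket:
    """Hypothetical stand-in for a WiscIT Problem Ticket."""
    short_description: str
    detailed_description: str
    technical_service: str  # "best guess" at the most appropriate service
    discovered_at: datetime
    logged_at: datetime
    assigned_team: str = "SNCC-Sysops"
    affected_services: list[str] = field(default_factory=list)
    linked_incident_ids: list[str] = field(default_factory=list)

    def logged_within_target(self) -> bool:
        """True if the ticket was logged within 10 minutes of discovery."""
        return self.logged_at - self.discovered_at <= LOGGING_TARGET
```

Because initial classification is a best guess, fields such as `technical_service` are expected to change as more information becomes known.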

 

7.2.3 Problem Response

7.2.3.0 Problem Response Overview

In responding to a situation that entails impact to DoIT service(s) or heightened risk of impact to DoIT service(s), the following process is to be adhered to:

1.      Document Problem Details in DoIT ITSM Tool (WiscIT)

2.      Assess the severity of the situation

3.      Assess and document any Life or Safety Considerations

4.      Identify the Primary Technologist.

5.      Identify Situation Manager. (Refer to Section 7.3 for Situation Manager Escalation guidelines)

6.      Contact Primary Technologist and/or Situation Manager to begin the Problem Investigation, Analysis, and Resolution Process, if necessary.

7.      Assemble Response Team

8.      Plan the Response

9.      Communicate the response internally

10.  Communicate with customers and external stakeholders

11.  Define the action plan

12.  Perform Post-Problem Actions

 

See Section 7.4 for more information on specific Roles and Responsibilities for Problem Response staff.

7.2.3.1 Problem Response Checklist and Essential Resources

The Problem Response Checklist is a resource that can be used by any person in charge of responding to any Problem. It provides an action-oriented list of steps to be taken during Problem Response.

 

Problems with minor severity may only have a single technician responding while Problems with major severity may require a Problem Response team.

 

The Problem Response Checklist and other essential Problem Response resources can be found under KB #97634 in the DoIT Knowledge Base.

7.2.3.2 Problem Response Urgency

Problem Response urgency is determined by the support level of the Service impacted. Support levels vary for different services and are documented in the Configuration Management Database. Support levels may also vary depending on whether it is during business hours or outside of business hours.

 

7.2.3.3 Problem Investigation and Analysis

Once notified of an existing problem, the responding technologist(s) will document steps taken to diagnose the issue and identify the root cause. Diagnosis and root cause may be communicated to SNCC and thus documented by the SNCC staff.

 

Technologists and Situation Managers may need to request resources from other DoIT service areas during the investigation and analysis phase of Problem Management.

 

The technologist will document any workarounds or known errors that are discovered at any time throughout the investigation and analysis process. If no workaround or known error exists, the technologist will document this by entering “None” in the workarounds field of the Problem Ticket.

 

If the responding technologist or Situation Manager is unable to reach a satisfactory resolution, the problem is to be escalated to the Duty Manager and/or executive management. (Refer to Section 7.3 for escalation guidelines.)

7.2.3.4 Problems with Unknown Root Cause

When the Situation Manager is unable to identify the root cause of a Problem and/or resolve the Problem, the Problem is escalated to a Tiger Team.

 

See Section 7.7 for more information on Tiger Teams.

 

7.2.3.5 Communication of Problem Status

Technologists and/or Situation Managers who are responding to an open Problem will communicate the Problem status to SNCC Operators at least every two hours or at an otherwise agreed upon interval, as determined by the support level of the Service.

 

The technologist or Situation Manager will update the Problem ticket with any new information and/or communicate updates to the SNCC, to be updated by the SNCC Operator.

 

SNCC Staff are responsible for updating open Outages with relevant information, as communicated by the Situation Manager, Situation Technologist, or Duty Manager.
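As a rough sketch of the status interval described in this section, the next update time can be derived from the last update and either the two-hour default or an agreed interval. The function names are illustrative assumptions and not part of any DoIT tool.

```python
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_STATUS_INTERVAL = timedelta(hours=2)  # default from Section 7.2.3.5


def next_status_update_due(
    last_update: datetime,
    agreed_interval: Optional[timedelta] = None,
) -> datetime:
    """Return when the next status update to SNCC Operators is due."""
    interval = DEFAULT_STATUS_INTERVAL if agreed_interval is None else agreed_interval
    return last_update + interval


def status_update_overdue(
    last_update: datetime,
    now: datetime,
    agreed_interval: Optional[timedelta] = None,
) -> bool:
    """True if no update has been given within the applicable interval."""
    return now > next_status_update_due(last_update, agreed_interval)
```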

 

7.2.3.6 Problem Resolution and Root Cause Evaluation

Once a resolution has been discovered and implemented, the root cause and any steps taken to restore service will be documented in the Problem Ticket. The Cause Code will also be documented in the Problem Ticket.

 

Resolving staff will update any relevant support documentation (such as KB articles or Known Issues) as needed.

 

7.2.4 Problem Review

Once the Problem has been resolved, the following Problem review steps are completed:

·         The lead technologist or Situation Manager documents ‘Problem Review’ with any notes, comments, analysis, etc. regarding the Problem.

·         The ITSM Event Management team documents ‘Monitoring Review’ regarding addition or modification of Event Monitoring in place as it impacts the Problem.

·         SNCC and the Help Desk document ‘Documentation Review’ regarding any necessary additions or modifications to KBs and handling docs as they impact the Problem.

·         The Service Owner or other service-specific management documents ‘Unit Review’ with any notes, comments, analysis, etc. regarding the Problem.

·         ITSM Problem Management reviews the ‘Technical Service (probable root cause)’ and updates as necessary.

·         ITSM Configuration Management reviews and links peripheral Technical Service(s) impacted.

·         If warranted, recommend PIR.

 

7.2.5 Post-Incident Review (PIR)

DoIT IT Service Management will coordinate Post-Incident Review (PIR) as determined necessary by DoIT Management. (Facilitators from other DoIT groups may be called upon to lead this effort.) An assigned staff member from IT Service Management will perform the following steps:

·         Advise on whether PIR is warranted.

·         Schedule the review.

·         Involve an outside DoIT facilitator if appropriate.

·         Include the following topics in the discussion:

o   What happened?

o   Issues involved?

o   Suggested action items to help resolve identified issues.

·         Document the proceedings of the review and produce one or more PIR documents.

·         Track the suggested action items from PIRs in the Continual Improvement Register (CIR) until addressed or determined as not feasible by DoIT Directors.

The resulting documents may include a high-level review for departments and an in-depth version to help drive improvement within DoIT. Assign and follow up on the action items, enlisting the help of other managers as needed.

 

7.3 Problem Escalation

Problems can be escalated by the Situation Technologist, Situation Manager, Duty Manager, SNCC, or Incident Manager at any time during their lifecycle.

Note: The timeframes for escalation as described below are maximums. Problems with significant impact are to be escalated sooner than the stated time limits or as the situation warrants.

7.3.0 Situation Manager Escalation

Problems that are unresolved within 2 hours of discovery are escalated to a Situation Manager. Problems impacting Services that do not have after hours support are escalated to the Duty Manager after 2 hours instead.

 

See OF Section 7.4.2 for more information on Roles and Responsibilities of a Situation Manager.

7.3.1 Duty Manager Escalation

Problems that are unresolved within 8 hours of discovery are escalated to a Duty Manager. In addition, Problems that have a high impact and/or urgency are also escalated to the Duty Manager.

 

See Section 7.4.3 for more information on Roles and Responsibilities of a Duty Manager.

7.3.2 Vertical Escalation to Executive Management

SNCC Staff, Situation Managers, or Duty Managers may escalate a Problem to executive management using the Vertical Escalation process. Vertical Escalation notifies the executive management team of the Problem via SMS notification. See KB #3632 for more information on Vertical Escalation.

 

See Appendix B for Vertical Escalation Guidelines including Elapsed Times for escalation.

 

7.3.3 Tiger Team Escalation

If a Situation Manager is unable to ascertain root cause within a reasonable time frame, the Problem may be escalated to management, who may then request a Tiger Team to determine root cause. See OF Section 7.7 for more information regarding a Tiger Team.

 

7.4 Problem Response Roles and Responsibilities

7.4.0 Technologist

The situation technologist is the responding technologist with the most direct knowledge of the problematic systems, equipment, infrastructure, etc.

 

Technologists are responsible for performing investigation, analysis, troubleshooting, and other resolution efforts for all Problems that are within the scope of their support unit. Technologists are expected to work in coordination with other technologists, Situation Managers, Duty Managers, et al.

Technologist Duties

·         Respond to issues when a high level of technical knowledge is required or a serious issue needs resolution as soon as possible

·         Provide technical leadership

·         Perform technical quality assurance

·         When involved with issue resolution, help SNCC staff analyze severity and determine whether a Situation Manager is warranted

·         Improve proactive operational performance by identifying and implementing tools

·         Seek anomalies and refer to SNCC staff

·         Design and configure major changes and implementations

·         Perform Quality Assurance for device configuration

·         Participate in capacity planning; perform management processes

7.4.1 SNCC Staff

SNCC staff are available 24/7/365 and are the primary handlers of Problems. SNCC staff are responsible for opening Problems upon reporting or discovery and managing Problems throughout their lifecycle. Problems created by the Help Desk are escalated to SNCC staff for Problem management.

SNCC Staff Duties

·         Initially respond to and record service and server issues reported via monitoring

·         Initially respond to service and server issues reported by the Help Desk

·         Resolve service support issues quickly

·         Troubleshoot to determine cause

·         Escalate issues appropriately (see Section 7.3)

·         Engage Situation Manager if network component is not involved

7.4.2 Situation Manager

Situation Manager is a management role invoked in longer or more severe Outages. This role ensures management involvement in communications, coordination, and business decisions encountered during issue resolution and post-incident review.

 

The tasks associated with the Situation Manager role are routinely performed by an appropriate lead technologist and require management attention only when the incident warrants. When possible, the lead technologist should start communications and Problem resolution before a Situation Manager becomes involved with a significant issue.

Situation Manager Duties

·         Lead communications during a Problem with an Outage (See Section 7.6 for Problem Communication Guidelines)

·         Lead Situation Mobilization

o   Identify the issue response team

o   Make sure team members understand the issue

o   Define Communication Plan

·         Lead Issue Resolution Efforts

o   Meet with technologists to analyze options, business risks, and technical risks.

o   Determine how to proceed, weighing all concerns.

o   Make sure multiple technologists are not working on conflicting paths.

o   If necessary, stop troubleshooting to ensure that team members fully understand the issue and that all efforts are coordinated.

o   Study the business issues related to the issue to make resolution decisions. For example, a security-compromised system should be fully investigated before being put back on the network, but two failed core network devices should be put back into production as quickly as possible.

7.4.3 Duty Manager

The Duty Manager is an on-call position filled on a rotating schedule by SEO managers and the SEO Director. Duty Managers have many of the same responsibilities as a Situation Manager, although in most situations these responsibilities will be shared between the Duty Manager and the active Situation Manager.

 

Duty Managers serve as the after hours Situation Manager for the many DoIT services that do not regularly staff on-call positions as well as when the identified Situation Manager is not available.

 

Furthermore, since Duty Managers hold management positions within DoIT, they are capable of responding to situations in ways that Situation Managers cannot, such as using discretionary budgets for additional resources and activating staff in off-hours for situations that warrant it.

7.4.4 Authority in Charge

The Authority in Charge is a generic term for the role responsible for coordinating and leading DoIT’s response to a specific Problem.

 

The Authority in Charge may refer to a Situation Manager, Duty Manager, Tiger Team Manager, Director, Executive Management, or any other position that may lead a situation response.

 

For any given situation there is only one Authority in Charge. All decisions and communications are to come from the Authority in Charge, either directly or indirectly.

7.4.5 Tiger Team Manager

See Section 7.7.2 for information regarding Tiger Team Manager Responsibilities.

 

7.5 Outage Management

Any event that includes a Service Outage is automatically defined as a Problem and is to be managed in accordance with the Problem Management processes as described in the Operational Framework.

7.5.0 Outage Definition

Any event in which any of the following criteria is or will be met constitutes an IT Service Outage for a DoIT managed/monitored IT Service:

·         The IT Service is unavailable to its customers or a subset of its customers.

·         The IT Service is unable to function as designed and installed.

·         IT Service performance has degraded to a degree that is perceivable to End Users.

·         The ability of University stakeholders to do their business is impaired.
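The definition above is an "any of" test: a single criterion is enough to constitute an Outage. A minimal sketch follows, assuming a hypothetical `ServiceCondition` snapshot whose field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ServiceCondition:
    """Hypothetical snapshot of an IT Service's current or expected state."""
    unavailable_to_any_customers: bool = False
    not_functioning_as_designed: bool = False
    degradation_perceivable_to_users: bool = False
    university_business_impaired: bool = False


def is_outage(condition: ServiceCondition) -> bool:
    """An Outage exists if any one of the Section 7.5.0 criteria is met."""
    return any((
        condition.unavailable_to_any_customers,
        condition.not_functioning_as_designed,
        condition.degradation_perceivable_to_users,
        condition.university_business_impaired,
    ))
```

Under this reading, a service that is fully available but noticeably slow to End Users still counts as an Outage.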

7.5.1 Outage Discovery and Reporting

Planned Outages

Planned Outages are documented and associated with Change Requests. Prior to Change Request approval, impact analysis is performed by Change Management and any associated Outages are created and posted to the Outages Page at that time.

 

Unplanned Outages

Unplanned Outages are identified in the Problem Logging stage of the Problem Lifecycle. See Section 7.2.2 for more information. Problems with Open Outages are managed by the SNCC until the Outage is closed and/or the Problem is resolved.

 

7.6 Problem Communication Guidelines

7.6.0 Communication Scope

The level of communication necessary for any given Problem is based on the impact and severity of the Problem. In order to gauge Problem impact and severity, the following factors should be considered:

·         Number of users affected

·         Outage Duration (expected or actual)

·         Importance of affected service

·         External events that may be occurring at the time of the Outage

 

For Problems with large impact and/or severity, multiple channels of communication need to be opened and maintained. At a minimum, communication lines must exist within the responding group and there must be appropriate external communication with customers, stakeholders, etc.

 

7.6.1 Communications Roles and Responsibilities

Situation Manager Communications

The process of communicating to stakeholders varies based on the scope and audience of the service, as determined by the Situation Manager.

 

Situation Managers are responsible for the following communications as necessary:

·         Write and/or edit information distributed to affected users.

·         Maintain direct communication to internal DoIT staff, via the DoIT Operations Microsoft Team.

·         If necessary, Situation Managers will write a standard statement for the Help Desk to communicate to campus users if many calls are being received. This ensures a standard message to all affected users.

·         Communicate with DoIT management at the Situation Manager’s discretion (at minimum per Vertical Escalation guidelines)

·         Communicate with DoIT customer technology managers, DoIT staff, and vendors as appropriate.

DoIT Communications

DoIT Communications will lead communication efforts for outages that are campus-wide in scope or at the discretion of the Authority in Charge.

SNCC Operators Communications

SNCC Operators are often at the center of the communication process, though their role is primarily in coordinating service restoration efforts. The DoIT Operations Microsoft Team is the preferred method of communication with the SNCC, although email and phone are also utilized.

7.6.2 Communication Standards

In general, communications regarding Problems should include the following content. Content should be concise, confident, and solution-oriented and inform on:

·         What is happening and/or going to happen

·         Timeline for the fix

·         Why it is relevant to the user

·         The general groups of users that may be most affected

·         What DoIT is doing to fix the issue

·         The risks—and what DoIT will do to mitigate them

 

7.7 Tiger Team

In the event that the Situation Manager is unable to lead issue resolution efforts so as to identify the root cause of the Problem and/or resolve the Problem, the Situation Manager may ask a DoIT Director or CIO/Deputy CIO to request a Tiger Team.

 

DoIT executive team members may request a Tiger Team at their discretion, at any time.

7.7.0 Goal

The purpose of the Tiger Team is to ensure that all necessary resources (human and otherwise) are available to quickly and efficiently determine the root cause for a Root Cause Undetermined Problem.

7.7.1 Scope

In order to ensure that root causes are identified as quickly as possible, the Tiger Team will have members from any and all units that could potentially play a role in diagnosing, investigating, and communicating the Problem to stakeholders.

 

Managers from represented units will be responsible for ensuring that Tiger Team activities are given sufficient priority so as to not conflict with existing planned or unplanned work in those units.

7.7.2 Tiger Team Manager Responsibilities

It is the responsibility of the Tiger Team Manager to lead the following activities in the event of a Problem with unknown root cause:

1.      Convene the Tiger Team.

2.      Assess the existing information documented in the Problem Ticket.

3.      Draft plan to further investigate root cause(s) of Problem with an Outage.

4.      Aggregate outage diagnostics.

5.      Communicate additional information with stakeholders (see Section 7.7.4).

6.      Repeat steps 2-5 until root cause can be determined.

7.      Determine root cause(s) for Problem with an Outage.

8.      Document relevant information in the Problem Ticket.

9.      If Problem is not resolved already, transition Situation Manager to appropriate Service Team.

7.7.3 Membership

See Appendix C for more information about Tiger Team Membership.

7.7.4 Tiger Team Communication Plan

The Tiger Team Manager is responsible for carrying out the Communication Plan for the Tiger Team instance.

 

The Communication Plan consists of:

1.      Suggested weekly progress updates to the Tiger Team Requestor. These updates can be more frequent or less frequent, at the discretion of the Tiger Team Manager.

2.      Communication between Tiger Team and Stakeholders as needed.

3.      Collaboration with DoIT Communications for Problems with Outages that have a scope large enough to warrant it.

4.      Final Report- A report to be sent to the Tiger Team Requestor, to include:

·         Problem Overview- This is a concise overview of the work completed by the Tiger Team, leading up to the problem resolution and/or workaround.

·         Problem Root Cause- A detailed description of what the actual root cause of the Problem was.

·         Problem Resolution- A detailed description of the resolution and/or any workarounds put in place to mitigate the Problem.

·         Gap Analysis and Recommendations- An analysis of any gaps, failures in process or procedure, or other findings that may help to facilitate or prevent a similar issue with identifying root cause or resolution of an Incident/Problem in the future.

 


  

7.8 Appendices

Appendix A- Situation Response- All Roles

This document is intended to be an exhaustive list of potentially necessary roles used in responding to a situation that has impacted or could potentially impact one or more DoIT services.

Given the scalable nature of DoIT’s Situation Response, the type and severity of any given situation may call for some or all of the following roles.

 

It is at the discretion of the Authority in Charge during an active situation to determine which of these roles are needed and to assign them to individuals.

Authority In Charge

The Authority in Charge is a generic term for the role responsible for coordinating and leading DoIT’s response to a specific situation.

 

The Authority in Charge may refer to a Situation Manager, Duty Manager, Tiger Team Manager, Director, Executive Management, or any other position that may lead a situation response.

 

For any given situation there is only one Authority in Charge. All decisions and communications are to come from the Authority in Charge, either directly or indirectly.

Situation Manager

Situation Manager is a management role invoked in longer or more severe outages. This role ensures management involvement in communications, coordination, and business decisions encountered during issue resolution and post-incident review.

 

The tasks associated with the Situation Manager role are routinely performed by an appropriate lead technologist and require management attention only when the incident warrants. The lead technologist should start communications and issue resolution before a Situation Manager becomes involved with a significant issue.

Situation Technologist

The situation technologist is the responding technologist with the most direct knowledge of the problematic systems, equipment, infrastructure, etc.

 

The situation technologist also provides the Situation Manager with wide-ranging technical information and advice.

Environmental Specialist

For any incident where the environmental infrastructure is involved, this role should be assigned. It will generally fall to a member of the Data Center team.

Communicator

Multiple channels of communication need to be opened and maintained, depending on need and appropriateness. At a minimum, communication lines must exist within the responding group, and there must be appropriate external communication with customers, top management, etc. SNCC operators are often at the center of the communication process, though their active involvement in service restoration may inhibit effectiveness.

DoIT Communications will lead communications for situations that are campus-wide in scope.


The DoIT Outage page is one primary channel for communicating with campus users.

Scribe

Scribes are responsible for tracking and documenting all situation response activities. This documentation serves both as a resource for the Authority in Charge during an active situation and as material for review following the resolution of the situation.


The WiscIT Problem record is the official and primary record of the handling of the situation. The Problem record is internal to DoIT and so can, and should, contain technical details.

Runner

People to actually move throughout the data center or other physical locations as needed - for example, double checking actual power off or powering machines back up or delivering physical resources to responding groups.

External liaisons

These could be wide-ranging. They might include situation managers for affected services as established by the Operational Framework, Service Team leaders, application administrators, or individuals with particular business relationships that could prove valuable.

Researcher

Someone to look things up and dig out information, in resources ranging from CMDB to Google to the phone book.

Door Guard

A person to control access to the Data Center or other physical resources. The Door Guard ensures that the area remains secure and that overcrowding is avoided.

Spotter/QA officer

It can be very valuable to have someone who just monitors progress and provides an additional pair of eyes to double-check for missteps. This person may be present physically and/or monitoring the situation virtually, via MS Teams or Out of Band Chat.

Human Resources Manager

It's this person's role to keep an eye on people issues—stress, burnout, hunger, thirst, exhaustion, etc., and advise the Situation Manager as needed.

 

The Human Resources Manager will also advise on situations that require non-standard staffing.

 

Appendix B- Vertical Escalation Guidelines

The table below lists the guidelines and contacts used by SNCC to provide Vertical Escalation/notification of a Problem with an open Outage.

 

The elapsed time listed for escalations to Situation Managers, Duty Managers, and COO/Core Directors are maximums. If the situation dictates, escalation can and should be made before the listed times.

Elapsed Time | Escalation Path
10 minutes | SNCC Level 1 escalates to the appropriate support technologist identified for the service or component. If no after-hours support is defined or no after-hours technologist is available, SNCC escalates directly to the Situation Manager for the affected service.
2 hours | SNCC staff ensures that Situation Managers are contacted not only for the primary service experiencing the Outage but also for all dependent services. For example, an LDAP authentication outage will cause an Office 365 Outage, so the Office 365 Situation Manager should be contacted along with the LDAP Situation Manager.
8 hours | SEO Duty Manager
12 hours | COO and Core Directors. (Note: The SEO Duty Manager will usually be responsible for escalation to the COO and Core Directors, although a reminder to the Duty Manager by SNCC would likely be appreciated.) The SEO Duty Manager may request that the SNCC operator trigger Vertical Escalation on their behalf. See KB #3632 for information on how to trigger Vertical Escalation. "It is the responsibility of the Chief Operating Officer to escalate outage notifications to campus and UW System leadership as necessary and appropriate. The severity of the outage and the breadth of impacted service(s) will determine who should be notified and how often updates should occur."
At discretion of SNCC Operator, Situation Manager, or Duty Manager | Any situation may be escalated prior to the times listed here if the situation warrants it. The times listed are maximums for when the situation must be escalated.
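The elapsed-time guidelines above can be read as a simple lookup. The sketch below is illustrative only: the `ESCALATION_STEPS` name, the target labels, and the function are assumptions, and the thresholds are the maximums from this appendix, so real escalation may happen earlier at the discretion of the SNCC Operator, Situation Manager, or Duty Manager.

```python
from datetime import timedelta

# (maximum elapsed time, escalation target) pairs taken from Appendix B,
# listed in ascending order of elapsed time.
ESCALATION_STEPS = [
    (timedelta(minutes=10), "Support technologist"),
    (timedelta(hours=2), "Situation Manager(s)"),
    (timedelta(hours=8), "SEO Duty Manager"),
    (timedelta(hours=12), "COO and Core Directors"),
]


def escalations_due(elapsed: timedelta) -> list[str]:
    """Return every escalation target whose maximum elapsed time has passed."""
    return [target for limit, target in ESCALATION_STEPS if elapsed >= limit]
```

For example, an Outage that has been open for three hours should already have reached both the support technologist and the relevant Situation Manager(s).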


 

Appendix C- Tiger Team Membership

The Tiger Team is comprised of representatives (managers or their designees) from the following areas:

 

Member | Tiger Team Role
ITSM- Incident Manager/ITSM OnCall | Tiger Team Manager; Service Management Liaison
ITSM- Event Management | Event Management Liaison
Campus LAN/WAN | Campus LAN/WAN and Firewall Support Liaison
Network- Operation Engineering | NS-OpEng Support Liaison
Solutions Engineering- Linux/AIX | Linux/AIX Server Support Liaison
Solutions Engineering- Windows | Windows Server Support Liaison
Applications & Systems Engineering | VM Support Liaison
EBS/ERP Admin | Liaison for SIS, HRS, Peoplesoft, etc.
Communications | Coordinate broader communications to stakeholders as needed
Identity & Access Management | Identity and Authentication Support Liaison
Cybersecurity | Coordinate security response and information sharing
Helpdesk | Helpdesk/User Services Liaison

 

Additional Tiger Team members may be added at the discretion of the Tiger Team Manager.



Keywords: Problem Management Operational Framework Opfram SNCC Duty Manager Outage Response Review Outage Escalation Communication
Doc ID: 45471
Owned by: Ramsay B. in ITSM
Created: 2014-12-05
Updated: 2022-08-23
Sites: DoIT IT Service Management, DoIT Staff, Systems & Network Control Center