DoIT Operational Framework - Section 4.0 - Incident Management

Section 4.0 of the Operational Framework for viewing and printing. (Note: This section of the Operational Framework is currently under review by the DoIT IT Service Management Team)

In the Operational Framework, UW-Madison DoIT aims to improve the operational processes used to support DoIT provided services. This portion of the framework addresses Incident Management. The overriding goal of these processes and related standards is to ensure best-possible operational efficiency, service, and uptime.

Incident Management involves restoring normal service operation as quickly as possible following service-affecting incidents, minimizing adverse impact on business operations, and maintaining the best possible levels of service quality and availability.

DoIT uses an Incident Management process led by the DoIT Systems and Network Control Center (SNCC) and DoIT Help Desk to identify, investigate, resolve, and prevent the recurrence of events and conditions that negatively impact customers and users.

4.1 Incident Management Standards
4.2 Incident Response Guidelines
4.3 Incident Response Documentation
4.4 Guidelines for Escalating Incidents to Management
4.5 Incident Transfer Escalation Guidelines
4.6 Roles and Responsibilities
4.7 Situation Manager Standards
4.8 Post-Incident Review
4.9 Handling an Emergency or Declaring a Disaster
4.10 Handling Security Incidents
4.11 Incident Management Workflow

Section 4.1 - Incident Management Standards

The Incident Management process described in this section addresses: 1) incidents that DoIT has a service obligation to fix; 2) incidents that end-users may call the Help Desk about; or 3) incidents that may indicate larger issues affecting multiple customers.

Once a user, DoIT technologist, customer administrator/engineer, or monitoring sensor reports an incident, SNCC and/or Help Desk staff members work quickly to resolve it by:

  • Troubleshooting using online tools and resources and following documented procedures to restore service.
  • Assigning a priority level and following the guidelines for incident management
  • Engaging appropriate technical staff for the issue, if necessary
  • Communicating via the Service Support Conference Room with key staff from the Help Desk, SNCC, and (already engaged) DoIT technologists. (See 4.1.1)
  • Engaging a Situation Manager for the issue if the priority warrants doing so or if guidelines dictate
  • Creating a record in DoIT's Incident Management System and documenting all relevant information about the incident including diagnosis, chat logs, and solution notes, or forwarding the record to technical staff, as guidelines dictate

4.1.1 Service Support Conference Room

DoIT technologists who discover an incident, or are actively responding to one, are encouraged to engage with key SNCC and Help Desk personnel via the Service Support Conference Room (internal DoIT document). This provides an efficient way to update the organization on the status of an incident, to enlist the aid of incident response personnel in updating the incident record in DoIT's official Incident Management System, and to contact other personnel who may be required during incident handling.

 Section 4.2 - Incident Response Guidelines

Work days are Monday through Friday; business hours are 7:45 a.m. to 4:30 p.m. CT, excluding campus holidays. Incident acknowledgment refers to an appropriate DoIT staff member (Level 1 support, Level 2 support, Level 3 engineering, Management, etc.) taking ownership of an incident record within DoIT's incident management system. Response refers to the beginning of resolution actions appropriate to that record and its priority. Listed issue resolution times are goals and are not guaranteed.
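The business-hours definition above can be expressed as a small check. This is a minimal sketch, not part of DoIT's tooling: the function name `is_business_time` and the `holidays` parameter are hypothetical, timestamps are assumed to already be in Central Time, and the campus holiday calendar must be supplied by the caller.

```python
from datetime import date, datetime, time

# Business hours as stated in this section (Central Time).
BUSINESS_START = time(7, 45)   # 7:45 a.m.
BUSINESS_END = time(16, 30)    # 4:30 p.m.


def is_business_time(moment: datetime, holidays: frozenset = frozenset()) -> bool:
    """Return True if `moment` falls within business hours.

    Assumptions (not from the framework): `moment` is already expressed in
    Central Time, and `holidays` is a caller-supplied set of campus holiday
    dates (datetime.date objects).
    """
    if moment.weekday() >= 5:          # Saturday is 5, Sunday is 6
        return False
    if moment.date() in holidays:      # campus holidays are excluded
        return False
    return BUSINESS_START <= moment.time() <= BUSINESS_END
```

A caller could use a check like this to decide whether the business-hours clock for a Priority 2, 3, or 4 incident is currently running.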

Determining Incident Priority

Incident priority is determined by considering both impact and urgency.

Impact is a measure of the effect of an incident on business processes. It is partly determined by the number of users affected, but the criticality of the service to the business of the university is also a factor. Incidents that may compromise health and safety have an elevated impact.

Urgency is a measure of how long it will be until an incident has a significant impact. For example, the failure of part of the enrollment system may not be as urgent if the next enrollment period is several weeks away. The availability of a workaround, alternative, or circumvention mitigates urgency.

Priority may be raised if a customer has a contract with a DoIT service group for accelerated response on a component, system, or service. This can be accomplished in the incident management system by setting the impact and/or urgency to a higher level than the situation would otherwise warrant.

A matrix of impact and urgency determines priority:

Incident Priority (rows: impact; columns: urgency, expressed as a resolution goal)

Impact | Defer/hold for monitoring/tracking | 5 business days or less | 2 business days or less | 4 hours or less
Multiple units affected / Critical business impact / Safety | Priority 4 | Priority 2 | Priority 1 | Priority 1
Single unit affected / Important business impact | Priority 4 | Priority 3 | Priority 2 | Priority 1
Few individuals affected / Minimal business impact | Priority 4 | Priority 3 | Priority 3 | Priority 2
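
The matrix above amounts to a two-key lookup. The following is an illustrative sketch only: the enum names (`Impact`, `Urgency`) and the function `incident_priority` are hypothetical and are not taken from DoIT's Incident Management System. The cell values are copied from the matrix.

```python
from enum import Enum


class Impact(Enum):
    MULTIPLE_UNITS = "Multiple units affected / Critical business impact / Safety"
    SINGLE_UNIT = "Single unit affected / Important business impact"
    FEW_INDIVIDUALS = "Few individuals affected / Minimal business impact"


class Urgency(Enum):
    DEFER = "Defer/hold for monitoring/tracking"
    FIVE_DAYS = "5 business days or less"
    TWO_DAYS = "2 business days or less"
    FOUR_HOURS = "4 hours or less"


# Priority values copied cell by cell from the impact/urgency matrix.
PRIORITY_MATRIX = {
    Impact.MULTIPLE_UNITS: {
        Urgency.DEFER: 4, Urgency.FIVE_DAYS: 2,
        Urgency.TWO_DAYS: 1, Urgency.FOUR_HOURS: 1,
    },
    Impact.SINGLE_UNIT: {
        Urgency.DEFER: 4, Urgency.FIVE_DAYS: 3,
        Urgency.TWO_DAYS: 2, Urgency.FOUR_HOURS: 1,
    },
    Impact.FEW_INDIVIDUALS: {
        Urgency.DEFER: 4, Urgency.FIVE_DAYS: 3,
        Urgency.TWO_DAYS: 3, Urgency.FOUR_HOURS: 2,
    },
}


def incident_priority(impact: Impact, urgency: Urgency) -> int:
    """Look up the priority (1 to 4) for an impact/urgency pair."""
    return PRIORITY_MATRIX[impact][urgency]
```

For example, `incident_priority(Impact.SINGLE_UNIT, Urgency.TWO_DAYS)` evaluates to 2, matching the "Single unit affected" row of the matrix.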

Priority 1 (Severe)
Guidelines: Acknowledgment of a Priority 1 incident occurs within 10 minutes. Priority 1 incident handling applicable hours are 24 x 7 x 365. Initial response from the assignee to the customer typically is via a phone call to gather more information and to formulate an agreed-upon strategy for a resolution. The issue resolution goal is four elapsed hours (work will continue after 4:30 or on weekends) or as specified in the customer’s Platform Hosting Agreement contract covering the requested service. DoIT will escalate the issue to a situation manager if unresolved within two hours of the diagnosis start time.

Priority 2 (Standard)
Guidelines: Acknowledgment of a Priority 2 will occur within one business day. The issue resolution goal is two business days. Applicable hours of Priority 2 handling are normal business hours. DoIT will escalate the incident to the group manager or situation manager if the targets for response and resolution are not met.

Priority 3 (Routine)
Guidelines: Acknowledgment of a Priority 3 should occur within 2 business days. Applicable hours for work on a Priority 3 are normal business hours. While the issue resolution goal is 5 business days, if follow-up or long-term work on an underlying issue is required the incident should be re-classified as a Priority 4.

Priority 4 (Hold)
Guidelines: Acknowledgment of a Priority 4 should occur within 2 business days. Applicable hours for work on a Priority 4 are normal business hours. Because this priority is used for tracking long-term work on an incident, for indicating that work on the incident is on hold, or for keeping an incident open for monitoring, the issue resolution goal may vary greatly depending on the specific details of the incident. An incident should be re-classified from Priority 4 back to a higher priority whenever the initial reason for the Priority 4 classification allows. One example: if an incident requires an entity outside of DoIT to complete a lengthy task before DoIT can take actions to fully resolve the incident, the incident should be marked Priority 4 until the outside entity completes the work, and then assigned an appropriate higher priority as soon as DoIT is free to work on incident resolution.
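
The per-priority guidelines above can be summarized as a small lookup table, together with the two-hour Situation Manager rule for Priority 1. This is a sketch under stated assumptions: the names `PriorityGuideline`, `GUIDELINES`, and `p1_needs_situation_manager` are hypothetical, and the values are transcribed from the guidelines text rather than from any DoIT system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PriorityGuideline:
    name: str
    acknowledge_within: str
    resolution_goal: Optional[str]   # None means the goal varies by incident
    applicable_hours: str


# Values transcribed from the Priority 1 to 4 guidelines in this section.
GUIDELINES = {
    1: PriorityGuideline("Severe", "10 minutes", "4 elapsed hours", "24 x 7 x 365"),
    2: PriorityGuideline("Standard", "1 business day", "2 business days", "normal business hours"),
    3: PriorityGuideline("Routine", "2 business days", "5 business days", "normal business hours"),
    4: PriorityGuideline("Hold", "2 business days", None, "normal business hours"),
}

# A Priority 1 incident is escalated to a Situation Manager if it is still
# unresolved two hours after the diagnosis start time.
P1_SITUATION_MANAGER_AFTER = timedelta(hours=2)


def p1_needs_situation_manager(diagnosis_start: datetime, now: datetime,
                               resolved: bool = False) -> bool:
    """Return True if an unresolved Priority 1 incident has passed the two-hour mark."""
    return (not resolved) and (now - diagnosis_start >= P1_SITUATION_MANAGER_AFTER)
```

Priority 1 targets are measured in elapsed time, so plain datetime subtraction is enough here; targets for Priorities 2 through 4 are stated in business days and would need a business-calendar calculation, which this sketch does not attempt.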




 Section 4.3 - Incident Response Documentation

Documentation Repositories

The primary location of all SNCC response documentation is the System Network Control Center Support Knowledgebase, a site within the DoIT Support Knowledgebase. This multi-group knowledgebase provides procedural and technical documentation for SNCC, Help Desk, and other groups. The URL for the SNCC Knowledgebase is:

Other documentation sources are:

Documentation of Incidents

The official record of an incident and the steps taken to resolve it are maintained in DoIT's Incident Management System.


 Section 4.4 - Guidelines for Escalating Incidents to Management

The table below lists the guidelines and contacts used by SNCC to escalate incidents and notify management of an unresolved Priority 1 incident.

Table 4-2 Guidelines for Escalating to Management

Elapsed Time | Priority 1 Incident
10 minutes | Systems & Network Control Center Operators handle the technical escalation appropriately: to Network Operational Engineering (OpEng) if a network component is involved, or otherwise to the relevant support technologist. If the service is not supported after hours or no after-hours technologists are available, the Operators escalate the incident directly to the Situation Manager.
2 hours | SNCC staff ensures that the Situation Manager for the primary service experiencing the outage and the Situation Managers for all dependent services are contacted. For example, an LDAP authentication outage will cause a WiscMail Impact one outage, so the WiscMail Situation Manager should be contacted along with the LDAP Situation Manager. Refer to Identify the Situation Manager in the Situation Manager Standards (Section 4.7) portion of the Operational Framework for more information about contacting the appropriate Situation Manager.
8 hours | SNCC ensures that the SEO Duty Manager is involved. See the DoIT Knowledgebase article (Document 2816) for contact information.
12 hours | The SNCC handling operator consults with the Duty Manager regarding engaging the CIO office.
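
The elapsed-time thresholds in the table above can be modeled as an ordered list of steps. This is a minimal sketch: the names `ESCALATION_STEPS` and `escalations_due` are hypothetical, and the action strings are shortened paraphrases of the table rows, not official wording.

```python
from datetime import timedelta

# (threshold, action) pairs paraphrased from the escalation table,
# ordered by elapsed time since the Priority 1 incident began.
ESCALATION_STEPS = [
    (timedelta(minutes=10), "SNCC operators perform technical escalation"),
    (timedelta(hours=2), "Contact Situation Managers for primary and dependent services"),
    (timedelta(hours=8), "Ensure the SEO Duty Manager is involved"),
    (timedelta(hours=12), "Consult the Duty Manager about engaging the CIO office"),
]


def escalations_due(elapsed: timedelta) -> list:
    """Return every escalation action whose threshold has been reached."""
    return [action for threshold, action in ESCALATION_STEPS if elapsed >= threshold]
```

Three hours into an unresolved Priority 1 incident, `escalations_due(timedelta(hours=3))` returns the first two actions, which indicates that both the technical escalation and the Situation Manager contact should already have happened.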

 

Section 4.5 - Incident Transfer Escalation Guidelines

SNCC and Help Desk staff should follow these guidelines for incident transfer escalations. Situation Managers may need to make occasional exceptions.

4.5.1 Priority 1 Case Escalation

  • If a network component is not involved in the service outage, SNCC staff should contact the appropriate service support technologist by using contact information defined in the Configuration Management Database (CMDB)
  • If a network component is involved in the service outage and SNCC staff cannot contact Op Engineering, contact the appropriate network engineer and the Op Engineering manager
  • Priority 1 case escalation always requires direct interactive contact (voice, chat, email with reply) with the area to which the case is being escalated. Do not assume that others can work on a case until you confirm it interactively
  • Help Desk always escalates to SNCC Level 1 unless response documentation specifies otherwise
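
The SNCC routing rules above can be sketched as a small decision function. The function name `p1_escalation_contacts`, its parameters, and the contact labels are hypothetical. Contacting Op Engineering first when a network component is involved is taken from Table 4-2, and actual contact details live in the CMDB and response documentation.

```python
def p1_escalation_contacts(network_involved: bool, openg_reachable: bool = True) -> list:
    """Return who SNCC staff should contact for a Priority 1 escalation.

    Illustrative only: the labels are placeholders, not real contact records.
    """
    if not network_involved:
        # No network component: use the technologist listed in the CMDB.
        return ["service support technologist (per CMDB)"]
    if openg_reachable:
        # Network component involved: Op Engineering is the first contact.
        return ["Op Engineering"]
    # Op Engineering cannot be reached: fall back to these two contacts.
    return ["appropriate network engineer", "Op Engineering manager"]
```

Whichever contact is returned, the escalation only counts once the contact has been confirmed interactively, as the third guideline requires.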

4.5.2 Priority 2, 3 and 4 Case Escalation

  • SNCC and Help Desk staff can escalate problems directly to the Level 3 Support Technologist if that escalation path is outlined in the response documentation
  • SNCC and Help Desk staff should monitor cases dispatched to Level 3 Support Technologists to ensure that they have been accepted. SNCC and Help Desk staff should contact queue managers if cases are not being accepted

4.5.3 Guidelines for Level 2 and Level 3 Staff Members Who Have Accepted or Been Assigned an Incident

As a staff member who has agreed to work on, or been assigned to, an incident, you are responsible for handling it appropriately.

For Priority 1 cases, SNCC staff will retain ownership of the incident record. SNCC staff will continue to check in and provide input, but it is your responsibility to solve the problem and ensure SNCC staff can keep the incident record up to date. This is essential to ensure that appropriate customer updates can be provided. If you are overwhelmed during a Priority 1 outage and need assistance, ask SNCC staff to call a Situation Manager or ask for more help.

For Priority 2, 3, and 4 cases, accept the case to acknowledge that you are aware of it and will work on its resolution in accordance with the priority definitions.

If you are no longer able to work on an incident, notify the Situation Manager. If additional people become involved in solving an incident, notify SNCC.


 Section 4.6 - Roles and Responsibilities

The DoIT Incident Management Process involves the following roles and responsibilities.

Table 4-3 Incident Management Roles and Responsibilities

Role

Responsibilities

Examples

Help Desk

  • Initially respond to and record end-user issues

  • Resolve issues quickly

  • Fulfill end-user requirements for resolving issues of different types and severities

  • Escalate issues appropriately using hierarchical and functional escalation guidelines

  • Troubleshoot to determine cause

  • Communicate with end-users regarding incident status

  • Provide review to ensure that all incident tickets, regardless of assignment, are closed

  • Respond to incoming client support phone calls and email

  • Follow documented support procedures

  • Develop quick resolution processes (with help from support levels 2/3) for common issues

  • Dispatch to support and engineering as needed and ensure appropriate response to the ticket’s impact level

  • Post online and telephone announcements


SNCC Level 1

  • Initially respond to and record service and server issues reported via monitoring

  • Initially respond to service and server issues reported by the Help Desk

  • Resolve service support issues quickly

  • Troubleshoot to determine cause

  • Escalate issues appropriately using functional and hierarchical escalation guidelines

  • Engage Situation Manager if network component is not involved

  • Follow documented support procedures

  • Respond to monitoring alerts

  • Manage vendor issues, such as hardware service calls

Operational Engineering Level 2 and Field Services


  • Resolve standard technical issues escalated from SNCC Level 1

  • Aid Help Desk and SNCC Level 1 staff in analyzing impact

  • Seek and identify anomalies

  • Follow functional and hierarchical escalation guidelines

  • Engage Situation Manager as needed

  • Perform system and network troubleshooting not requiring Network Services Level 3

  • Troubleshoot connectivity issues

Level 3 Support Technologist

  • Respond to issues when high level of technical knowledge is required or ASAP resolution to serious issue is needed

  • Provide technical leadership

  • Perform technical quality assurance

  • When involved with issue resolution, help SNCC staff analyze severity and determine whether a Situation Manager is warranted

  • Improve proactive operational performance by identifying and implementing tools

  • Seek anomalies and refer to SNCC staff

  • Design and configure major changes and implementations

  • Perform Quality Assurance for device configuration

  • Participate in capacity planning; perform management processes

  • Participate in system and implementation design and change planning

  • Respond to major issues

  • Provide support if expertise not available through SNCC staff

  • Ensure that new implementations address ongoing operations issues such as monitoring, technical support process development, etc

System Network Control Center – Systems Management

  • Maintain SNCC support response documentation

  • Ensure Help Desk documentation for user support is consistent with SNCC customer agreements

  • Coordinate post incident review (PIR)

  • Coordinate follow up on PIR action items

  • Update per-customer support response information

  • Schedule and lead post-incident review

Situation Manager

(see section 4.7 for more details)

  • Manage serious issues

  • Implement communication plan for serious issues

  • Ensure appropriate resources are available for issue resolution

  • Assess resolution/strategy

  • Provide management leadership for specific issues; i.e., whether to bring up a down system ASAP or perform outage-lengthening forensic analysis of the issue

Process Owners (Help Desk Manager, SNCC Manager)

  • Improve incident management process; i.e., reporting from different views, use of data, etc.

  • Analyze issue/cause trends

  • Identify and lead implementation of necessary improvements

  • Ensure that the Help Desk and SNCC maintain updated troubleshooting procedures, tools, and Knowledgebase documents

  • Track issues delegated to others

  • Update DoIT management on the state of incident management

  • Provide management leadership for general processes




 Section 4.7 - Situation Manager Standards

Situation Manager is a management role invoked in longer or more severe outages. This role ensures management involvement in communications, coordination, and business decisions encountered during issue resolution and post-incident review.

The tasks associated with the Situation Manager role are routinely performed by an appropriate lead technologist and require management attention only when the incident warrants. The lead technologist should start communications and issue resolution before a Situation Manager becomes involved with a significant issue.

4.7.1 Situation Manager Involvement

SNCC should contact a Situation Manager as early as practical about any Priority 1 situation that has the potential to last more than two hours or in any other situation that SNCC or a lead technologist considers severe enough to warrant immediate management attention.

4.7.2 Identify the Situation Manager

Each service will designate a Situation Manager call list that will include at least one primary and one backup Situation Manager. The list is maintained in the SNCC knowledgebase.

The SNCC staff member who has accepted the escalation of the issue should make this call, which should FOLLOW technical/horizontal escalation (i.e., get a technologist working on the issue first).

4.7.3 Situation Manager Responsibilities

The Situation Manager protects and balances the interests of DoIT and customers, and works to achieve at least an interim solution to issues as quickly as possible. If you are named the Situation Manager for a significant issue, the following is expected of you:

Lead communication

If necessary, write a standard statement for the Help Desk to communicate to campus users if many calls are being received. This ensures a standard message to all affected users. Maintain direct communication to internal DoIT staff. Include concise, confident, and solution-oriented information on:
  • What is happening and/or going to happen
  • Timeline for the fix
  • Why it is relevant to the user
  • The general groups of users that may be most affected
  • What DoIT is doing to fix the issue
  • The risks—and what DoIT will do to mitigate them
Verify that interim and post-issue communication to all affected users is completed.
  • Write and/or edit information distributed to affected users. (DoIT Communications should be involved in notification regarding campus-wide issues, per existing processes)
  • Communicate with DoIT management at your discretion (at minimum per vertical escalation guidelines)
  • Communicate with DoIT customer technology managers, DoIT staff, and vendors as appropriate

Lead situation mobilization

  • Identify the issue response team
  • Make sure team members understand the issue
  • Secure a war room or other facilities if necessary

Lead issue resolution decisions

  • Meet with technologists to analyze options, business risks, and technical risks. Determine how to proceed, weighing all concerns
  • Make sure multiple technologists are not working on conflicting paths.
  • If necessary, stop troubleshooting to ensure that team members fully understand the issue and that all efforts are coordinated
  • Study the business issues related to the issue to make resolution decisions. For example, a security-compromised system should be fully investigated before being put back on the network, but two failed core network devices should be put back into production as quickly as possible

 

Section 4.8 - Post-Incident Review

SNCC-SM will coordinate Post-Incident Review (PIR) as determined necessary by DoIT Management. (Facilitators from other DoIT groups may be called upon to lead this effort.) An assigned staff member from SNCC-SM will perform the following steps:

  • Determine if PIR is required
  • Schedule the review. Involve an outside DoIT facilitator as appropriate
  • Include the following topics in the discussion:
        - What happened
        - Issues involved
        - Resulting action items
  • Document the proceedings of the review and produce two versions of a PIR document: a high-level review for departments and an in-depth version to help drive improvement within DoIT. Assign and follow up on the action items, enlisting the help of other managers as needed

 

Section 4.9 - Handling an Emergency or Declaring a Disaster

The DoIT CIO's office has the authority to declare disasters or emergencies. Refer to the University Response Plan or the DoIT Disaster Recovery Plan.

Disaster Recovery
University Response Plan - The University Response Plan contains sensitive information making it unavailable for public distribution.

 

Section 4.10 - Handling Security Incidents

Refer to the Urgent Response documentation provided by the Security Team.

 

Section 4.11 - Incident Management Workflow

The following chart outlines the standard incident management workflow.

Incident Management Workflow





Keywords: operational framework incident management print version entire section outage
Doc ID: 11040
Owner: Andy B.
Group: DoIT IT Service Management
Created: 2009-06-15 19:00 CDT
Updated: 2015-07-08 11:24 CDT
Sites:BOREAS, DoIT Help Desk, DoIT IT Service Management, DoIT Staff, Learn@UW Utility, Middleware, Network Services, Systems & Network Control Center, Systems Engineering, Systems Engineering and Operations, WiscMail, WiscNet