DoIT Operational Framework - Section 4.0 - Incident Management
In the Operational Framework, UW-Madison DoIT aims to improve the operational processes used to support DoIT-provided services. This portion of the framework addresses Incident Management. The overriding goal of these processes and related standards is to ensure the best possible operational efficiency, service, and uptime.
Incident Management involves restoring normal service operation as quickly as possible following service-affecting incidents, minimizing adverse impact on business operations, and maintaining the best possible levels of service quality and availability.
DoIT uses an Incident Management process led by the DoIT Systems and Network Control Center (SNCC) and DoIT Help Desk to identify, investigate, resolve, and prevent the recurrence of events and conditions that negatively impact customers and users.
Section 4.1 - Incident Management Standards
The Incident Management process described in this section addresses: 1) incidents that DoIT has a service obligation to fix; 2) incidents that end-users may call the Help Desk about; or 3) incidents that may indicate larger issues affecting multiple customers.
Once a user, DoIT technologist, customer administrator/engineer, or monitoring sensor reports an incident, SNCC and/or Help Desk staff members work quickly to resolve it by:
- Troubleshooting using online tools and resources and following documented procedures to restore service.
- Assigning a priority level and following the guidelines for incident management
- Engaging appropriate technical staff for the issue, if necessary
- Communicating via the Service Support Conference Room with key staff from the Help Desk, SNCC, and (already engaged) DoIT technologists. (See 4.1.1)
- Engaging a Situation Manager for the issue if the priority warrants doing so or if guidelines dictate
- Creating a record in DoIT's Incident Management System and documenting all relevant information about the incident including diagnosis, chat logs, and solution notes, or forwarding the record to technical staff, as guidelines dictate
4.1.1 DoIT Operations Support Channel
DoIT technologists who discover an incident, or are actively responding to an incident, are encouraged to engage with key SNCC and Help Desk personnel via the [Link for document 91687 is unavailable at this time] (internal DoIT document). This provides an efficient method of updating the organization on the status of an incident, enlisting the aid of incident response personnel in updating the record of the incident within DoIT's official Incident Management System, and contacting other personnel who may be required during incident handling.
Section 4.2 - Incident Response Guidelines
Work days are Monday through Friday; business hours are 7:45 a.m. to 4:30 p.m. CT, excluding campus holidays. Incident acknowledgment refers to the act of an appropriate DoIT staff member (Level 1 support, Level 2 support, Level 3 engineering, Management, etc.) taking ownership of an incident record within DoIT's incident management system. Response refers to the beginning of resolution actions appropriate to that record and its priority. Listed issue resolution times are goals and are not guaranteed.
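The business-hours definition above can be sketched as a small check. This is an illustrative sketch only: the function name, the `campus_holidays` parameter, and the choice to treat 4:30 p.m. as exclusive are assumptions, and the timestamp is assumed to already be expressed in Central Time.

```python
from datetime import date, datetime, time

BUSINESS_START = time(7, 45)  # 7:45 a.m. CT, as stated in the guidelines
BUSINESS_END = time(16, 30)   # 4:30 p.m. CT, as stated in the guidelines


def is_business_time(moment: datetime, campus_holidays: frozenset = frozenset()) -> bool:
    """Return True if `moment` (assumed to be Central Time) is within business hours.

    `campus_holidays` is a hypothetical set of `date` objects supplied by the caller;
    the framework does not list the holidays themselves.
    """
    if moment.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return False
    if moment.date() in campus_holidays:
        return False
    # Assumption: start is inclusive and end is exclusive.
    return BUSINESS_START <= moment.time() < BUSINESS_END
```

A check like this matters mainly for Priority 2, 3, and 4 incidents, whose targets are counted in business days; Priority 1 handling applies around the clock.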
Determining Incident Priority
Incident priority is determined by considering both impact and urgency.
Impact is a measure of the effect of an incident on business processes. It is partly determined by the number of users affected, but the criticality of the service to the business of the university is also a factor. Incidents that may compromise health and safety have an elevated impact.
Urgency is a measure of how long it will be until an incident has a significant impact. For example, the failure of part of the enrollment system may not be as urgent if the next enrollment period is several weeks away. The availability of a workaround, alternative, or circumvention mitigates urgency.
Priority may be raised if a customer has a contract with a DoIT service group for accelerated response on a component, system, or service. This can be accomplished in the incident management system by setting the impact and/or urgency to a higher level than the situation warrants.
A matrix of impact and urgency determines priority:
Incident Priority (Impact by Urgency) | Urgency - Resolution goal: Defer/hold for monitoring/tracking | Urgency - Resolution goal: | Urgency - Resolution goal: 2 business days or less | Urgency - Resolution goal: 4 hours or less |
---|---|---|---|---|
Impact - Multiple units affected / Critical business impact / Safety | Priority 4 | Priority 2 | Priority 1 | Priority 1 |
Impact - Single unit affected / Important business impact | Priority 4 | Priority 3 | Priority 2 | Priority 1 |
Impact - Few individuals affected / Minimal business impact | Priority 4 | Priority 3 | Priority 3 | Priority 2 |
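The impact and urgency matrix above can be expressed as a simple lookup. The sketch below is illustrative and is not DoIT's incident management system: the enum and member names are hypothetical, since the source identifies urgency columns only by their resolution goals (one of which is blank in the source).

```python
from enum import IntEnum


class Impact(IntEnum):
    FEW_INDIVIDUALS = 0  # Few individuals affected / Minimal business impact
    SINGLE_UNIT = 1      # Single unit affected / Important business impact
    MULTIPLE_UNITS = 2   # Multiple units affected / Critical business impact / Safety


class Urgency(IntEnum):
    # Hypothetical names; values index the matrix columns left to right.
    DEFER = 0   # Resolution goal: defer/hold for monitoring/tracking
    LOW = 1     # Column whose resolution goal is blank in the source
    MEDIUM = 2  # Resolution goal: 2 business days or less
    HIGH = 3    # Resolution goal: 4 hours or less


# Each row lists priorities in column order (DEFER, LOW, MEDIUM, HIGH).
PRIORITY_MATRIX = {
    Impact.MULTIPLE_UNITS: (4, 2, 1, 1),
    Impact.SINGLE_UNIT: (4, 3, 2, 1),
    Impact.FEW_INDIVIDUALS: (4, 3, 3, 2),
}


def incident_priority(impact: Impact, urgency: Urgency) -> int:
    """Look up the priority (1 is most severe) for an impact/urgency pair."""
    return PRIORITY_MATRIX[impact][urgency]
```

Under this sketch, a contract for accelerated response would be modeled by passing a higher `Impact` or `Urgency` value than the situation alone warrants, which mirrors how the text describes raising priority.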
- Priority 1 (Severe)
- Guidelines: Acknowledgment of a Priority 1 incident occurs within 10 minutes. Priority 1 incident handling applicable hours are 24 x 7 x 365. Initial response from the assignee to the customer typically is via a phone call to gather more information and to formulate an agreed-upon strategy for a resolution. The issue resolution goal is four elapsed hours (work will continue after 4:30 p.m. or on weekends) or as specified in the customer's Platform Hosting Agreement contract covering the requested service. DoIT will escalate the issue to a Situation Manager if unresolved within two hours of the diagnosis start time.
- Priority 2 (Standard)
- Guidelines: Acknowledgment of a Priority 2 will occur within one business day. The issue resolution goal is two business days. Applicable hours of Priority 2 handling are normal business hours. DoIT will escalate the incident to the group manager or situation manager if the targets for response and resolution are not met.
- Priority 3 (Routine)
- Guidelines: Acknowledgment of a Priority 3 should occur within 2 business days. Applicable hours for work on a Priority 3 are normal business hours. While the issue resolution goal is 5 business days, if follow-up or long-term work on an underlying issue is required the incident should be re-classified as a Priority 4.
- Priority 4 (Hold)
- Guidelines: Acknowledgment of a Priority 4 should occur within 2 business days. Applicable hours for work on a Priority 4 are normal business hours. Because this priority is used for tracking long-term work on an incident, to indicate the work for the incident is on hold, or to keep an incident open for monitoring, the issue resolution goal may vary greatly depending on specific details about the incident. An incident should be re-classified from Priority 4 back to a higher priority as soon as the initial reason for the Priority 4 classification allows. One example: if an incident requires an entity outside of DoIT to complete a lengthy task before DoIT can take actions to fully resolve the incident, the incident should be marked Priority 4 until the outside entity is able to complete the work, and then given an appropriate higher priority as soon as DoIT is free to work on incident resolution.
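The per-priority targets listed above can be collected into one table, together with the two-hour Priority 1 escalation rule. This is a minimal sketch with hypothetical class, field, and function names; the values are taken from the guidelines, and a customer's Platform Hosting Agreement may override the Priority 1 resolution goal.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class PriorityGuideline:
    label: str
    acknowledge_within: str
    resolution_goal: Optional[str]  # None where the goal varies by incident
    around_the_clock: bool          # True for 24 x 7 x 365 handling


GUIDELINES = {
    1: PriorityGuideline("Severe", "10 minutes", "4 elapsed hours", True),
    2: PriorityGuideline("Standard", "1 business day", "2 business days", False),
    3: PriorityGuideline("Routine", "2 business days", "5 business days", False),
    4: PriorityGuideline("Hold", "2 business days", None, False),
}

# Priority 1 incidents go to a Situation Manager if unresolved at this point.
P1_SITUATION_MANAGER_AFTER = timedelta(hours=2)


def needs_situation_manager(priority: int, since_diagnosis_start: timedelta,
                            resolved: bool) -> bool:
    """Apply only the time-based Priority 1 rule; other triggers are not modeled."""
    return (priority == 1 and not resolved
            and since_diagnosis_start >= P1_SITUATION_MANAGER_AFTER)
```

The sketch deliberately leaves out the Priority 2 rule (escalation when response and resolution targets are missed), since evaluating it requires business-day arithmetic.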
Section 4.3 - Incident Response Documentation
Documentation Repositories
The primary location of all SNCC response documentation is the System Network Control Center Support Knowledgebase, a site within the DoIT Support Knowledgebase. This multi-group knowledgebase provides procedural and technical documentation for SNCC, Help Desk, and other groups. The URL for the SNCC Knowledgebase is:
Other documentation sources are:
- DoIT's Configuration Management Database for technical contacts for servers and applications
- WiscNIC - https://aants.net.wisc.edu/cgi-bin/WiscnicSearch.cgi for customer contact information, IP-related network support information, etc.
- SNCC Internal KB https://kb.wisc.edu/sncc/internal/
- Help Desk Internal KB https://kb.wisc.edu/internal/
The official record of an incident and the steps taken to resolve it are maintained in DoIT's Incident Management System.
Section 4.4 - Guidelines for Escalating Incidents to Management
The table below lists the guidelines and contacts used by SNCC to escalate incidents and notify management of an unresolved Priority 1 incident.
Table 4-2 Guidelines for Escalating to Management
Elapsed Time | Priority One Incident |
---|---|
10 Minutes | Systems & Network Control Center Operators handle the technical escalation appropriately: to Network Operational Engineering (OpEng) if a network component is involved or to the relevant support technologist. If the service is not supported after hours or there are no after-hours technologists available, the Operators escalate the incident directly to the Situation Manager. |
2 hours | SNCC staff ensures that the Situation Manager for the primary service experiencing the outage, and situation managers for all dependent services are contacted. For example, an LDAP authentication outage will cause a WiscMail Impact one outage. The WiscMail Situation Manager should be contacted along with the LDAP Situation Manager. Refer to Identify the Situation Manager in the Situation Manager Standards (Section 4.7) portion of the Operational Framework for more information about contacting the appropriate Situation Manager. |
8 hours | SNCC ensures SEO Duty Manager is involved. See DoIT Knowledge base article [Link for document 2816 is unavailable at this time] for contact information. |
12 hours | SNCC handling operator consults with Duty Manager regarding engaging the CIO office. |
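The elapsed-time thresholds in Table 4-2 can be sketched as an ordered list that is filtered by how long a Priority 1 incident has been unresolved. The names below are hypothetical and the action strings are abbreviated paraphrases of the table rows, not official wording.

```python
from datetime import timedelta
from typing import List

# (elapsed-time threshold, abbreviated action) pairs, in the order of Table 4-2.
ESCALATION_STEPS = [
    (timedelta(minutes=10),
     "Technical escalation to OpEng or the relevant support technologist"),
    (timedelta(hours=2),
     "Contact Situation Managers for the primary and all dependent services"),
    (timedelta(hours=8),
     "Ensure the SEO Duty Manager is involved"),
    (timedelta(hours=12),
     "Consult the Duty Manager about engaging the CIO office"),
]


def steps_due(elapsed: timedelta) -> List[str]:
    """Return every escalation action whose threshold has been reached."""
    return [action for threshold, action in ESCALATION_STEPS if elapsed >= threshold]
```

For example, calling `steps_due(timedelta(hours=3))` returns the first two actions, since the 10-minute and 2-hour thresholds have both passed.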
Section 4.5 - Incident Transfer Escalation Guidelines
SNCC and Help Desk staff should follow these guidelines for incident transfer escalations. Situation Managers may need to make occasional exceptions.
4.5.1 Priority 1 Case Escalation
- If a network component is not involved in the service outage, SNCC staff should contact the appropriate service support technologist by using contact information defined in the Configuration Management Database (CMDB)
- If a network component is involved in the service outage and SNCC staff cannot contact Op Engineering, contact the appropriate network engineer and the Op Engineering manager
- Priority 1 case escalation always requires direct interactive contact (voice, chat, email with reply) with the area to which the case is being escalated. Do not assume that others can work on a case until you confirm it interactively
- Help Desk always escalates to SNCC Level 1 unless response documentation specifies otherwise
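The Priority 1 routing rules for SNCC staff can be read as a small decision function. This is a sketch under stated assumptions: the function name, parameters, and returned strings are hypothetical, and the case where Op Engineering is reachable is taken from the 10-minute row of Table 4-2 rather than from the list above.

```python
from typing import List


def priority1_contacts(network_involved: bool, openg_reachable: bool = True) -> List[str]:
    """Return whom SNCC staff should contact for a Priority 1 escalation."""
    if not network_involved:
        # Contact details come from the Configuration Management Database (CMDB).
        return ["service support technologist (per CMDB)"]
    if openg_reachable:
        return ["Network Operational Engineering (OpEng)"]
    # Fallback when Op Engineering cannot be contacted.
    return ["appropriate network engineer", "Op Engineering manager"]
```

Whatever the function returns, the escalation is not complete until the contact has been confirmed interactively, as the guidelines require.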
4.5.2 Priority 2, 3 and 4 Case Escalation
- SNCC and Help Desk staff can escalate problems directly to the Level 3 Support Technologist if that escalation path is outlined in the response documentation
- SNCC and Help Desk staff should monitor cases dispatched to Level 3 Support Technologists to ensure that they have been accepted. SNCC and Help Desk staff should contact queue managers if cases are not being accepted
4.5.3 Guidelines for Level 2 and Level 3 Staff Members Who Have Accepted or Been Assigned an Incident
As a staff member who has agreed to work on, or been assigned to, an incident, you are responsible for handling the incident appropriately. For Priority 1 cases, SNCC staff will retain ownership of the incident record. For Priority 2, 3 and 4 cases, accept the case, acknowledging that you are aware of the case and will work on case resolution in accordance with the priority definitions. For Priority 1 cases, SNCC staff will continue to check in and provide input, but it is your responsibility to solve the problem and ensure SNCC staff can keep the incident record up to date. This is essential to ensure that appropriate customer updates can be provided. If you are overwhelmed during a Priority 1 outage and need assistance, ask SNCC staff to call a Situation Manager or ask for more help. If you are no longer able to work on an incident, notify the Situation Manager. If additional people do become involved in solving an incident, notify SNCC.
Section 4.6 - Roles and Responsibilities
The DoIT Incident Management Process involves the following roles and responsibilities.
Table 4-3 Incident Management Roles and Responsibilities
Role | Responsibilities | Examples |
---|---|---|
Help Desk | | |
SNCC Level 1 | | |
Operational Engineering Level 2 and Field Services | | |
Level 3 Support Technologist | | |
System Network Control Center Systems Management | | |
Situation Manager (see section 4.7 for more details) | | |
Process Owners (Help Desk Manager, SNCC Manager) | | |
Section 4.7 - Situation Manager Standards
Situation Manager is a management role invoked in longer or more severe outages. This role ensures management involvement in communications, coordination, and business decisions encountered during issue resolution and post-incident review.
The tasks associated with the Situation Manager role are routinely performed by an appropriate lead technologist and require management attention only when the incident warrants. The lead technologist should start communications and issue resolution before a Situation Manager becomes involved with a significant issue.
4.7.1 Situation Manager Involvement
SNCC should contact a Situation Manager as early as practical about any Priority 1 situation that has the potential to last more than two hours or in any other situation that SNCC or a lead technologist considers severe enough to warrant immediate management attention.
4.7.2 Identify the Situation Manager
Each service will designate a Situation Manager call list that will include at least one primary and one backup Situation Manager. The Primary and Secondary Situation Managers are listed in the CMDB, on the 'COOP Info' tab of the record for the service.
The SNCC staff member who has accepted the escalation of the issue should make this call, which should FOLLOW technical/horizontal escalation (i.e., get a technologist working on the issue first).
4.7.3 Situation Manager Responsibilities
The Situation Manager protects and balances the interests of DoIT and customers, and works to achieve at least an interim solution to issues as quickly as possible. If you are named the Situation Manager for a significant issue, the following is expected of you:
Lead communication
- If necessary, write a standard statement for the Help Desk to communicate to campus users if many calls are being received. This ensures a standard message to all affected users
- Maintain direct communication to internal DoIT staff. Include concise, confident, and solution-oriented information on:
  - What is happening and/or going to happen
  - Timeline for the fix
  - Why it is relevant to the user
  - The general groups of users that may be most affected
  - What DoIT is doing to fix the issue
  - The risks and what DoIT will do to mitigate them
- Write and/or edit information distributed to affected users. (DoIT Communications should be involved in notification regarding campus-wide issues, per existing processes)
- Communicate with DoIT management at your discretion (at minimum per vertical escalation guidelines)
- Communicate with DoIT customer technology managers, DoIT staff, and vendors as appropriate
Lead situation mobilization
- Identify the issue response team
- Make sure team members understand the issue
- Secure a war room or other facilities if necessary
Lead issue resolution decisions
- Meet with technologists to analyze options, business risks, and technical risks. Determine how to proceed, weighing all concerns
- Make sure multiple technologists are not working on conflicting paths.
- If necessary, stop troubleshooting to ensure that team members fully understand the issue and that all efforts are coordinated
- Study the business issues related to the issue to make resolution decisions. For example, a security-compromised system should be fully investigated before being put back on the network, but two failed core network devices should be put back into production as quickly as possible
Section 4.8 - Post Incident Review
SNCC-SM will coordinate Post-Incident Review (PIR) as determined necessary by DoIT Management. (Facilitators from other DoIT groups may be called upon to lead this effort.) An assigned staff member from SNCC-SM will perform the following steps:
- Determine if PIR is required
- Schedule the review. Involve an outside DoIT facilitator as appropriate
- Include the following topics in the discussion:
  - What happened
  - Issues involved
  - Resulting action items
- Document the proceedings of the review and produce two versions of a PIR document: a high-level review for departments and an in-depth version to help drive improvement within DoIT
- Assign and follow up on the action items, enlisting the help of other managers as needed
Section 4.9 - Handling an Emergency or Declaring a Disaster
The DoIT CIO's office has the authority to declare disasters or emergencies. Refer to the University Response Plan or the DoIT Disaster Recovery Plan.
- Disaster Recovery
- University Response Plan - The University Response Plan contains sensitive information, making it unavailable for public distribution.
Section 4.10 - Handling Security Incidents
Refer to Urgent Response documentation provided by the Security Team
Section 4.11 - Incident Management Workflow
The following chart outlines the standard incident management workflow.
See Also
- DoIT Operational Framework - Section 1.0 - Overview
- DoIT Operational Framework - Section 2.0 - Glossary of Significant Terms
- DoIT Operational Framework - Section 3.0 - Change Management
- DoIT Operational Framework - Section 5.0 - Configuration Management
- DoIT Operational Framework - Section 6.0 - Event Management
- Working with the Operational Framework (Policy)
- The DoIT Operational Framework, ITIL & Service Management Contacts at DoIT