For an overview of the Event Management and Monitoring capabilities DoIT provides on the servers for which it has operational responsibility, see DoIT Monitoring Services (Overview).
- Please ensure that the request for Event Management and Monitoring is consistent with any previous arrangements made with the Help Desk for the handling of the services the monitor corresponds to (e.g. if the Help Desk is supposed to alert Jane Doe of customer complaints about service X, Jane Doe should also be a contact for the monitor pertaining to service X).
- Requests for event monitors are made through WiscIT - DoIT's Change Management tool.
To start your request, log in to WiscIT and create a new Change Request with Category - Monitoring, SubCategory - Monitoring Request.
- Under Description, provide details for each of the following:
- Specific Application/process/URL/database/OS/Node/etc to be monitored (for web monitor, an IP, or URL - otherwise, describe the target of the monitor): This is the object you want the monitor to target. It could be any of the examples listed as well as a log file, system parameter, etc.
- Steps for the event monitor and conditions to check (look for a specific string of characters on a webpage, measure the time it takes to load a webpage, check the disk space free, ensure a process is running, etc): Here is where you outline what actions the event monitor must perform to test, and what specifically the monitor is checking for. For example, you could ask for an event monitor to log in to a website and look for a specific string of characters. You could also have a process on a DoIT server checked to ensure it's running, or check how much memory is left on a device, etc.
- Thresholds, Failure, and/or Warning Conditions that require action (string of characters on a web page doesn't match the expected string, webpage taking >30 seconds to load, Disk free space has fallen below 30% but is above 5%, Disk free space is below 5%, process is not running, etc): List the specific text, values, or times for which you want the event monitor to trigger an alert.
- Actions taken for each Threshold/Failure/Warning Condition (email warning of web page >30 seconds to load, email warning of Disk free space between 30% and 5%, incident created and escalated by the 24x7 staff if a process is not running, incident created and escalated by 24x7 staff if string of characters doesn't match on webpage): Here, list which of your Thresholds/Failure Conditions should result in an email notification only, and which should be promoted to the 24x7 staff for incident creation (in addition to an email notification).
- Frequency of event monitor (every 5 minutes 24x7, every half hour M-F 9-5, etc): How often, and during which hours, you want the event monitor to check for the Thresholds/Failure Conditions.
- Additional email address(es) to receive notification, for both email warnings and notification of incidents: Here you can have any combination of individual addresses and group lists (jon.doe@wisc.edu, jane.doe@wisc.edu, ServiceSupport@wisc.edu).
- Person(s)' direct contact information in the event of incident (by default, 24x7 operators will refer to administrators listed in CMDB): By default, the 24x7 staff use the service and monitor information available to them, starting with the Service administrators, to contact the correct people in the event of an incident. Beyond the Service administrators, list who should be contacted if an incident is created as a result of this particular event monitor, and how they should be contacted.
- Special SNCC 24x7 Staff instructions/directions/KB article reference for responding to this event monitor: Include relevant KB docs, any troubleshooting steps to take prior to contacting anyone outside of working hours, and any actions that may be required prior to or after the incident for the particular event.
- Save your CR to submit your monitoring request. The Monitoring team will contact the requester if any further clarification is needed.
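Before pasting the details into the Description, it can help to draft them as a structured checklist so that no item is left blank by accident. The following is only a minimal sketch of that idea: the field keys, the shortened labels, and the helper names `blank_fields` and `build_description` are hypothetical and are not part of WiscIT.

```python
# Hypothetical checklist of the Description items listed above.
# Keys and shortened labels are illustrative only; WiscIT defines no such structure.
FIELD_LABELS = {
    "target": "Target to be monitored",
    "steps": "Steps for the event monitor and conditions to check",
    "thresholds": "Thresholds, Failure, and/or Warning Conditions",
    "actions": "Actions taken for each condition",
    "frequency": "Frequency of event monitor",
    "emails": "Additional email address(es)",
    "contacts": "Direct contact information in the event of incident",
    "instructions": "Special SNCC 24x7 Staff instructions",
}


def blank_fields(request: dict) -> list:
    """Return the labels of items that are still empty in the draft."""
    return [
        label
        for key, label in FIELD_LABELS.items()
        if not request.get(key, "").strip()
    ]


def build_description(request: dict) -> str:
    """Render the draft as a dash-bulleted block to paste under Description."""
    return "\n".join(
        f"- {label}: {request.get(key, '').strip()}"
        for key, label in FIELD_LABELS.items()
    )


draft = {
    "target": "https://www.UFS.wisc.edu:6000",
    "frequency": "Every 15 minutes 24x7",
}
print(blank_fields(draft))       # items still to be filled in
print(build_description(draft))  # text to paste into the CR
```

Some items may legitimately stay blank (for example, when the CMDB defaults are sufficient), so the list of blank fields is a prompt for review rather than a validation rule.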
Example request
A common request is a check to see if a webpage is loading properly. For this example we'll use the Service "University Financial System."
- Change Request opened with Category - Monitoring, SubCategory - Monitoring Request selected, and CI set to: "University Financial System"
- Description:
- Specific Application/process/URL/database/OS/Node/etc to be monitored (for web monitor, an IP, or URL - otherwise, describe the target of the monitor): https://www.UFS.wisc.edu:6000
- Steps for the event monitor and conditions to check (look for a specific string of characters on a webpage, measure the time it takes to load a webpage, check the disk space free, ensure a process is running, etc):
Step 1: Navigate to https://www.UFS.wisc.edu:6000
Step 2: Log in as user: MonitorTest (password will be sent via phone or chat when requested from the Monitoring team.)
Step 3: Look for text "Welcome MonitorTest" after login.
- Thresholds, Failure, and/or Warning Conditions that require action (string of characters on a web page doesn't match the expected string, webpage taking >30 seconds to load, Disk free space has fallen below 30% but is above 5%, Disk free space is below 5%, process is not running, etc): (warn) Page load time is >20 seconds, (failure) Page load time is >45 seconds, (failure) text does not match.
- Actions taken for each Threshold/Failure/Warning Condition (email warning of web page >30 seconds to load, email warning of Disk free space between 30% and 5%, incident created and escalated by the 24x7 staff if a process is not running, incident created and escalated by 24x7 staff if string of characters doesn't match on webpage): Email warning if Page load time is >20 seconds but has not exceeded 45 seconds, alert the 24x7 staff and create incident if >45 seconds or if text does not match on two consecutive checks.
- Frequency of event monitor (every 5 minutes 24x7, every half hour M-F 9-5, etc): Every 15 minutes 24x7
- Additional email address(es) to receive notification, for both email warnings and notification of incidents (by default, Service Administrators listed in CMDB will be included in all warning and incident emails): Service administrators as defined in CMDB, and UFSteam@wisc.edu
- Person(s)' direct contact information in the event of incident (by default, 24x7 operators will refer to administrators listed in CMDB): See UFS Service administrator's information in CMDB, as well as the UFS team office phone number 262-XXXX
- Special SNCC 24x7 Staff instructions/directions/KB article reference for responding to this event monitor: See KB article #XXXXX
- CR submitted
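To make the thresholds and actions in the example request concrete, the decision the monitor would make on each check can be sketched as below. This is an illustration only, not how the Monitoring team's tooling is implemented. The names `CheckResult` and `decide_action` are hypothetical, and the sketch assumes that the "two consecutive checks" rule applies to the text mismatch only and that a single mismatch triggers no action yet; the request wording could also be read differently.

```python
from dataclasses import dataclass

WARN_SECONDS = 20.0   # (warn) page load time above this value
FAIL_SECONDS = 45.0   # (failure) page load time above this value
EXPECTED_TEXT = "Welcome MonitorTest"


@dataclass
class CheckResult:
    """Outcome of one run of the example monitor (hypothetical structure)."""
    load_seconds: float
    page_text: str


def decide_action(current: CheckResult, previous_text_matched: bool) -> str:
    """Map one check to an action from the example request.

    Assumption: a text mismatch creates an incident only when the
    previous check also failed to match (two consecutive checks).
    """
    text_matched = EXPECTED_TEXT in current.page_text
    if current.load_seconds > FAIL_SECONDS:
        return "incident"
    if not text_matched and not previous_text_matched:
        return "incident"
    if current.load_seconds > WARN_SECONDS:
        return "email_warning"
    return "none"


# A slow but successful login produces an email warning only.
print(decide_action(CheckResult(25.0, "Welcome MonitorTest"), True))
# A second consecutive text mismatch is promoted to the 24x7 staff.
print(decide_action(CheckResult(3.0, "Login failed"), False))
```

Spelling out the decision this way in the request helps surface ambiguities, such as whether a single text mismatch should already send an email, before the monitor is built.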