PROVISIONAL: Datacenter Shutdown Policy

When datacenter cooling is disrupted, DiscoverIT must decide which resources remain active and for how long. This document describes the policy for shutting down resources in the datacenter.


Automatic temperature shutdowns

Servers are configured to shut themselves down as datacenter temperatures rise. Each server shuts down when its own temperature reading reaches the level listed for its group:

  • 27C: Non-occupant/CHTC resources.
  • 35C: Most DiscoverIT-managed physical servers (i.e., those not listed in the next group).
  • 40C: All research storage nodes (i.e., datavault, PI storage servers), Domain Controllers, Hyper-V hosts, and Exchange mail servers.
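The thresholds above can be sketched as a simple lookup. This is only an illustration: the group keys and the function name are hypothetical labels, not identifiers from DiscoverIT's actual configuration, and it assumes a group shuts down once its reading is at or above the listed value.

```python
# Thresholds in degrees Celsius, taken from the list above.
# The group keys are hypothetical labels chosen for this sketch.
AUTO_SHUTDOWN_THRESHOLDS_C = {
    "non_occupant_chtc": 27,
    "discoverit_physical": 35,
    "storage_dc_hyperv_exchange": 40,
}


def groups_expected_down(temp_c):
    """Return the groups whose automatic shutdown threshold has been reached."""
    return [
        group
        for group, threshold in AUTO_SHUTDOWN_THRESHOLDS_C.items()
        if temp_c >= threshold  # assumption: shutdown triggers at or above the level
    ]


print(groups_expected_down(36))
```

At a reading of 36C, for example, the sketch reports the first two groups and leaves the 40C group running.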

Manual shutdown priorities

We typically try to keep datacenter temperatures at or below 90 degrees F (about 32C) while cooling is out. If the automatic shutdowns aren't sufficient to control temperatures in the datacenter, we will begin manually shutting down resources in the following order:

  1. Any non-occupant compute resources (for example, execute nodes) that haven't shut themselves down automatically.
  2. Any occupant compute resources (for example, execute nodes) that haven't shut themselves down automatically.
  3. Any remaining non-occupant resources.
  4. Any remaining occupant physical servers that do not perform an infrastructure role (e.g. huisken-analysis-1)
  5. Research storage servers.
  6. DiscoverIT infrastructure and email.
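As a rough illustration of the manual order above, the sketch below picks the next group to shut down while the room is over the 90 degrees F target. The group labels, function names, and the idea of tracking which groups are "still up" are assumptions made for this example. It does not describe any existing DiscoverIT tooling.

```python
MANUAL_TARGET_F = 90  # target ceiling from the policy text

# Hypothetical labels for the six steps, in the order listed above.
MANUAL_SHUTDOWN_ORDER = [
    "non-occupant compute",
    "occupant compute",
    "remaining non-occupant",
    "occupant non-infrastructure physical servers",
    "research storage",
    "DiscoverIT infrastructure and email",
]


def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def next_group_to_shut_down(room_temp_f, groups_still_up):
    """Return the next group to shut down, or None if no action is needed."""
    if room_temp_f <= MANUAL_TARGET_F:
        return None  # temperature is under control
    for group in MANUAL_SHUTDOWN_ORDER:
        if group in groups_still_up:
            return group
    return None  # nothing left to shut down


print(round(fahrenheit_to_celsius(MANUAL_TARGET_F), 1))
print(next_group_to_shut_down(93, {"occupant compute", "research storage"}))
```

The conversion is included because the automatic thresholds are given in Celsius while the manual target is given in Fahrenheit.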

CHTC Shutdown priority list

As we shut down non-occupant/CHTC devices, we try to respect priorities about which resources should stay up longest and which can come down right away. Below is the current (as of July 2020) outline of the shutdown priorities for CHTC resources:

Priority 4 (shut down first)

  • e* (execute nodes)
  • exec-* (BaTLab execute nodes)
  • batlab-* (BaTLab execute nodes)

Priority 3

  • Anything not listed in the other priorities

Priority 2

  • squid*
  • hypervisor*
  • deepdive* (deep dive servers)

Priority 1 (shut down last)

  • submit-*
  • submit*
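The priority list above can be read as hostname glob patterns. The following is a minimal sketch under that assumption, using Python's standard fnmatch module. The example hostnames are made up, and treating "e*" as a literal glob (any name beginning with "e") is an interpretation, not something the policy states.

```python
from fnmatch import fnmatchcase

# Patterns copied from the priority list above; higher numbers shut down first.
CHTC_PRIORITY_PATTERNS = {
    4: ["e*", "exec-*", "batlab-*"],
    2: ["squid*", "hypervisor*", "deepdive*"],
    1: ["submit-*", "submit*"],
}
DEFAULT_PRIORITY = 3  # "Anything not listed in the other priorities"


def chtc_priority(hostname):
    """Return the shutdown priority for a hostname (short name, case-insensitive)."""
    short_name = hostname.split(".")[0].lower()
    for priority, patterns in CHTC_PRIORITY_PATTERNS.items():
        if any(fnmatchcase(short_name, pattern) for pattern in patterns):
            return priority
    return DEFAULT_PRIORITY


def shutdown_order(hostnames):
    """Sort hostnames so that the highest priority number comes down first."""
    return sorted(hostnames, key=lambda name: -chtc_priority(name))


# Hypothetical hostnames, for illustration only.
print(shutdown_order(["submit-1", "squid1", "gitlab", "e123"]))
```

Because the patterns in different priorities do not overlap, the order in which the priorities are checked does not affect the result.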

Keywords: cooling datacenter shutdown   Doc ID: 103858
Owner: Andrew M.   Group: DiscoverIT
Created: 2020-07-09 16:29 CST   Updated: 2020-07-23 14:14 CST