# C2E Endpoint Estimates

Description of the method used to estimate endpoint counts for Cybersecurity's C2E initiative

## Executive summary

Part of UW-Madison’s Cybersecurity to the Edge (C2E) initiative entails estimating the number of UW-Madison-owned endpoints running the Qualys Cloud Agent and/or Cisco Secure Endpoint (formerly Cisco AMP), along with the progress of each division in attaining full deployment of the software on all of its endpoints. While the endpoint counts running the software are easily obtained, the total number of endpoints in each division isn’t readily available. To address this gap, a linear model was constructed to leverage known data points to estimate total endpoint counts per division.

First, MAC address counts were gathered from the Authorized Agent Network Tool Suite (AANTS) database, which provides a current count of network-connected devices per IP subnet. This data is of limited quality, as it is difficult to correlate IP subnet to UW division. Second, full-time employee (FTE) counts were collected from UW’s Tableau instance. Next, endpoint counts were gathered from departments having full or near-full deployments of Qualys and Cisco Secure Endpoint. Various linear models were fitted to estimate a quantitative relationship between endpoint counts, MAC addresses, and FTEs. To account for units having an atypically large number of endpoints (for example, central service providers), additional categorical variables were manually applied to the data using institutional knowledge. An additional categorical variable was applied to DoIT departments to account for the comparatively high number of endpoints relative to other campus units.

The best model exhibited a strong statistical correlation between endpoint count and FTE count (including adjustments for large and DoIT units). Further analysis using commonly accepted methods indicates that the model is statistically valid. The total predicted endpoint estimate was 86,100; but due to the uncertainty of the model, this value could range from 43,600 to 169,000. Based on current deployment totals, we estimate the campus to have deployed Qualys, Cisco Secure Endpoint, or both on approximately 40,900 devices (47.5%).

It should be noted that the counts generated by the model are rough estimates only and should not be used as a checklist against which a particular unit’s progress can be measured. It is further noted that these counts only estimate the number of endpoints upon which Qualys and Cisco Secure Endpoint can be installed; it excludes devices such as printers, IOT devices, and devices running operating systems that are incompatible with Qualys and Cisco Secure Endpoint.

## Objective

This analysis was performed as part of UW-Madison’s C2E initiative, which aims to assess the extent to which the university has deployed Cisco Secure Endpoint and Qualys Cloud Agent software on its endpoints. Because departments or divisions maintain their own device inventory (or lack a device inventory altogether), it has been historically difficult for DoIT and the Office of Cybersecurity to assess how many endpoints each division has.

The goal of this analysis was to generate rough endpoint estimates for each division. Our approach was to gather data from departments and divisions from which we have known endpoint counts, fit a linear model to the data, and predict endpoint counts based on the coefficients generated by the model.

The data we have with which to create this kind of model is of limited quality. First, we have FTE counts per department and division (we used March 2022 payroll data). Second, we have MAC address counts mapped to IP addresses over a thirty-day period during July and August 2022. While the quality of the FTE counts itself is high-fidelity, endpoint counts and FTE counts don’t correlate precisely. Likewise, the fidelity of the mac-address data suffers from several problems. First, it is difficult to correlate IP addresses (and, hence, MAC addresses) back to a UDDS code. We created a separate model to do this, but the model is only 81% accurate against validated training data and is likely much less accurate against untested data. Second, the MAC address data doesn’t account for endpoints that are connected to public networks like UWNet or Eduroam, nor does it account for VPN-connected or other remote users.

In addition to FTE and MAC address data, a categorical variable (large_unit) was employed to account for divisions and departments having atypically high endpoint counts, for example units that host services for other campus units and, hence, may require large numbers of servers. We also treated DoIT as a special unit, since it runs many services for campus and will, therefore, have an atypically high endpoint count.

Despite the limitations described above, the model may serve as a rough estimate in the absence of any other means of quantifying endpoint counts. These numbers should only be used with a heavy disclaimer and should not be used as a checklist against which to measure the true progress of any school, college, or division in deploying security software to their endpoints.

## Data sources

### Mac address count data

For the first data set, we queried MAC address counts from the AANTS database for the last 30 days. There were approximately 32,000 subnet records in the AANTS database at the time we ran the model. To be useful in this analysis, MAC address counts had to be mapped to a UDDS code; historically this has been a challenge for the UW since AANTS does not tie a subnet record to UDDS. Each subnet is, however, tied back to the staff members who administer it. And from these staff members we can query UW’s WhoIs database to glean the UDDS code for each of those staff members. This isn’t always reliable, as the staff member isn’t necessarily a member of the UDDS the subnet belongs to. For example, SMPH network staff are listed as administrative and technical contacts for almost every SMPH subnet; but those staff members have a UDDS code corresponding to SMPH Administration rather than the subnet's actual SMPH department. Likewise, DoIT Departmental Support (DS) administers many departments’ endpoints on behalf of those departments, erroneously resulting in a UDDS code of A06.

### Subnet-to-UDDS correlation

To overcome these challenges, we combined the contact-based UDDS predictions with a natural language processing (NLP) model based on document similarity. The subnet names and descriptions were quantitatively compared with terms scraped from websites mapped back to a particular UDDS. To calculate the “best” UDDS match, the UDDS with the smallest Jaccard distance was weighted more heavily than those with larger distances. Additional institutional knowledge was also used to weight some UDDS predictions more heavily than others. The resulting data included a set of records for each subnet-UDDS combination ordered by the record having the UDDS with the highest probability of being correct.

### FTE counts

With the aim of correlating server counts directly to FTE counts, we also collected March 2022 FTE counts from UW’s Tableau instance. This data was a simple listing of seven-digit UDDS code along with FTE count rounded to one decimal place.

### Qualys and Cisco Secure Endpoint counts

Counts of Qualys and Cisco Secure Endpoint deployments for each campus unit were obtained by querying each application's API.

### Units with the most complete data

In order to estimate device counts for units that have incomplete or no data, we gathered a list of units that have full or relatively full deployments of either Cisco Secure Endpoint or Qualys; there were forty-seven such units, a sufficiently large sample size with which to make further predictions. We correlated device counts from those departments to MAC address and FTE counts to evaluate whether a statistically significant relationship exists. We then fit a linear model to the data and, using the coefficients of the model, estimate device counts for the remaining campus units for which we have no data or incomplete data.

## Data wrangling

The data needed to be grouped by division code and aggregated. The linear model attempts to model endpoint counts based on real counts, but the real counts available to us consisted of separate Qualys and Cisco Secure Endpoint counts. A single number was needed for the “real” count. For this, we used either the Qualys count or the Cisco Secure Endpoint count, whichever was higher. We called this single number the “endpoint count.”

## Plots

Scatterplots were generated to visually compare MAC address counts and FTE counts to endpoint counts. As shown, endpoint counts are generally proportional to both MAC address and FTE counts.

A boxplot of endpoint counts against the "large_unit" classification illustrates good correlation, with only a few outliers at the high end of each plot.

## Linear modeling

For a first pass, we tried fitting linear models to various combinations of predictors without doing any mathematical transformations. Endpoint counts were fitted against the following predictors (a plus sign indicates an additive relationship, while a colon indicates a multiplicative interaction between predictor pairs):

mac_ct

fte_ct

fte_ct + mac_ct

fte_ct + mac_ct + fte_ct:mac_ct

mac_ct + large_unit

fte_ct + large_unit

large_unit + fte_ct + mac_ct

large_unit + fte_ct + mac_ct + fte_ct:mac_ct

fte_ct + large_unit + fte_ct:large_unit

fte_ct:large_unit

fte_ct + large_unit + doit_unit

fte_ct + large_unit:fte_ct + doit_unit:fte_ct

fte_ct + large_unit:fte_ct + doit_unit

fte_ct + large_unit + doit_unit:fte_ct

fte_ct + large_unit:doit_unit

Using commonly accepted statistical approaches, we analyzed the best-performing models and looked for leverage points (outliers), i.e., points that have a strong influence over the resulting model. The following summarizes the model results:

Model Run |
Model Formula |
Insignificant Coefficients |
Adjusted R-squared |
MSE | Breusch- Pagan |
Shapiro |
---|---|---|---|---|---|---|

1 | 1: endpoint_ct ~ mac_ct | 0 | 0.887 | 141438 | heteroscedastic | not normal |

2 | 2: endpoint_ct ~ fte_ct | 0 | 0.558 | 360794 | heteroscedastic | not normal |

3 | 3: endpoint_ct ~ fte_ct + mac_ct | 1 | 0.891 | 130467 | heteroscedastic | not normal |

4 | 4: endpoint_ct ~ fte_ct + mac_ct + fte_ct:mac_ct | 2 | 0.962 | 43757 | heteroscedastic | not normal |

5 | 5: endpoint_ct ~ mac_ct + large_unit | 1 | 0.886 | 137251 | heteroscedastic | not normal |

6 | 6: endpoint_ct ~ fte_ct + large_unit | 1 | 0.552 | 356243 | heteroscedastic | not normal |

7 | 7: endpoint_ct ~ large_unit + fte_ct + mac_ct | 2 | 0.889 | 127740 | heteroscedastic | not normal |

8 | 8: endpoint_ct ~ large_unit + fte_ct + mac_ct + fte_ct:mac_ct | 0 | 0.972 | 30320 | heteroscedastic | not normal |

9 | 9: endpoint_ct ~ fte_ct + large_unit + fte_ct:large_unit | 2 | 0.557 | 343460 | heteroscedastic | not normal |

10 | 10: endpoint_ct ~ fte_ct:large_unit | 0 | 0.547 | 360690 | heteroscedastic | not normal |

11 | 11: endpoint_ct ~ fte_ct + large_unit + doit_unit | 0 | 0.960 | 30652 | heteroscedastic | not normal |

12 | 12: endpoint_ct ~ fte_ct + large_unit:fte_ct + doit_unit:fte_ct | 0 | 0.959 | 31668 | heteroscedastic | not normal |

13 | 13: endpoint_ct ~ fte_ct + large_unit:fte_ct + doit_unit | 0 | 0.959 | 31668 | heteroscedastic | not normal |

14 | 14: endpoint_ct ~ fte_ct + large_unit + doit_unit:fte_ct | 0 | 0.960 | 30652 | heteroscedastic | not normal |

15 | 15: endpoint_ct ~ fte_ct + large_unit:doit_unit | 0 | 0.960 | 30652 | heteroscedastic | not normal |

When evaluating a model's performance, a number of factors must be considered. First, only predictors that are statistically significant should be included in the model as coefficients. Next, the model's residuals (how far away the data deviate from the line of best fit) must be homoscedastic and normally distributed. That is, they should vary more or less constantly and should not exhibit any patterns. If the residuals are heteroscedastic, this typically indicates the model isn't valid and that a mathematical transformation should be performed. Residuals that are not normally distributed could indicate an invalid model but are generally less of a concern when there are at least ten observations per variable (which is true of our data set).

After one or more models have been validated, the best model can be selected by either maximizing the adjusted R-squared value or minimizing the mean squared error (MSE). However, as shown in the table, none of the models yielded homoscedastic residuals. This can be visually illustrated by an example residual plot from the first model run:

As shown on the plot of residuals vs fitted values, there is noticeable heteroscedasticity, which invalidates the model. This is confirmed quantitatively by the Breusch-Pagan test, which generated high test statistics for the models as well as a low p-value. For a model to be considered homoscedastic, a low test statistic and p-values greater than or equal to 0.05 are required.

## Transformations

Because the models thus far resulted in heteroscedastic residuals, a common approach to correct for this is to mathematically transform one or more of the variables. We tried several transformations on the response variable (endpoint count), including log, square-root, and cubed-root. We also used a commonly accepted approach to estimate powers to which a predictor could be raised to maximize best fit. We used this approach (the Box-Cox method) on some of the better-performing models to evaluate whether any improvement could be made.

The following summarizes the model runs that exhibited homoscedastic, normally distributed residuals, ordered by the highest adjusted R-squared score:

Model Run |
Model Formula |
Insignificant Coefficients |
Adjusted R-squared |
MSE | Breusch- Pagan |
Shapiro |
---|---|---|---|---|---|---|

48 | 48: (endpoint_ct)^(1/3) ~ fte_ct + large_unit + doit_unit | 0 | 0.881 | 0.821 | homoscedastic | normal |

57 | 57: (endpoint_ct)^(1/3) ~ fte_ct + large_unit + doit_unit:fte_ct | 0 | 0.881 | 0.821 | homoscedastic | normal |

60 | 60: (endpoint_ct)^(1/3) ~ fte_ct + large_unit:doit_unit | 0 | 0.881 | 0.821 | homoscedastic | normal |

30 | 30: (endpoint_ct)^(1/3) ~ mac_ct + large_unit | 0 | 0.781 | 1.861 | homoscedastic | normal |

43 | 43: log(endpoint_ct) ~ fte_ct:large_unit | 0 | 0.655 | 0.452 | homoscedastic | normal |

63 | 63: tendpoint_ct3 ~ tfte_ct3 | 0 | 0.638 | 0.001 | homoscedastic | normal |

28 | 28: log(endpoint_ct) ~ mac_ct + large_unit | 0 | 0.637 | 0.406 | homoscedastic | normal |

19 | 19: log(endpoint_ct) ~ fte_ct | 0 | 0.629 | 0.497 | homoscedastic | normal |

71 | 71: tendpoint_ct4 ~ mac_ct + large_unit | 0 | 0.600 | 0.001 | homoscedastic | normal |

16 | 16: log(endpoint_ct) ~ mac_ct | 0 | 0.524 | 0.555 | homoscedastic | normal |

## Model selection

Based on the above results, we selected the best-performing model that also conforms to the assumption of equal variance, normality of residuals, and independence of predictor variables (model run #48). The selected model fits the cubed root of endpoint counts against the FTE counts with additional corrections for large_unit and doit_unit. The adjusted R-squared value is high (0.881), which indicates a very strong relationship between endpoint count and the predictor variables. The mean squared error is also low (0.821), further confirming a good fit. Visual evaluation of residuals also confirm the model's validity:

Based on a visual analysis of residuals and on evaluating the model using the Breusch-Pagan test, the residuals are homoscedastic, satisfying a critical component of model validation. An additional point of validation is that the residuals be close to normally distributed; based on a visual observation and on the Shapiro test, this condition is also satisfied. In addition, the small p-values generated by the model indicate that all three predictors are statistically significant. As a final point of confirmation, variance inflation factor (VIF) values were examined to evaluate whether any colinearity exists between predictor pairs. The values ranged between 1.06 and 1.65, indicating very little colinearity.

In summary, there is a very strong, statistically significant relationship between the cubed root of endpoint count and the predictors used in the model. The model appears to be valid both by visual observation and by numerical calculation using commonly accepted testing methods. The resulting model can be expressed as follows:

(endpoints) ^ 1/3 = 3.78638642 + (0.01604151 * fte_count) + (2.96191105 * large_unit) + (6.21825879 * doit_unit)

## Confidence intervals

90% confidence intervals were generated for the coefficients of the best model:

Parameter | Value | Margin | Low | High |
---|---|---|---|---|

Intercept | 3.79 | 0.385 | 3.4 | 4.17 |

fte_ct | 0.01609 | 0.00309 | 0.013 | 0.0191 |

large_unit | 2.969 | 0.784 | 2.18 | 3.74 |

doit_unit | 6.22 | 2.07 | 4.158 | 8.28 |

## Predict endpoint counts

Using the selected model, we predicted endpoint counts based on the predictor variables that exhibited the best correlation (FTE counts, large_unit, and doit_unit).

Total endpoints (90% lower bound) | 43,600 |

Total endpoints | 86,100 |

Total endpoints (90% upper bound) | 169,000 |

Endpoints deployed | 40,900 |

Percent deployed | 47.5 |

Deployment counts of Qualys and Cisco Secure Endpoint were also estimated per departmental UDDS code, but to preserve anonymity these itemized counts are not shown here.

## Conclusion

In conclusion, there is good statistical evidence to support a very strong linear model correlating endpoint count to FTE count, with additional categorical variables of large_unit and doit_unit. Commonly used techniques to evaluate the model’s residuals indicate that the model is valid. However, the 90% prediction interval includes a wide range of values, resulting in a total predicted endpoint point estimate of 86,100, with lower and upper bounds of 43,600 and 169,000, respectively. Based on current deployment totals, we estimate the campus to have deployed Qualys and AMP to approximately 40,900 computers (47.5%).