In any large modern company, data has become the lifeblood of the organization and databases have become the beating hearts which supply that vital resource to every aspect of the company. Given that the failure of a single database, even for a short amount of time, can potentially lead to hundreds of thousands of dollars in lost revenue, it has become imperative to ensure complete reliability of every database within the ecosystem.
With hundreds to thousands of databases needing to be monitored, it has become increasingly difficult for Database Administrators (DBAs) to maintain adequate vigilance on every single database using standard monitoring techniques. Recently, companies have been turning to Machine Learning algorithms to “study” each database, determine if a database is displaying signs of distress, and then alert a DBA that action may be required on a given database.
One of the newest and most promising algorithms in use at Cox Communications is Density-Based Spatial Clustering of Applications with Noise (DBScan). Fundamentally, the DBScan algorithm finds groups of points which lie close together (i.e. have a higher spatial density) and assigns them to the same cluster. The process repeats until every data point has either been assigned to a cluster or been labelled an outlier.
It is these outliers, or anomalies, which may be harbingers of database problems.
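The clustering rule described above can be illustrated with a minimal, pure-Python sketch. It is not the production implementation: the function name `dbscan`, the parameters `eps` (neighbourhood radius) and `min_samples` (how many neighbours make a point "dense"), and the toy two-dimensional points are all assumptions chosen for illustration.

```python
from math import dist

NOISE = -1  # label used here for outliers (an assumed convention)


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (point i included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is dense too, so keep growing
    return labels


# Two tight groups of four points each, plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)
```

In this toy data the isolated point has no neighbours within `eps`, so it is the one left outside every cluster. That is the kind of point the monitoring approach treats as a possible warning sign.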
Each night, eight of the most important metrics, sampled in five-minute increments over the past thirty days, are fed into the ML algorithm for each database. Principal Component Analysis reduces the data from eight dimensions to three, and the reduced data is then used to create one DBScan model per database. Whenever a new datapoint arrives, it is simply compared to the data in the pre-trained model to determine whether it is “normal” or an anomaly which should be investigated further.
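As a rough sketch of this nightly train-then-score flow, the NumPy example below standardizes eight synthetic metrics, projects them onto three principal components, and keeps the dense ("core") points as the model. The source does not say how a new datapoint is compared to the trained model, so this sketch assumes one plausible rule: a point is "normal" if it lies within `eps` of some core point. The names `fit_model` and `is_anomaly`, the values of `eps` and `min_samples`, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_model(X, n_components=3, eps=0.5, min_samples=5):
    """Standardize, reduce with PCA, and keep the dense (core) points."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std
    # PCA via SVD of the centred data: rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    P = Z @ components.T
    # A core point has at least min_samples points (itself included) within eps.
    # The full pairwise matrix is fine for a small sketch, not for large inputs.
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    core = P[(d <= eps).sum(axis=1) >= min_samples]
    return {"mean": mean, "std": std, "components": components,
            "core": core, "eps": eps}


def is_anomaly(model, x):
    """Assumed rule: anomalous if farther than eps from every core point."""
    p = ((x - model["mean"]) / model["std"]) @ model["components"].T
    if len(model["core"]) == 0:
        return True
    return bool(np.linalg.norm(model["core"] - p, axis=1).min() > model["eps"])


# Synthetic stand-in for one database: 500 samples of 8 correlated metrics.
rng = np.random.default_rng(0)
load = rng.normal(size=(500, 1))  # shared "load" factor driving every metric
X = 50.0 + 10.0 * load + rng.normal(scale=5.0, size=(500, 8))

model = fit_model(X)
typical = X.mean(axis=0)                # an average-looking reading
spike = typical + 50 * X.std(axis=0)    # every metric far above its usual range
print(is_anomaly(model, typical), is_anomaly(model, spike))
```

Scoring a new point only needs the stored mean, scale, principal directions, and core points, which is why the comparison is cheap once the nightly training has run.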
By operationalizing DBScan ML techniques on database monitoring data, database alerts now arrive 15 minutes earlier than those from existing monitors, and false positive alerts have decreased by a factor of six.