Extensive publicity about the gathering and use of personal data by popular online services has heightened privacy concerns, especially in developed countries. This has led to the consideration and passage of privacy regulations in many parts of the world. The most visible example is the European Union's General Data Protection Regulation (GDPR), which defines a new regulatory framework for the management of personal data. These regulations are colliding in unexpected ways with Internet service providers' desire to protect their subscribers from malicious activity.
As the May 2018 deadline for GDPR compliance drew closer, security researchers realized that the regulation could affect the use of the WHOIS database, which stores data about domain name registrations. WHOIS was widely used for security research because it contains useful information about each registration, such as who registered the name, their contact information (for example, an email address), their location, and more. Since domain names are fundamental to activating and maintaining most security exploits, data about their provenance is useful.
The Internet Corporation for Assigned Names and Numbers (ICANN), the organization responsible for administering the WHOIS database, has defined a temporary specification that significantly pares back the data fields in WHOIS to make it compatible with GDPR. Information that's been useful for security research in the past is no longer available. ICANN has convened a group to develop a long-term solution, but it's not clear where that effort will lead.
There are earlier examples of proposed privacy regulations that had the potential to impair security research by limiting the availability of data. In early 2016, the United States Federal Communications Commission began to formulate regulations that would have restricted the gathering of various kinds of network data. In that case, industry and the research community collaborated and advocated for revisions that would ensure privacy while still allowing the capture and use of properly anonymized network data.
It's inevitable that collisions between privacy and security will continue to occur. With less data available, security researchers have to maximize the utility of the security data that remains. Security research will need to move from rigid, deterministic, rule-based analysis, where personal information was helpful, to behavioral, anomaly-based analysis across very large volumes of anonymized data. The future calls for overlaying multiple layers of data where no single layer produces a result on its own.
This will require highly automated processing and machine learning. Advanced algorithms can expand coverage of activity related to known threats and discover previously unknown attacks without compromising precision (that is, without generating more false positives). High-performance processing of real-time data can also improve agility, or how quickly threats are found. There's also the possibility of reducing research costs by extending the efforts of human experts with machines.
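To make the layered idea concrete, the following is a minimal sketch in Python, not the production system discussed in this paper. It assumes three hypothetical layers computed from anonymized DNS data (a lexical score for the domain name, the share of queries answered with NXDOMAIN, and how recently the name was first observed), and it flags a domain only when at least two layers agree. The function names, field names, and thresholds are all illustrative assumptions.

```python
import math
from collections import Counter


def name_entropy(domain: str) -> float:
    """Shannon entropy (bits per character) of the leftmost label."""
    label = domain.split(".")[0]
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def layer_scores(domain: str, stats: dict) -> dict:
    """Return one weak score in [0, 1] per layer.

    `stats` is a hypothetical per-domain summary with no personal data:
    total queries, NXDOMAIN responses, and days since first observation.
    """
    return {
        # Assumed scale: 4 bits per character or more counts as fully random.
        "lexical": min(name_entropy(domain) / 4.0, 1.0),
        "nxdomain": stats["nxdomain"] / max(stats["queries"], 1),
        # Assumed cutoff: names first seen within a week count as new.
        "novelty": 1.0 if stats["age_days"] < 7 else 0.0,
    }


def is_suspicious(domain: str, stats: dict,
                  min_layers: int = 2, layer_threshold: float = 0.6) -> bool:
    """Flag a domain only when several layers agree; one layer is never enough."""
    scores = layer_scores(domain, stats)
    agreeing = [name for name, score in scores.items() if score >= layer_threshold]
    return len(agreeing) >= min_layers


if __name__ == "__main__":
    sample = {"queries": 100, "nxdomain": 80, "age_days": 2}
    print(is_suspicious("xk7q2vzp9w.example", sample))
```

Requiring agreement between layers is one simple way to protect precision: a random-looking name that is old and resolves normally is left alone, as is a new name that looks ordinary. A learned model would replace the fixed thresholds in practice.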
This paper covers a recent example of privacy regulation affecting security research by outlining the issues that led to the WHOIS problem and the effort to comply with GDPR. It then discusses a way forward: applying modern data processing techniques to large data sets to expand threat coverage, improve precision, and increase agility. Finally, it describes a production machine learning system that analyzes live-streamed, anonymized DNS data gathered from DNS resolvers serving active Internet users all over the world, along with the results it can generate.