09 Aug 2020
The IETF is standardizing SNI encryption and DNS encryption in a bid to improve privacy. Both techniques rely on hiding private traffic in a crowd of benign traffic, and are mostly useful if such crowds exist. I set out to estimate whether they do, and the results so far are not encouraging. There is some concentration of Internet services on big Content Distribution Networks, Cloud Providers and Hosting Providers, but “one IP address per server” remains the rule, which defeats these privacy techniques: why look at DNS data or TLS SNI parameters when IP addresses are almost as informative? The techniques are only useful today for the servers that share addresses with others, and that seems to be a small minority.
Many privacy strategies rely on a form of “hiding in crowds”. For example, clients can use TLS and SNI encryption to hide their Internet activity. Observing the network then only reveals the IP addresses of the client and the server, while SNI encryption hides the specific service that the client is accessing. However, the protection is only effective if the server manages many services. If there are just a few, observers can easily make educated guesses. In short, hiding the SNI only works if the server manages a crowd of services in which the client can hide.
The same issue arises with DNS access. Protocols like DNS over HTTPS are designed to hide the DNS requests of the client to gain some privacy, but simply directing DNS requests to a different service than the local ISP merely displaces the problem. Does sending all your DNS requests to Google or Cloudflare really improve privacy over sending them to Comcast or Verizon? An interesting idea is to split this stream of requests between multiple servers, as indicated by each server’s preferences. For example, if all the requests for services hosted by Akamai were answered by the DNS servers of Akamai, these requests would not tell Akamai anything that it could not already deduce from the incoming traffic, while hiding the information from all other actors. Similarly, the requests for servers managed by Cloudflare would be sent to Cloudflare, and so on. That strategy for DNS queries is another variant of hiding in crowds.
But do we have such crowds? I set out to measure that, and the results so far are somewhat mixed. The idea for the measurement is to start with a list of popular names, and to investigate how the corresponding content is served. I started with the “Majestic Million” list, which contains a million service names ranked by popularity. I will eventually explore the whole list, but for now I started with the first 25,000 names. I wrote software to look up the IP address of each service, and then to look up the “autonomous system” (AS) to which that address belongs. An AS groups a set of network addresses that follow common routes through the Internet, i.e., a set of network prefixes allocated to a single network provider and managed the same way. My first definition of the crowd is thus “all services that are served from the same AS”. As seen on the graph below, I do indeed observe some concentration: 50% of the services are served by the first 12 ASes, and 70% by the first 100 ASes.
The top 20 ASes in the list are:
These names are largely what we expect: content distribution networks (CDN) like Cloudflare, Fastly, Akamai, or Incapsula; cloud providers like Amazon, Google, Microsoft or Alibaba; and several large Internet providers. Automattic is the company managing WordPress, which hosts many blogs. OVH is a French cloud provider. Hetzner, Digital Ocean, Linode and Unified Layers provide hosting services. Lee Enterprises manages several news publications. These top providers host many sites and could perhaps provide some amount of “hiding in crowds”. On the other hand, the crowd thins out quickly as we go down the list. The top 25,000 names are served by a total of 3342 ASes, and the average number of sites per AS for the bottom 3322 ASes is less than 2. Not much hiding there.
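The per-AS measurement described above can be sketched in a few lines of Python. This is only an illustration, not the software used for the published numbers: the function names `asn_for_address` and `ases_needed` are hypothetical, the prefix table is assumed to be a plain list of (prefix, AS number) pairs extracted from a BGP table, and the names are assumed to be already resolved to addresses and mapped to ASes.

```python
import ipaddress
from collections import Counter


def asn_for_address(addr_text, prefix_table):
    """Longest-prefix match of one address against (prefix, asn) pairs.

    prefix_table is an assumed format: [("192.0.2.0/24", 64500), ...].
    Returns None when no prefix covers the address (not routed).
    """
    addr = ipaddress.ip_address(addr_text)
    best = None  # (prefix length, asn) of the most specific match so far
    for prefix_text, asn in prefix_table:
        net = ipaddress.ip_network(prefix_text)
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, asn)
    return None if best is None else best[1]


def ases_needed(name_to_asn, share):
    """Count how many of the largest ASes cover `share` of the names.

    name_to_asn maps each service name to its AS number;
    share is a fraction between 0 and 1, e.g. 0.5 for 50%.
    """
    counts = Counter(name_to_asn.values())
    total = sum(counts.values())
    covered = 0
    for rank, (_asn, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank
    return len(counts)
```

Calling `ases_needed(name_to_asn, 0.5)` on the resolved names gives the kind of figure quoted above, i.e. how many ASes it takes to serve half of the services.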
ASes are one thing, but what about IP addresses? The per-AS statistics tell us about the concentration in the CDN and cloud markets, not about individual servers. But techniques like encrypted DNS or encrypted SNI will not provide much privacy if the name of the server can be deduced easily from the IP address. To get privacy, we need address sharing, not just colocation of servers. We can estimate that by tabulating the number of sites using the same IP address. Here is a list of the top ASes and the average number of domains per IP address in each:
The pattern that emerges is clear: with a few exceptions, CDN and hosting providers strive to allocate separate IP addresses to each of the hosted sites. In most cases, the IP address is unique to the domain, and techniques like Encrypted SNI or Encrypted DNS do not provide much additional privacy. There are some exceptions of course: Fastly, Automattic and Lee. For now, these are exceptions, not the rule.
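The "domains per IP address" figure can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `domains_per_ip` and the record format (one `(name, ip, asn)` tuple per resolved name, with a single address kept per name) are hypothetical and are not taken from the measurement software.

```python
from collections import defaultdict


def domains_per_ip(records):
    """Average number of distinct domains per distinct IP address, per AS.

    records is an iterable of (name, ip, asn) tuples (assumed format).
    A ratio close to 1.0 means one address per site, i.e. no crowd.
    """
    names = defaultdict(set)  # asn -> set of domain names seen
    addrs = defaultdict(set)  # asn -> set of IP addresses seen
    for name, ip, asn in records:
        names[asn].add(name)
        addrs[asn].add(ip)
    return {asn: len(names[asn]) / len(addrs[asn]) for asn in names}
```

An AS where four sites share a single address yields 4.0, while an AS that gives every site its own address yields 1.0, which is the pattern described above for most providers.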
Of course, my numbers should be taken with a grain of salt. I have only explored 25,000 sites so far; a larger sample would provide a better picture. I am starting from the Majestic Million list, and other lists might give different results. The Majestic Million list ranks sites based on their popularity, measured by the number of references from other sites. I could not resolve 1649 of the first 25,000 names, for a variety of reasons. Some names like "nl.ca" do not correspond to actual sites, some other names have no IP addresses, and a few names have IP addresses in subnets not routed in the copies of the BGP tables that I use. And there are of course some configuration mistakes, such as IP addresses listed as 127.0.0.1 or 0.0.0.0. I excluded all these 1649 names from the statistics.
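As an illustration of the last exclusion, addresses such as 127.0.0.1 or 0.0.0.0 can be screened out with Python's standard `ipaddress` module. The helper name `is_usable` is hypothetical, and this sketch only covers the two cases named above; it does not claim to reproduce the exact rules applied to the data set.

```python
import ipaddress


def is_usable(addr_text):
    """Return False for text that is not an IP address, or that is a
    loopback (e.g. 127.0.0.1) or unspecified (0.0.0.0) address."""
    try:
        addr = ipaddress.ip_address(addr_text)
    except ValueError:
        return False  # not a valid IPv4 or IPv6 address
    return not (addr.is_loopback or addr.is_unspecified)
```

Names whose only addresses fail this check would simply be dropped before computing the per-AS and per-address statistics.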
The software used for the data collection and the data files used for the configuration are available at https://github.com/private-octopus/centraldns. Suggestions on how to improve these statistics are most welcome!