01 Dec 2011
Many peer-to-peer systems rely on Distributed Hash Tables. DHT allow publication of indexed data through a set of cooperating nodes. For example, in the P2P SIP design, a SIP node can publish its current location, e.g. "host-29.some-network.net," indexed by the SIP URL of the user, e.g. sip:alice@example.com. The publication process starts by hashing the index to the corresponding hash code, typically a large integer. The message is then pushed to the DHT node whose index is closest to the expected hash code. When someone later wants to contact sip:alice@example.com, they will in turn compute the hash of the SIP URL, retrieve the data indicating the current location of Alice, and establish a P2P SIP session.
This is well and good, but not very private. Anyone can access the DHT and retrieve Alice's location. In fact, we see examples of that in Bit Torrent systems that also rely on DHT to publish the location of files. That's quite convenient, as it eliminates any need for a central repository of file locations. But if they publish copyrighted material, the copyright holders or they representatives can find out and sue them. The data published in a DHT is just as public as if it was published on a big web service! Anyone can simply use a DHT client, seek the record describing where a specific file is published, and find the address of the publisher.
Suppose that we want to design a system where "only my friends can see me." We will still want to use an existing DHT, because engineering a new DHT is really hard. So, we need to rely on some form of encryption to ensure that the communication remains private. Let's go back to our example of P2P SIP, and assume that Alice wants to publish information for all her friends. In her first attempt at privacy, she makes two changes:
She has tried to achieve two goals. Since only her friends know that "jabberwocky2011" is Alice, third parties cannot know whether Alice is online. They may be able to scan the DHT and find that there is a record under the "jabberwocky2011" index, but in theory they cannot link that to Alice. And in any case, as long as only her friends know the secret key used to encrypt the location data, only her friends will be able to know her location.
All that may work well, until something happens. Suppose that her former friend Eve becomes not so friendly anymore. Eve knew the random index used by Alice, and she also knew the key used to encrypt the location. She can turn that information to third parties. This is exactly how many French underground cells fell during WW2. The Gestapo would somehow manage to arrest a member of the cell and make him speak, and he would turn out information leading to the other members. The cell could only remained safe if it quickly noticed the former member's arrest, cut all communication with him, and change all passwords and other secrets that the former member knew. If she wants to retain her privacy, Alice would have to do exactly the same thing: cut Eve off, create a new random identifier and a new password, and tell all that to her remaining friends. That may now be practical.
The first design that Alice envisaged is flawed because the same secret information is shared with all her friends. Thinking again, Alice devises a new system, in which she exchanges different identifiers and secret keys with each of her friends – one for Bob, another one for Carol, and yet another one for Eve. When she comes online, she publishes as many location records as she has friends. Her friends know which specific record they should retrieve, and they can only decrypt the individual record meant for them. As soon as Alice learns that Eve cannot be trusted anymore, she destroys the identifier and secret key dedicated to Eve. She does not need to destroy the information relating to her other friends, and she can still communicate with Bob and Carol.
Alice believes that she has done well, but there is still an issue. DHT nodes cannot be all trusted, yet they can find out Alice's location, or at least her IP address, at the time of publication. The request to publish information in the DHT is sent over the network, and the DHT nodes can find the message's origin. When Bob retrieves Alice's location, the DHT nodes can also find Bob's address. They may not understand the random index and the encrypted value of the location message, but they can find out that the message was published by a certain IP address and retrieved by another IP address, and thus infer that there is a relation between these two IP addresses.
A quick web search turns out at least one paper discussion this issue, "Agyaat: Providing Mutually Anonymous Services over Structured P2P Networks," by Aameek Singh and Ling Liu from Georgia Tech. Their design proposes to add to the DHT a layer of "onion routing," similar to Tor. The message requesting publication or retrieval would be first routed through a set of friendly DHT nodes to obscure its actual origin. I don't know whether that's necessary or practical, but that's certainly an option.