
Kademlia DHT (1)

August 2021

Huidong Yang

After diving into the Kademlia paper, I realized that there were lots of details that deserve proper elaboration, beyond a summary of its high-level design, which is a simple one, and simplicity is its secret advantage (though simplicity doesn't mean there's not much to it, e.g. Einstein considered his relativity theory simple). Given the abundant existing web resources that already do the summarizing, what would be the point of this piece without the extra care for details that comes out of my own learning process? And details do matter. This is where math and code differ. The provable (worst-case) properties of Kademlia, which are no doubt foundational, came from a basic prototype, but it's the further optimizations in performance, security, and usability that made Kademlia thrive in the real world. The MaidSafe DHT (the topic of the next piece) is one such example.

Therefore, although I previously thought I could do it all in a single piece, I now believe the more realistic (and ergonomic) approach is a multi-part series on Kademlia. This first one is about the very basics, but I believe they are important nevertheless.

Big Questions:

  1. What is a distributed hash table, and how does it differ from a regular hash table?
  2. What is the Kademlia routing table like?
  3. How does a node build and refresh/update its routing table?
  4. How to store and retrieve a value?
  5. What are the performance characteristics?

Little Questions:

  1. What is k and how is it chosen?
  2. Why must a node keep all known nodes in its closest subtree even if there are more than k of them?
  3. Why does the LRU-based eviction policy favor older nodes?

Q: What is a distributed hash table, and how does it differ from a regular hash table?

A regular hash table is a storage system (or data structure) in which stored data (values) can be quickly looked up (retrieved). Such quick lookups are achieved by associating each value with a unique key (or ID). You retrieve a value of interest by specifying its key. When you store a value, you also need to provide its key (alternatively, the key is auto-derived from the value, as in Kademlia, or you can configure how the key is derived in certain object storage systems such as IndexedDB).
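To ground this, a tiny sketch using Python's built-in hash table, the dict (the keys and values here are mine, purely for illustration):

```python
table = {}

# Store: the caller supplies the key alongside the value.
table["recipe-42"] = "mix flour and water"

# Retrieve: a quick lookup by the same key.
print(table["recipe-42"])  # -> "mix flour and water"
```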

Now if all the data are stored in a single place, then it's fast and simple, but 1) the storage capacity is relatively limited and not very scalable, and 2) the system is fragile, in terms of both security (data loss) and performance (request overload). That is why we need a distributed version, where the data are stored in multiple interconnected places. Data retrieval then involves coordination between storage nodes in a network, by passing certain messages (the protocol of the DHT) around. When storage runs low, new nodes can join the network to expand the capacity. When existing nodes leave, the network will do its best to react and reconfigure to maintain functional integrity (e.g. restore the redundancy level of data, recruit more nodes, etc.).

In essence, the storage/"database" functionality of a DHT is critically dependent on each node's awareness of who else is in the network, like the contacts on your phone. This information maintained by every node is its routing table. Each individual contact consists of the target node ID (its "address" in the overlay network) and its physical network address (typically IP address plus UDP port). That's why you may find it odd that most of the business logic in a DHT is about building and refreshing the routing table, rather than the data storage aspect (but hey, the routing system determines where to store and retrieve data, so really they are not two separate businesses).

Lastly, a discussion of a key technical difference between a regular and a distributed hash table:

A regular hash table is internally made up of a certain number (say N) of data containers ("buckets" or "slots", but not to be confused with Kademlia's "k-buckets", which make up the routing table). Based on the key you provide, your data will be assigned to a certain bucket, numbered from 0 to N-1. This assignment is carried out by passing the key to a "hash function", which returns a number (say x). Typically x >> N, and thus this number must be mapped to a bucket index, which is often done by taking x modulo N.

When the hash table is about to become overloaded, it needs to be resized (allocating more buckets), and consequently, as N increases, the data must be reassigned to different buckets. Now in a distributed hash table, where each node in the network acts as a bucket, such "rehashing" is prohibitively expensive. Therefore, it needs a way of hashing that mostly avoids data reassignment when the number of nodes changes (namely as nodes join and leave the network).
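To see why resizing hurts, here's a minimal Python sketch of modulo-based bucket assignment (function and key names are mine): growing from 8 to 9 buckets reassigns the vast majority of keys, which is fine in RAM but would mean a network-wide data shuffle in a DHT.

```python
import hashlib

def bucket_index(key: str, num_buckets: int) -> int:
    # Hash the key to a big number x, then map it to a bucket by x mod N.
    x = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
    return x % num_buckets

keys = [f"item-{i}" for i in range(1000)]
before = {k: bucket_index(k, 8) for k in keys}  # N = 8 buckets
after = {k: bucket_index(k, 9) for k in keys}   # resized to N = 9

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/{len(keys)} keys changed buckets")  # roughly 8/9 of them
```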

Kademlia takes a very simple approach: it uses the hash result (namely the data key) alone to determine the target bucket index (namely node ID), independent of how many nodes are in the network at the moment. Note that this means the node ID space has to be the same as the range of the hash function. For instance, Kademlia typically uses SHA-1 as the hash function, whose range is 160-bit numbers, and thus that same space is used to assign IDs to nodes. Now, there's one more issue that needs addressing. 2^160 node "seats" is a huge number, so it's common that the hash/key of a piece of data does not have a corresponding node ID at the time of the store request. The solution? Kademlia does not require an exact match; instead, it stores the data at the k closest nodes (using k nodes for data redundancy), by defining a notion of distance (bit-wise XOR in particular) between fixed-width numbers. This way, the joining/leaving of nodes won't cause global-scale data shuffling; the impact is only local. In particular, if a node that just left was among the k closest nodes for storing a number of values, then the network will find other nodes to fill the role, and similarly when a node joins. Point is, this is cheap.
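A minimal sketch of that idea on a toy 4-bit network (the helper names are mine): distance is plain XOR, and a value's key alone determines which live nodes store it.

```python
def xor_distance(a: int, b: int) -> int:
    # Kademlia's distance between two IDs: bit-wise XOR, read as an integer.
    return a ^ b

def k_closest(node_ids: list[int], key: int, k: int) -> list[int]:
    # The k live nodes whose IDs are XOR-closest to the key store the value.
    return sorted(node_ids, key=lambda nid: xor_distance(nid, key))[:k]

# 6 live nodes out of 16 possible 4-bit IDs.
nodes = [0b0001, 0b0011, 0b0110, 0b1000, 0b1011, 0b1110]
print([f"{n:04b}" for n in k_closest(nodes, key=0b0111, k=3)])
# -> ['0110', '0011', '0001']  (XOR distances 1, 4, 6)

# If node 0110 leaves, only this value's replica set changes, locally:
nodes.remove(0b0110)
print([f"{n:04b}" for n in k_closest(nodes, key=0b0111, k=3)])
# -> ['0011', '0001', '1110']  (the next-closest node fills the role)
```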

A side note re: the benefits of using the hash of a piece of data as its key. First, clients had better not have the liberty to specify keys themselves, otherwise there would be conflicts causing either request failures or data overwrites, and people could essentially control where in the network their data are stored, which is a vulnerability. Second, by deriving the key from the data itself using a hash function, storing the same data twice yields the same key, so the network deduplicates instead of conflicting, saving space. Lastly, hash functions are good at randomly scattering data around the network, so that the nodes are load balanced.
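Content addressing in a few lines (using SHA-1 as the paper does; the function name is mine):

```python
import hashlib

def key_for(value: bytes) -> bytes:
    # The DHT key is derived from the value itself: 160 bits,
    # i.e. the same space as node IDs.
    return hashlib.sha1(value).digest()

assert key_for(b"hello kademlia") == key_for(b"hello kademlia")
# Identical data -> identical key -> stored once, not duplicated.
```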


Q: What is the Kademlia routing table like?

Building the routing table is essentially getting to know who else is in the network. But this information can be ginormous: if a node ID is a 160-bit number, then there can be up to 2^160 nodes to know about. Kademlia uses a simple yet highly effective way to partition the ID space into a much smaller number (160, i.e. just the logarithm of the total number of possible nodes) of "distance classes", and for each class, a node has to know up to k nodes (k was 20 in the original paper, and such a class is called a k-bucket). In particular, each node splits the space into subtrees covering 1/2, 1/4, 1/8, ..., 1/(2^160) of all possible nodes, anchored at the node's own ID.

Kademlia does this partitioning mainly to place a low demand for network knowledge on individual nodes (who serve as request handlers), the worst-case space requirement being k * log T, where T is the total number of possible nodes. Essentially, a node needs to be an expert on its nearby neighbors (namely, it can directly answer the question "what's node X's contact info", e.g. IP address plus UDP port), but it only acts as a question forwarder when asked about a faraway node ("I don't know where node X is, but I know another guy Y who surely knows X better, so I will give you Y's contact info instead"). Note that this notion of "distance" is purely mathematical (in this case, bit-wise XOR) and, intentionally, uncorrelated with physical distance.

To see concretely how the partitioning works, let's take a smaller example, where IDs are 4 bits (so there are 2^4 = 16 possible nodes), and look from the POV of node 0000 to see how the space is split into log 16 = 4 buckets:

  • bucket 4: ID 1*** (8 nodes or 1/2, distance 8-15)
  • bucket 3: ID 01** (4 nodes or 1/4, distance 4-7)
  • bucket 2: ID 001* (2 nodes or 1/8, distance 2-3)
  • bucket 1: ID 0001 (1 node or 1/16, distance 1)

Using XOR distance, we can view these buckets as neighborhoods of node 0000, where bucket 1 is the closest, and bucket 4 the farthest.
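These bucket numbers can be computed directly: the bucket index is just the bit length of the XOR distance (a convention I'm adopting here; real implementations may index from 0 or count shared prefix bits instead):

```python
def bucket_of(own_id: int, other_id: int) -> int:
    # Bucket number = bit length of the XOR distance:
    # distance 1 -> bucket 1, 2-3 -> bucket 2, 4-7 -> bucket 3, 8-15 -> bucket 4.
    return (own_id ^ other_id).bit_length()

own = 0b0000
for other in range(1, 16):
    print(f"{other:04b} -> bucket {bucket_of(own, other)}")
```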

In fact, this rule of partitioning is most naturally visualized using a binary tree, where left and right branches are consistently labeled 1 and 0, the nodes are the leaves, and their IDs are the root-to-leaf paths. Then visually, bucket 4 is the farthest and largest subtree, and bucket 1 the smallest and closest.

Now the above example shows a network that contains all possible nodes, namely a perfect binary tree. But in real-world networks where IDs are 160 bits or more, the network has far fewer nodes than 2^160. So visually, do we still use a perfect binary tree? What happens if we do? (To help visualize, draw the 4-bit perfect binary tree, pick e.g. 8 evenly spaced random IDs, and mark them at the leaves.) Then we see that at the leaf level, most nodes have no sibling. What does that mean for such an "only child"? Its closest neighborhood ("bucket 1") is empty. In a more extreme case, if there is only one node whose ID starts with 1, then its buckets 1, 2, and 3 are all empty; only bucket 4 (0***) contains nodes. Pre-allocating empty buckets is wasteful. The routing table should be built dynamically (more below).

In the paper, section 2.2 "Node state" says a node stores a bucket for all 160 distance classes, which sounds like pre-allocating empty buckets. However, my interpretation is that this section is more about "complexity analysis" (worst-case space cost) than "software engineering", i.e. it's not concerned with implementation efficiency. Section 2.4 "Routing table", on the other hand, describes a realistic implementation: the dynamic growth of the routing table (which is itself viewed as a "binary tree" that starts with one "node", namely a single k-bucket, and keeps splitting when necessary as new contacts are encountered).

Specifically, the routing table of a node starts as a single k-bucket covering the entire ID/distance range, and then follows a specific splitting rule: when a bucket is full upon encountering a new contact, check whether its ID range contains the node's own ID (equivalently, in terms of distance range, whether it is the closer of the two would-be sibling buckets). If yes, split it into two buckets, each covering half the range of the old bucket, and then add the new contact. If no, don't split; instead, use an LRU-based rule (more later) to decide whether the new contact should replace an existing dead one, or be discarded. (Note: this eviction policy is discussed in section 2.2 Node state; however, in section 2.4 Routing table, the paper instead says "the new contact is simply dropped", which I suspect is a mistake? I don't think the paper means to literally maintain both "node state" and "routing table" as two separate data structures with different construction rules.)
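Here is a simplified Python sketch of that splitting rule, under my own naming and with liveness checks and the LRU eviction stubbed out; it's an illustration under those assumptions, not a faithful implementation:

```python
K = 20  # bucket capacity, the paper's choice of k

class Bucket:
    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi      # covers IDs in [lo, hi)
        self.contacts: list[int] = []  # least recently seen first

class RoutingTable:
    def __init__(self, own_id: int, id_bits: int = 160):
        self.own_id = own_id
        # Start with a single bucket covering the entire ID space.
        self.buckets = [Bucket(0, 1 << id_bits)]

    def _bucket_for(self, node_id: int) -> Bucket:
        return next(b for b in self.buckets if b.lo <= node_id < b.hi)

    def add(self, node_id: int) -> None:
        b = self._bucket_for(node_id)
        if node_id == self.own_id or node_id in b.contacts:
            return
        if len(b.contacts) < K:
            b.contacts.append(node_id)
        elif b.lo <= self.own_id < b.hi:
            # Full bucket on "our" side of the range: split and retry
            # (may split repeatedly as the range halves).
            self._split(b)
            self.add(node_id)
        else:
            pass  # full faraway bucket: the LRU rule decides (ping the
                  # least recently seen contact, evict it only if dead)

    def _split(self, b: Bucket) -> None:
        mid = (b.lo + b.hi) // 2  # split on the next bit of the ID
        lower, upper = Bucket(b.lo, mid), Bucket(mid, b.hi)
        for c in b.contacts:
            (lower if c < mid else upper).contacts.append(c)
        i = self.buckets.index(b)
        self.buckets[i:i + 1] = [lower, upper]
```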

So we see, from the way the routing table is built, that a node dedicates more resources (k-buckets) to learning about contacts that are closer to it.

Now besides this dynamic process that happens in a real network, there is also a static way of visualizing the node state or routing table against a snapshot of the network, namely, one where no node is joining or leaving. Specifically, for any particular node in the snapshot, how many k-buckets will it need to build? Again, if the network were full, we know the answer would be 160. But typically, the node IDs are sparse, and most of the "theoretical" buckets don't need to be considered. In section 2 System description, the paper presents a way to "compress" the original network, where every node is a leaf at the bottom (160th) level of a perfect binary tree, into a minimal representation such that the compressed path (namely ID) of a node indicates how many k-buckets it will need to build (against the network snapshot).

For starters, let's assume the node IDs form a fairly balanced binary tree (which can be achieved by a "super-node network", for example). Ready? First, we know that for a node, the closer a distance class (k-bucket) is, the fewer (exponentially fewer) total possible nodes can fall in its distance range (or equivalently, node ID range). E.g. the very closest distance class can have at most 1 node. Naturally, such closer buckets are more likely to cover no node that exists in the snapshot. Therefore, an effective approach to compressing the routing table or node state is to "cut the thin tail" of small-range buckets, that is, to forget about all such nearby buckets, starting from the closest ("bucket 1" in the 4-bit example above), until reaching a bucket that contains at least one node that exists (but is not necessarily yet encountered) in the current network. That is, the routing table will build 160 - (j - 1) buckets, where the j-th bucket is the closest non-empty bucket. It turns out, this number is the same as the length of the shortest unique prefix of the node's ID in the snapshot network. In a way, the length of the "thin tail to cut" is j - 1.

To better understand this, we need to see how this thin tail of the j - 1 closest buckets corresponds to the lowest j - 1 bits of the node ID, which are not necessary to uniquely identify the node in the current network. Needless to say, any path in a binary tree is unique, but the question is how to compress a path to its minimum. To find the solution yourself, I again recommend drawing a 4-level perfect binary tree, picking 8 random 4-bit IDs, drawing their paths on the network (initially all ending at the bottom level), and then finding a way to derive their shortest unique prefixes. The key is to observe that, from the bottom up, the very first branching point is where you are no longer unique (because at least one other node also goes through this path), so the shortest unique path is just one extra bit down from that branching point. Interestingly, when you compress the original node ID to its shortest unique prefix, this new leaf node naturally has a "sibling subtree", which can be either another leaf or the root of some lower-level leaves, and this sibling subtree corresponds to the closest non-empty k-bucket! And if you go up along this shortest unique prefix, at each higher branching point you will see another subtree, which corresponds to a farther-away k-bucket. Since the length of the shortest unique path is equal to the number of branching points, we see that after the path compression, the path length indicates how many k-buckets a node's routing table will build against the network snapshot.
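To check your drawing, here's a small sketch (helper names are mine) that computes each node's shortest unique prefix length in a 4-bit snapshot, which by the argument above equals the number of k-buckets its routing table ends up with:

```python
def common_prefix_len(a: int, b: int, bits: int = 4) -> int:
    # Shared prefix length of two IDs viewed as `bits`-bit strings:
    # the XOR's bit length marks the highest differing bit.
    return bits - (a ^ b).bit_length()

def shortest_unique_prefix_len(node_id: int, snapshot: list[int],
                               bits: int = 4) -> int:
    # One bit past the longest prefix shared with any *other* node.
    longest_shared = max(common_prefix_len(node_id, other, bits)
                         for other in snapshot if other != node_id)
    return longest_shared + 1

snapshot = [0b0001, 0b0011, 0b0110, 0b1000, 0b1011, 0b1110]
for n in snapshot:
    p = shortest_unique_prefix_len(n, snapshot)
    print(f"{n:04b}: shortest unique prefix length = {p} -> {p} k-buckets")
```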

Note: in contrast to this bottom-up process of path shortening (essentially telling us how many nearby buckets are doomed to be empty and thus should be ignored), the dynamic construction of the routing table is a top-down process: it starts with one bucket covering the entire ID range and then splits recursively on the near half of the range, so it doesn't pre-calculate how many nearby buckets it can skip building. After all, the dynamic process is what really happens, and a node doesn't need to know about this space saving :) Again, my interpretation is that the snapshot analysis based on the shortest unique path is only for humans to understand and appreciate the fact that, given a large enough ID space such as 160 bits and a way to maintain a fairly balanced ID assignment, the number of k-buckets that need to be built and maintained is typically far less than the full ID length. This "tree compression" is intrinsic and transparent, not something you need to do in the algorithm.