Last month I went through David Irvine's paper on MaidSafe DHT. Unlike the Kademlia paper, it is not a complete description of the design, but rather a very brief summary of various optimizations meant to improve on the original Kademlia. I know there is much more to learn about MaidSafe DHT, and probably the best way is to dive into the code (esp. the routing crate), but as an intro piece I think it's still worth the effort to look at what MaidSafe is trying to achieve, at a high level, beyond Kademlia, and why. As a side note, Irvine is not an academic (in the traditional sense), and it shows in the whitepaper, which isn't even proofread: he is a very practical guy, more concerned with what is actually built than with what is said (although he is capable of expressing his ideas verbally in captivating ways). My point is, one cannot expect to fully learn the tech by reading only the whitepapers (the early documents); the tech is also evolving constantly. Those of us with some formal (and mostly passive) education seem to be spoiled, expecting a set of papers out there that covers and explains it all in a neat package. Not in the real world! Information needs to be hunted for, digested, integrated, and continually updated, all on our own. That's what the frontier is like.
So if the original Kademlia paper gave a false impression of academic comfort, now is the time to get real and become the info organizer yourself. To kick things off, I'm going to jot down the impressions I hold so far regarding the design of MaidSafe DHT. Note that other vital aspects of the Safe Network, such as encryption, authentication, storage (via CRDT), and economy (via DBC), are outside the scope of this post.
Design Highlights
So how is MaidSafe DHT different from Kademlia?
Kademlia is only really good for storing and retrieving immutable data (write once, read many), as in file sharing; handling data mutation consistently and efficiently is hard. MaidSafe DHT aims to be more versatile, including serving as a DynamoDB-like database.
Kademlia is purely a routing algorithm, while MaidSafe DHT is meant to be actual production-grade p2p software. So unlike Kademlia, which assumes only the bare minimum from the transport layer (namely UDP), MaidSafe takes advantage of the state of the art in network protocols (currently QUIC, previously the in-house Crust), offering so-called "managed connections" that are not only predictably reliable (like TCP) but also high-performance, so that the peer-to-peer network can be more real-time. This way, dead-node removal, for instance, can be accomplished by sending cheap probing messages in real time, as opposed to relying on periodic refreshing (e.g. hourly, as in the original Kademlia paper), which is itself a potentially expensive process that warrants optimization algorithms of its own.
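As a toy illustration of why this matters (my own sketch with hypothetical names, not actual safe_network code): over a managed connection, liveness checking reduces to a cheap probe-and-evict loop, whereas classic Kademlia has to periodically re-look-up IDs in every stale bucket.

// Hypothetical stand-in for a qp2p-style managed connection.
trait ManagedConnection {
    // A tiny probe message; a managed (QUIC) connection reports failure promptly.
    fn probe(&self) -> Result<(), ProbeError>;
}

struct ProbeError;

// With cheap, reliable probes, dead peers can be evicted in near real time,
// instead of waiting for an hourly bucket refresh as in the original Kademlia.
fn prune_dead_peers<C: ManagedConnection>(peers: &mut Vec<C>) {
    peers.retain(|conn| conn.probe().is_ok());
}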
Disjoint sections with group consensus: in Kademlia, all nodes are equal, while here a group of old, reliable nodes is promoted to "elder" status to govern a disjoint section of the network, identified by an ID prefix shared by all members of the section. A split occurs when a section accumulates too many nodes (e.g. beyond 200): the two resulting sections each take a prefix one bit longer than before. Essentially these groups of elders form a super-node network that takes on more critical, administrative responsibilities beyond what a regular node (an "adult") does, i.e. merely storing data. Indeed, most of the logic complexity of the network lies in what these elders do throughout the lifecycle, but whatever they do, they cannot act without reaching supermajority consensus in the group (e.g. of size 7), and that is one of the key security mechanisms of the network (it's much harder to corrupt multiple independent agents than a single one). A major advantage of partitioning the network (by ID prefix) into sections of limited capacity is that the maximum routing burden is bounded from above, so within a section every member can easily afford to know everyone else. Recall that in Kademlia, each node's routing table has to cover the entire network ID space (one just needs to know far less about IDs far away), so it takes time to gradually build up the routing table as nodes encounter each other through message passing. In contrast, here a new node receives the entire member list all at once upon joining a section, and only the elders of a section additionally accumulate knowledge about other sections' members in order to handle inter-section routing. Recall that in the Kademlia paper, routing table size is capped at k * N, where k = 20 (the size of a k-bucket) and N = 160 (ID bits), which is still quite something at the maximum. In the Safe Network (SN), N = 256, so it indeed takes this added hierarchical structure to make routing manageable for the bulk of the nodes (there are far more adults than elders, and only sufficiently reliable and high-performing adults get promoted).
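To make the split rule concrete, here is a toy sketch of my own (made-up types, and SECTION_CAP is a placeholder, not the real threshold): when a section outgrows its capacity, its members are partitioned by the next bit of their IDs, so each child section's prefix ends up one bit longer.

type Name = Vec<bool>; // toy stand-in for a 256-bit XorName

#[derive(Clone)]
struct Prefix {
    bits: Vec<bool>, // the ID bits shared by every member of the section
}

impl Prefix {
    fn matches(&self, name: &Name) -> bool {
        name.starts_with(&self.bits)
    }

    // The two child prefixes, each one bit longer than the parent.
    fn split(&self) -> (Prefix, Prefix) {
        let (mut zero, mut one) = (self.bits.clone(), self.bits.clone());
        zero.push(false);
        one.push(true);
        (Prefix { bits: zero }, Prefix { bits: one })
    }
}

const SECTION_CAP: usize = 200; // hypothetical capacity threshold

// Partition an oversized section's members into the two child sections.
fn maybe_split(prefix: &Prefix, members: Vec<Name>) -> Option<(Vec<Name>, Vec<Name>)> {
    if members.len() <= SECTION_CAP {
        return None; // still within capacity, no split
    }
    let (zero, _one) = prefix.split();
    Some(members.into_iter().partition(|m| zero.matches(m)))
}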
Node lookup is truly recursive rather than iterative as in Kademlia (even though the Kademlia paper used the term "recursive"). I have yet to see this explicitly in the actual code, but conceptually, recursive means that node A, receiving the initial request from node 0, forwards the message to another node B if A knows that B holds better knowledge about the nodes closest to the target ID, until eventually some recipient node X knows that no other node is a better expert on the query than itself, at which point X sends the final answer back to node 0. In contrast, recall that in Kademlia the original requester iteratively sends the message to a number (alpha) of nodes that should progressively know the answer better, until no nodes closer to the target ID are reported back. Now the question is, why is recursive better than iterative, and in what respects? In terms of messaging overhead, perhaps: no need to send multiple (alpha) requests at once, and far fewer response messages sent back to the requester. Recall why Kademlia picks alpha = 3: any given node may well be dead. But since MaidSafe DHT, by using managed connections, forms a network that is refreshed in near real time, a node is unlikely to be dead, and I guess under such an assumption of superior connectivity, recursive is more efficient. Note, though, that Kademlia search is already O(log n), so it's not really slow to begin with.
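Here is how I picture the recursive style, as a self-contained toy (tiny 64-bit name space, hypothetical Node type; the real code surely looks nothing like this):

type Name = u64; // toy name space; the real one is 256-bit

struct Node {
    name: Name,
    peers: Vec<Node>, // whatever routing knowledge this node holds
}

fn distance(a: Name, b: Name) -> Name {
    a ^ b // XOR metric, as in Kademlia
}

impl Node {
    // Recursive style: each hop forwards the query to a strictly closer peer;
    // only the final node, knowing no one closer than itself, answers the
    // origin, so the requester sends one message and receives one reply.
    fn recursive_lookup(&self, target: Name) -> Name {
        let next = self
            .peers
            .iter()
            .min_by_key(|p| distance(p.name, target))
            .filter(|p| distance(p.name, target) < distance(self.name, target));
        match next {
            Some(closer) => closer.recursive_lookup(target),
            None => self.name, // no better expert: reply to the requester
        }
    }
}

// In iterative Kademlia, by contrast, the origin itself drives the loop,
// querying alpha (e.g. 3) peers in parallel at every step and collecting
// all their replies, which multiplies the message count.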
Data, like nodes, is also specialized into various "classes", or "tags", with different caching policies, for example. This is mainly for the sake of giving applications a more semantic and optimized data API, so that the network can be used like a big database or even a big server, rather than just for sharing files.
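Purely as an illustration of the idea (these are not the actual type names), one can imagine each data class carrying its own policy:

// Illustrative only: data "classes" with per-class handling policies.
enum DataClass {
    // Content-addressed and write-once, so it can be cached anywhere, forever.
    ImmutableChunk,
    // Mutable, so caching must respect versioning/consistency.
    MutableRecord { version: u64 },
}

fn cacheable(data: &DataClass) -> bool {
    match data {
        DataClass::ImmutableChunk => true,
        DataClass::MutableRecord { .. } => false,
    }
}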
A detail: the parameter k, as in k-buckets (which reflects the expected maximum number of simultaneous node failures between refreshes), is decoupled from the data replication factor, so that routing table overhead can be reduced (e.g. k = 2 instead of 20), again thanks to the improved connectivity, without degrading storage reliability.
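In other words (hypothetical numbers, for illustration only), the two knobs that classic Kademlia fuses into one can now be tuned independently:

// Routing redundancy and storage redundancy, decoupled:
const KBUCKET_SIZE: usize = 2;       // k: routing-table slots per bucket
const REPLICATION_FACTOR: usize = 4; // copies of each data chunk kept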
Code Reading
Frankly, I didn't know where to start, being an utterly inexperienced code reader. At first I wanted to locate the logic for the recursive node lookup, but got lost. My alternative plan was to look for the data store logic, because in Kademlia you have to first find the k closest nodes in order to store data on them. Still clueless.
Then I thought of looking at the examples, particularly routing_minimal.rs, and happily found the simplest use of the routing library. But when I looked at the other examples, which eventually led me to bin/sn_node.rs, I realized that besides Routing::new() there's also Node::new(), and there's definitely more to the code base than the routing crate.
After a few more fits of stumbling around, I finally decided to start from the very entry point, namely the sn_node module, and trace at least one complete path of call stacks, through and through (by which I mean exiting at a raw network call in qp2p). This might serve as a good orientation.
At first, though, I found it very hard to keep track of things; after just one day I would struggle to recall what I had read. That's when I realized that just reading it through wouldn't work, so I decided to take notes. Initially I used free-form Markdown text, but that didn't scale well; I needed a more organized approach. Then it occurred to me that tree-style note-taking apps (e.g. Dynalist) might work better, and I tried one. It turned out to be a perfect fit. One note on usage: a single call stack should be organized hierarchically (parent-child style), i.e. using indentation, instead of as a flat list (as typically presented in error messages). Why? Because within a single function there can often be multiple things to do, and each of these is the root of a separate call stack (that is, multiple subtrees under a single root), which is where sibling entries come in handy. Essentially, a complete function-call trace requires a tree, not just a list.
Plus, each note entry has a secondary field that supports multi-line text, which is perfect for recording the file location (or module path) of each function, as well as explanatory notes.
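For instance, the top of my join-flow trace (using functions I'll discuss below) looks roughly like this, where each indented child is a call made by its parent, and the parenthesized lines live in the secondary fields:

main()
    (bin/sn_node.rs)
    Node::new()
        Routing::new()
            join_network()
                (routing/core/bootstrap/join.rs)
                Join::join()
                    (the eventual "meat" of the join logic)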
With this system, I think one can trace any piece of software end to end from a single entry point, and that is a liberating feeling for an inexperienced code reader. No matter how daunting a massive code base looks, you know there is a way to walk it without getting lost.
I noticed while tracing that lots of functions are just abstraction layers calling further abstractions, until finally the real logic appears, so seeing the "meat" of the code is the real reward after all the peeling effort. For example, Dispatcher::try_handle_command and delivery_group::delivery_targets, both in the routing crate, are of substance. In particular, candidates() is essentially the "find the k closest nodes" logic of Kademlia, though it is quite different from the original, mainly because SN has disjoint sections. The following is from "src/routing/core/delivery_group.rs", as of v0.28.0.
// Obtain the delivery group candidates for this target
fn candidates(
    target_name: &XorName,
    our_name: &XorName,
    section: &Section,
    network: &NetworkPrefixMap,
) -> Result<(Vec<Peer>, usize)> {
    // All sections we know (including our own), sorted by distance to `target_name`.
    let network_sections = network.all();
    let sections = iter::once(section.authority_provider())
        .chain(network_sections.iter())
        .sorted_by(|lhs, rhs| lhs.prefix.cmp_distance(&rhs.prefix, target_name))
        .map(|info| (&info.prefix, info.elder_count(), info.peers()));

    // gives at least 1 honest target among recipients.
    let min_dg_size = 1 + ELDER_SIZE - supermajority(ELDER_SIZE);
    let mut dg_size = min_dg_size;
    let mut candidates = Vec::new();
    for (idx, (prefix, len, connected)) in sections.enumerate() {
        candidates.extend(connected);
        if prefix.matches(target_name) {
            // If we are last hop before final dst, send to all candidates.
            dg_size = len;
        } else {
            // If we don't have enough contacts send to as many as possible
            // up to dg_size of Elders
            dg_size = cmp::min(len, dg_size);
        }
        if len < min_dg_size {
            warn!(
                "Delivery group only {:?} when it should be {:?}",
                len, min_dg_size
            )
        }
        if prefix == section.prefix() {
            // Send to all connected targets so they can forward the message
            candidates.retain(|node| node.name() != our_name);
            dg_size = candidates.len();
            break;
        }
        if idx == 0 && candidates.len() >= dg_size {
            // can deliver to enough of the closest section
            break;
        }
    }
    candidates.sort_by(|lhs, rhs| target_name.cmp_distance(lhs.name(), rhs.name()));

    if dg_size > 0 && candidates.len() >= dg_size {
        Ok((candidates, dg_size))
    } else {
        Err(Error::CannotRoute)
    }
}
To be honest, I found it not as neat or elegant as I expected, in the sense that it's hard to exhaust all the cases in one's head and check that it always does the reasonable thing in every possible situation. But I guess not all practical business can be reduced to a bite-size treat. After all, that's why people write tests, isn't it? We are just not naturally good at catching bugs in such slippery places.
In addition, I found a few big enums that serve as a good map of what the network can do overall, such as Event (mapped into NodeTask, a simple wrapper), NodeDuty (which includes sending the two message enums, SystemMsg and ServiceMsg), and Command (what to do in response to various events). Essentially, this is what the main event loop is all about.
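Conceptually, the loop has this shape (a gross simplification of my own, with made-up variants, not the real enum definitions):

// Made-up variants standing in for the real Command enum.
enum Command {
    HandleMessage(String),
    SendMessageToNodes { recipients: Vec<String>, msg: String },
    ScheduleTimeout(u64),
}

fn handle_message(_msg: String) -> Vec<Command> { Vec::new() }
fn send_message(_to: Vec<String>, _msg: String) -> Vec<Command> { Vec::new() }
fn schedule_timeout(_token: u64) -> Vec<Command> { Vec::new() }

// The heart of the event loop: every event becomes one or more commands,
// and handling a command may spawn follow-up commands in turn.
fn run(mut queue: Vec<Command>) {
    while let Some(cmd) = queue.pop() {
        let follow_ups = match cmd {
            Command::HandleMessage(msg) => handle_message(msg),
            Command::SendMessageToNodes { recipients, msg } => send_message(recipients, msg),
            Command::ScheduleTimeout(token) => schedule_timeout(token),
        };
        queue.extend(follow_ups);
    }
}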
In contrast, the initialization/setup code, namely the various new() functions, is less complex. One curious thing I read in the SN Primer was that upon joining the network (bootstrapping), a node receives the section's routing table all in one go. This is quite different from Kademlia, in which a newly joined node has to issue a self-lookup request to start building its own routing table gradually.
So where is this logic? In Routing::new(), the joining node calls join_network(), located in "routing/core/bootstrap/join.rs" (eventually calling Join::join() as the meat), which returns a Section in addition to a node. This Section, defined in "src/messaging/system/section", contains the info (ID, IP) of all the member nodes of the section, and this is the "section's routing table".
pub struct Section {
    pub genesis_key: BlsPublicKey,
    pub chain: SecuredLinkedList,
    pub section_auth: SectionAuth<SectionAuthorityProvider>,
    pub members: SectionPeers,
}
Again, because of the introduced super-node (elder) layer (a small group of elders governing each size-capped section), storing and sharing the entire routing table of a section becomes predictably inexpensive. From what I can see, non-elders (adults) only know about the members of their own section; only the elders are aware of the sections beyond.
Finally, in terms of code organization, not everything is neatly placed (which I think is very hard to do perfectly for a big project like this). Quite a few tightly related functions, for instance, are spread across the routing and messaging crates, and it's not always obvious why a certain piece of logic lives in one crate rather than the other, e.g. Section and related code are defined in messaging rather than routing, which is conceptually hard to understand at first. Of course, there could very likely be strong reasons behind the decision; all I'm saying is that, for newbies at least, jumping back and forth between these crates to follow a piece of logic can be taxing. Overall, though, I'm very glad the team made the big decision to combine all the code into the newly created uber-repo named safe_network, which, besides the main goal of eliminating version mismatches, frees us from having to clone a dozen separate crates to study the code base. As a bonus, the merge also seems to have significantly reduced the total size of the target folder, down to a mere half a gigabyte or so, whereas previously, compiling just a single crate could generate more than 2 gigabytes of artifacts.