storj/docs/blueprints/kademlia-removal.md
JT Olio c81e4fcb9e
design/blueprints (#2927)
we've decided to rename design docs to blueprints to indicate that we don't intend to keep blueprints up to date once completed. instead, the final step of blueprint implementation is to archive the blueprint and update actual documentation elsewhere (potentially copying and pasting most of the blueprint). this PR updates the template and readme
2019-09-09 08:55:05 -06:00

5.4 KiB
Raw Permalink Blame History

Kademlia Removal

Abstract

This design document outlines communication protocol between satellites and storage nodes, network refreshes, and kademlia removal.

Background

Many decentralized systems use Kademlia distributed hash table to find peers, exchange messages, and share data. However, Storj only needs it for discovery and address lookups. Kademlia is useful when satellites dont know about all the nodes in the network and nodes don't know the satellites.

To improve user experience, we decided to use an opt-in to satellites. Opt-in means each storage node operator selects the satellites they want to work with. As a result, storage nodes can notify satellites, without discovery.

The initial implementation has a hardcoded list of satellites in a configuration file. Future improvements would add capabilities to manage the list dynamically.

As a result of all these decisions, we can remove Kademlia and replace it with direct communication.

Design

To replace Kademlia, we need to complete several things:

  • replace storage node initial communication with the satellites,
  • replace network refreshing,
  • remove kademlia from services,
  • update documentation.

Storage Node initial communication

Storage node should connect satellites in their trusted list and notify that they want to work with them.

We need to ensure that we do not overload the satellite during upgrades. Hence, we need to add jitter for refreshes and initial communication.

For more information on jitter see http://highscalability.com/blog/2012/4/17/youtube-strategy-adding-jitter-isnt-a-bug.html .

Network refreshing

Storage Nodes keep themselves up to date in the network by pinging all the satellites in their trusted list. Refreshes would happen every 1 hour. Satellites, in response, will ping the nodes to confirm their address and ensure that the network is configured correctly.

When a Satellite has successfully pinged the storage node, it will update IP and uptime in overlay. On failure, the satellite does not update overlay and notifies the storage node.

Storage Node keeps track of this information, such that Storage Node operator can notice the problem.

We consider a successful ping when a node with node-ID N has contacted satellite S and claimed its address is A,

  1. S must initiate a network connection C to address A
  2. S must verify that the remote endpoint on C has a private key corresponding to public key/node ID N. (It would be sufficient to complete negotiation of an SSL session over C and then verify that the remote end is using the same certificate used by N in the initial incoming ping.)
  3. S should verify that the remote endpoint on C agrees that its address is A. (This doesn't seem strictly necessary for security, but could prevent misconfigurations where a storage node operator runs multiple nodes with the same identity.)
  4. S should respond to the initial incoming ping from N with the result of the pingback, so that the dashboard on that node can report whether the node can receive incoming connections.

Kademlia Removal

Once we have replaced the necessary pieces, we can remove Kademlia from the codebase:

  • remove kademlia from discovery package,
  • remove bootstrap node and server, and
  • remove vouchers.

During all of these removals, we need to ensure that existing nodes do not break during upgrades. It might be easier to replace endpoints with stubs, such that existing calls keep working until the network is fully upgraded.

Update Documentation

We should update our documents with this major design change. We need to update the whitepaper, audit gating design document, and wiki.

Implementation

Future considerations

Satellite Management User Interface

Currently, Storage Node Operator can specify the list of satellites in the configuration. Changing this configuration requires a restart and is not convenient.

Next steps would be to have a satellite management interface in the web-based console. This means we need to store the satellite list in a database.

NodeID updates

We should review whether we want to change NodeID or related data structures. We may be able to simplify them further. As an example, it might be possible to remove NodeDossier.

Retiring the Transport Observer

Transport observers currently update routing table and overlay during each connection. Since routing table will be removed together with Kademlia, we can also simplify this design.

We can update uptime without using hooks, allowing to remove transport observer and hooks from the codebase. It would also allow more clearly handle batching of uptime updates.

Performant Storage Node and Satellite Pinging

We can reduce bandwidth and improve performance with UDP pinging. Satellites can have a “tip box” to receive UDP messages about new nodes. UDP messages need to be signed by the node and contain the address and the certificate chain. Node ID-s with low difficulty are ignored.

Satellites regularly ping back the storage nodes and use the same protocol as described above.