docs/design: Kademlia Removal Design Proposal (#2686)

This commit is contained in:
Jennifer Li Johnson 2019-08-20 15:48:35 -04:00 committed by GitHub
parent 513563eff3
commit 9df2ec6a37
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -0,0 +1,89 @@
# Kademlia Removal
## Abstract
This design document outlines the communication protocol between satellites and
storage nodes, network refreshes, and kademlia removal given the satellite opt-in
capability for storage nodes.
## Background
Many peer-to-peer, decentralized systems employ the Kademlia implementation of a distributed hash table to allow for
locating peer nodes, exchanging messages and sharing data. However, due to the nature of our network, we only use Kademlia
for node discovery and address lookups given node IDs. This is useful when satellites dont know about all the nodes in
the network and nodes are unfamiliar with all of the satellites in the network.
With our recent business decision of simplifying the storage node operator user experience, we no longer require kademlia
for node discovery. In a solution called SNO-select, storage nodes operators manually select the satellites they want to work with
and satellites wait for storage nodes to work with them. The initial implementation of this solution allows SNOs to update
their trusted satellite list in a hardcoded configuration file, but future improvements will enable users to manage
this list through a web console.
We will replace our Kademlia DHT and related entities with direct communication between satellites and storage nodes,
and keep the network fresh without kademlia node discovery and random lookups.
## Design
### Nodes reach out to satellites that they want to work with
- The satellites are listed in the trust package
- Nodes should communicate with satellites directly rather than using kademlia to traverse the network to find the address of a given ID.
- Storage nodes should notify satellites when they start up, wait a random amount of time (to add jitter
http://highscalability.com/blog/2012/4/17/youtube-strategy-adding-jitter-isnt-a-bug.html), then start reporting in roughly on the hour
### Network refreshes at a regular interval
- Nodes will keep themselves up to date in the network by pinging all the satellites in their
trusted list every hour.
- Satellites will ping the nodes back to confirm their addresses
- If is it successful, the satellite will insert or update the node in the overlay cache and
notify the node of success. Make sure to close the connection. Dont use the transport observer to update the cache.
Update the IP and uptime directly.
- If the satellite does not confirm the node address, it does not proceed with updating the overlay cache. The node
receives a log message and closes the connection when it times out.
### Disintegrate Kademlia from the network, storj sim and testplanet setups
- Remove kademlia from the discovery package
- Remove the bootstrap node - work with Ops
- Remove the vouchers service and related tables
- Work with QA to make sure storage nodes dont crash on errors related to the elimination of Kademlia
- if they dont update immediately -> keep just the overlay.Ping rpc method, it will be much easier for a new satellite
to work with old and new storage nodes.
### Update whitepaper to address kademlia removal and the addition of satellite opt-in
- Delete the audit gating design doc
- Update the wiki
## Implementation
- [Nodes should communicate with satellites directly](https://storjlabs.atlassian.net/browse/V3-2274)
- [Network refreshes at a regular interval](https://storjlabs.atlassian.net/browse/V3-2275)
- [Remove the overlay cache from transport observers](https://storjlabs.atlassian.net/browse/V3-2305])
- [Delete Kademlia](https://storjlabs.atlassian.net/browse/V3-2276)
- [Update Documentation](https://storjlabs.atlassian.net/browse/V3-2461)
## Future considerations
### Selected satellite management
- Currently, a storage node operator can input a list of satellite IDs and addresses into their configuration file on setup.
Several tardigrade-level satellites are included by default.
- Next steps are to allow users to modify their selected satellites list through a web based console.
- The satellite list will need to be stored in a sql table or equivalent for persistence
### NodeID updates
- Is there anything that we should redesign regarding the nodeID and node data structures?
- Do we need the node dossier any longer?
### Retiring the Transport Observer
- Since the routing table and the overlay cache are two of the main features that use the transport observer, and we move to directly
update the overlay cache, we can remove the transport observer and remaining dependent services.
- This would simplify uptime checks, and the node uptime column would be updated less frequently.
### Node -> satellite communication initiation
- To save resources and improve performance, satellites can have a “tip box” to receive UDP messages about new nodes
- UDP messages need the address of the node, the certificate chain to be expected once talking to the node that identifies
the node, and a signature of the above things with the leaf private key of that certificate chain.
- Messages will ultimately be ignored if the difficulty of the computed ID isn't high enough or the node you end up talking
to doesn't have the same node id as the one computed from the tipster certificate chain.