docs/design: Kademlia Removal Design Proposal (#2686)
Commit 9df2ec6a37 (parent 513563eff3) adds `docs/design/kademlia-removal.md` (new file, 89 lines).

# Kademlia Removal

## Abstract

This design document outlines the communication protocol between satellites and storage nodes, network refreshes,
and the removal of Kademlia, given the satellite opt-in capability for storage nodes.

## Background

Many peer-to-peer, decentralized systems use Kademlia, a distributed hash table, to locate peer nodes, exchange
messages, and share data. However, due to the nature of our network, we only use Kademlia for node discovery and for
looking up addresses given node IDs. This is useful when satellites don’t know about all the nodes in the network and
nodes are unfamiliar with all of the satellites in the network.

Following our recent business decision to simplify the storage node operator (SNO) user experience, we no longer require
Kademlia for node discovery. In a solution called SNO-select, storage node operators manually select the satellites they
want to work with, and satellites wait for storage nodes to reach out to them. The initial implementation of this
solution allows SNOs to update their trusted satellite list in a hardcoded configuration file, but future improvements
will enable users to manage this list through a web console.

We will replace our Kademlia DHT and related entities with direct communication between satellites and storage nodes,
and keep the network fresh without Kademlia node discovery and random lookups.

## Design

### Nodes reach out to satellites that they want to work with
- The satellites are listed in the trust package.
- Nodes should communicate with satellites directly rather than using Kademlia to traverse the network to find the address of a given ID.
- Storage nodes should notify satellites when they start up, wait a random amount of time (to
  [add jitter](http://highscalability.com/blog/2012/4/17/youtube-strategy-adding-jitter-isnt-a-bug.html)), then start
  reporting in roughly once an hour.

### Network refreshes at a regular interval
- Nodes will keep themselves up to date in the network by pinging all the satellites in their trusted list every hour.
- Satellites will ping the nodes back to confirm their addresses.
    - If the ping is successful, the satellite will insert or update the node in the overlay cache and notify the node
      of success. Make sure to close the connection. Don’t use the transport observer to update the cache; update the
      IP and uptime directly.
    - If the satellite cannot confirm the node address, it does not proceed with updating the overlay cache. The node
      receives a log message and closes the connection when it times out.

### Remove Kademlia from the network, storj-sim, and testplanet setups
- Remove Kademlia from the discovery package.
- Remove the bootstrap node (work with Ops).
- Remove the vouchers service and related tables.
- Work with QA to make sure storage nodes don’t crash on errors related to the elimination of Kademlia.
    - If they don’t update immediately, keep just the overlay.Ping RPC method; it will be much easier for a new satellite
      to work with both old and new storage nodes.

### Update whitepaper to address Kademlia removal and the addition of satellite opt-in
- Delete the audit gating design doc
- Update the wiki

## Implementation

- [Nodes should communicate with satellites directly](https://storjlabs.atlassian.net/browse/V3-2274)

- [Network refreshes at a regular interval](https://storjlabs.atlassian.net/browse/V3-2275)

- [Remove the overlay cache from transport observers](https://storjlabs.atlassian.net/browse/V3-2305)

- [Delete Kademlia](https://storjlabs.atlassian.net/browse/V3-2276)

- [Update Documentation](https://storjlabs.atlassian.net/browse/V3-2461)

## Future considerations

### Selected satellite management
- Currently, a storage node operator can input a list of satellite IDs and addresses into their configuration file on setup.
  Several tardigrade-level satellites are included by default.
- Next steps are to allow users to modify their selected satellites list through a web-based console.
- The satellite list will need to be stored in a SQL table or equivalent for persistence.
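
As an illustration of the configuration-file approach, the sketch below parses a trusted satellite list. The entry format (`id@host:port`, comma-separated) and the names `Satellite` and `ParseTrusted` are assumptions made for this example; this document does not specify the actual format.

```go
package main

import (
	"fmt"
	"strings"
)

// Satellite is one entry of the trusted satellite list (hypothetical type).
type Satellite struct {
	ID      string
	Address string
}

// ParseTrusted parses a comma-separated list of "id@host:port" entries.
// The format is an assumption for this sketch. Empty entries are skipped;
// malformed entries and duplicate IDs are rejected.
func ParseTrusted(list string) ([]Satellite, error) {
	var sats []Satellite
	seen := make(map[string]bool)
	for _, entry := range strings.Split(list, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		at := strings.Index(entry, "@")
		if at <= 0 || at == len(entry)-1 {
			return nil, fmt.Errorf("invalid satellite entry %q, want id@host:port", entry)
		}
		id, addr := entry[:at], entry[at+1:]
		if seen[id] {
			return nil, fmt.Errorf("duplicate satellite ID %q", id)
		}
		seen[id] = true
		sats = append(sats, Satellite{ID: id, Address: addr})
	}
	return sats, nil
}

func main() {
	sats, err := ParseTrusted("sat-1@us.example.test:7777, sat-2@eu.example.test:7777")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range sats {
		fmt.Printf("%s -> %s\n", s.ID, s.Address)
	}
}
```

The same `Satellite` records could later be loaded from a database table instead of a string, which would leave the rest of the node unchanged when the web console is introduced.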

### NodeID updates
- Is there anything that we should redesign regarding the nodeID and node data structures?
- Do we need the node dossier any longer?

### Retiring the Transport Observer
- Since the routing table and the overlay cache are two of the main features that use the transport observer, and we are
  moving to update the overlay cache directly, we can remove the transport observer and remaining dependent services.
- This would simplify uptime checks, and the node uptime column would be updated less frequently.

### Node -> satellite communication initiation
- To save resources and improve performance, satellites can have a “tip box” to receive UDP messages about new nodes.
- Each UDP message needs the address of the node, the certificate chain that identifies the node (and that the satellite
  should expect once it talks to the node), and a signature over both, made with the leaf private key of that certificate chain.
- Messages will ultimately be ignored if the difficulty of the computed ID isn't high enough, or if the node the satellite
  ends up talking to doesn't have the same node ID as the one computed from the tipster's certificate chain.