design/docs: add successful pingback to kademlia removal document (#2837)

2019-08-26 15:34:04 +03:00 · 2019-08-26 15:34:04 +03:00 · 36c9d569ff
commit 36c9d569ff
parent 051052307d
1 changed files with 83 additions and 58 deletions
--- a/docs/design/kademlia-removal.md
+++ b/docs/design/kademlia-removal.md
@@ -2,88 +2,113 @@

 ## Abstract

-This design document outlines the communication protocol between satellites and
-storage nodes, network refreshes, and kademlia removal given the satellite opt-in
-capability for storage nodes.
+This design document outlines the communication protocol between satellites and
+storage nodes, network refreshes, and Kademlia removal.

 ## Background

-Many peer-to-peer, decentralized systems employ the Kademlia implementation of a distributed hash table to allow for 
-locating peer nodes, exchanging messages and sharing data. However, due to the nature of our network, we only use Kademlia 
-for node discovery and address lookups given node IDs. This is useful when satellites don’t know about all the nodes in 
-the network and nodes are unfamiliar with all of the satellites in the network.
+Many decentralized systems use the Kademlia distributed hash table to find peers, exchange messages, and share data.
+However, Storj only needs it for node discovery and address lookups.
+Kademlia is useful when satellites don't know about all the nodes in the network and nodes don't know the satellites.

-With our recent business decision of simplifying the storage node operator user experience, we no longer require kademlia 
-for node discovery. In a solution called SNO-select, storage node operators manually select the satellites they want to work with 
-and satellites wait for storage nodes to work with them. The initial implementation of this solution allows SNOs to update 
-their trusted satellite list in a hardcoded configuration file, but future improvements will enable users to manage
-this list through a web console. 
+To improve the user experience, we decided to make working with satellites opt-in.
+Opt-in means each storage node operator selects the satellites they want to work with.
+As a result, storage nodes can notify satellites directly, without discovery.

-We will replace our Kademlia DHT and related entities with direct communication between satellites and storage nodes, 
-and keep the network fresh without kademlia node discovery and random lookups.
+The initial implementation has a hardcoded list of satellites in a configuration file.
+Future improvements would add capabilities to manage the list dynamically.
+
+As a result of all these decisions, we can remove Kademlia and replace it with direct communication.

 ## Design

-### Nodes reach out to satellites that they want to work with
- The satellites are listed in the trust package
- Nodes should communicate with satellites directly rather than using kademlia to traverse the network to find the address of a given ID.
- Storage nodes should notify satellites when they start up, wait a random amount of time (to add jitter 
-http://highscalability.com/blog/2012/4/17/youtube-strategy-adding-jitter-isnt-a-bug.html), then start reporting in roughly on the hour
+To replace Kademlia, we need to complete several things:

-### Network refreshes at a regular interval
- Nodes will keep themselves up to date in the network by pinging all the satellites in their
-   trusted list every hour.
- Satellites will ping the nodes back to confirm their addresses
-    - If it is successful, the satellite will insert or update the node in the overlay cache and
-       notify the node of success. Make sure to close the connection. Don’t use the transport observer to update the cache.
-       Update the IP and uptime directly.
-    - If the satellite does not confirm the node address, it does not proceed with updating the overlay cache. The node 
-    receives a log message and closes the connection when it times out.
+- replace storage node initial communication with the satellites,
+- replace network refreshing,
+- remove kademlia from services,
+- update documentation.

-### Disintegrate Kademlia from the network, storj sim and testplanet setups
- Remove kademlia from the discovery package
- Remove the bootstrap node - work with Ops
-  - Remove the vouchers service and related tables
-  - Work with QA to make sure storage nodes don’t crash on errors related to the elimination of Kademlia
- if they don’t update immediately ->  keep just the overlay.Ping rpc method, it will be much easier for a new satellite 
-to work with old and new storage nodes.

-### Update whitepaper to address kademlia removal and the addition of satellite opt-in
-  - Delete the audit gating design doc
-  - Update the wiki
+### Storage Node initial communication
+
+Storage nodes should connect to the satellites in their trusted list and notify them that they want to work with them.
+
+We need to ensure that we do not overload the satellite during upgrades.
+Hence, we need to add jitter for refreshes and initial communication.
+
+_For more information on jitter see http://highscalability.com/blog/2012/4/17/youtube-strategy-adding-jitter-isnt-a-bug.html ._
+
+### Network refreshing
+
+Storage Nodes keep themselves up to date in the network by pinging all the satellites in their trusted list. Refreshes happen every hour.
+Satellites, in response, will ping the nodes to confirm their address and ensure that the network is configured correctly.
+
+When a Satellite has successfully pinged the storage node, it will update the node's IP and uptime in the overlay.
+On failure, the satellite does not update the overlay and notifies the storage node of the failure.
+
+The Storage Node keeps track of this information, so that the Storage Node operator can notice the problem.
+
+We define a successful ping as follows. When a node with node ID `N` has contacted satellite `S` and claimed its address is `A`:
+
+1. `S` _must_ initiate a network connection `C` to address `A`
+2. `S` _must_ verify that the remote endpoint on `C` has a private key corresponding to public key/node ID `N`. (It would be sufficient to complete negotiation of an SSL session over `C` and then verify that the remote end is using the same certificate used by `N` in the initial incoming ping.)
+3. `S` _should_ verify that the remote endpoint on `C` agrees that its address is `A`. (This doesn't seem strictly necessary for security, but could prevent misconfigurations where a storage node operator runs multiple nodes with the same identity.)
+4. `S` _should_ respond to the initial incoming ping from `N` with the result of the pingback, so that the dashboard on that node can report whether the node can receive incoming connections.
+
+
+### Kademlia Removal
+
+Once we have replaced the necessary pieces, we can remove Kademlia from the codebase:
+
+- remove kademlia from discovery package,
+- remove bootstrap node and server, and
+- remove vouchers.
+
+During all of these removals, we need to ensure that existing nodes do not break during upgrades.
+It might be easier to replace endpoints with stubs, such that existing calls keep working until the network is fully upgraded.
+
+### Update Documentation
+
+We should update our documents with this major design change.
+We need to update the whitepaper, audit gating design document, and wiki.

 ## Implementation

 - [Nodes should communicate with satellites directly](https://storjlabs.atlassian.net/browse/V3-2274)
-
 - [Network refreshes at a regular interval](https://storjlabs.atlassian.net/browse/V3-2275)
-
 - [Remove the overlay cache from transport observers](https://storjlabs.atlassian.net/browse/V3-2305)
-
 - [Delete Kademlia](https://storjlabs.atlassian.net/browse/V3-2276)
-
 - [Update Documentation](https://storjlabs.atlassian.net/browse/V3-2461)

 ## Future considerations

-### Selected satellite management
- Currently, a storage node operator can input a list of satellite IDs and addresses into their configuration file on setup. 
-Several tardigrade-level satellites are included by default. 
- Next steps are to allow users to modify their selected satellites list through a web based console.
- The satellite list will need to be stored in a sql table or equivalent for persistence
+### Satellite Management User Interface
+
+Currently, a Storage Node Operator can specify the list of satellites in the configuration.
+Changing this configuration requires a restart, which is inconvenient.
+
+Next steps would be to have a satellite management interface in the web-based console.
+This means we need to store the satellite list in a database.

 ### NodeID updates
- Is there anything that we should redesign regarding the nodeID and node data structures? 
- Do we need the node dossier any longer?
+
+We should review whether we want to change NodeID or related data structures.
+We may be able to simplify them further. As an example, it might be possible to remove NodeDossier.

 ### Retiring the Transport Observer
- Since the routing table and the overlay cache are two of the main features that use the transport observer, and we move to directly 
-update the overlay cache, we can remove the transport observer and remaining dependent services.
- This would simplify uptime checks, and the node uptime column would be updated less frequently.

-### Node -> satellite communication initiation
- To save resources and improve performance, satellites can have a “tip box” to receive UDP messages about new nodes
- UDP messages need the address of the node, the certificate chain to be expected once talking to the node that identifies 
-the node, and a signature of the above things with the leaf private key of that certificate chain.
- Messages will ultimately be ignored if the difficulty of the computed ID isn't high enough or the node you end up talking 
-to doesn't have the same node id as the one computed from the tipster certificate chain.
+Transport observers currently update the routing table and the overlay during each connection.
+Since the routing table will be removed together with Kademlia, we can also simplify this design.
+
+We can update uptime without using hooks, which allows us to remove the transport observer and hooks from the codebase.
+It would also allow us to handle batching of uptime updates more clearly.
+
+### Performant Storage Node and Satellite Pinging
+
+We can reduce bandwidth and improve performance with UDP pinging.
+Satellites can have a “tip box” to receive UDP messages about new nodes.
+UDP messages need to be signed by the node and contain the address and the certificate chain.
+Node IDs with low difficulty are ignored.
+
+Satellites regularly ping back the storage nodes and use the same protocol as described above.
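
The successful-ping criteria listed under "Network refreshing" could be sketched roughly as follows in Go. This is a minimal illustration under stated assumptions, not the Storj implementation: `PingbackInput`, `PingbackResult`, and `EvaluatePingback` are hypothetical names, the dial and TLS handshake are assumed to have happened elsewhere, and treating an address mismatch (a _should_ in the list) as a failure is a choice made only for this sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// PingbackInput collects what satellite S observed for one pingback.
// All names in this sketch are hypothetical, not taken from the codebase.
type PingbackInput struct {
	ClaimedAddress  string // address A claimed by node N in its incoming ping
	IncomingCertDER []byte // leaf certificate N used on the incoming ping
	Dialed          bool   // step 1: S opened connection C to address A
	PingbackCertDER []byte // leaf certificate presented by the endpoint on C
	ReportedAddress string // address the endpoint on C says it has
}

// PingbackResult is what S would send back to N (step 4).
type PingbackResult struct {
	Success bool
	Reason  string
}

// EvaluatePingback applies steps 1-3 in order and returns the result that
// step 4 reports to the node. It assumes a completed TLS handshake on C, so
// a matching certificate implies possession of the matching private key.
func EvaluatePingback(in PingbackInput) PingbackResult {
	// Step 1: the connection to the claimed address must succeed.
	if !in.Dialed {
		return PingbackResult{false, "could not connect to claimed address"}
	}
	// Step 2: the endpoint must present the certificate N used initially.
	if len(in.IncomingCertDER) == 0 ||
		!bytes.Equal(in.IncomingCertDER, in.PingbackCertDER) {
		return PingbackResult{false, "certificate mismatch"}
	}
	// Step 3 (assumption: enforced here): the endpoint agrees on its address.
	if in.ReportedAddress != in.ClaimedAddress {
		return PingbackResult{false, "address mismatch"}
	}
	return PingbackResult{true, ""}
}

func main() {
	cert := []byte("example-leaf-certificate")
	res := EvaluatePingback(PingbackInput{
		ClaimedAddress:  "node.example.test:28967",
		IncomingCertDER: cert,
		Dialed:          true,
		PingbackCertDER: cert,
		ReportedAddress: "node.example.test:28967",
	})
	fmt.Println(res.Success, res.Reason)
}
```

Keeping the decision separate from the networking code makes it straightforward to return the same result to the node, so its dashboard can show whether incoming connections work.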