docs/blueprints: update tcp_fastopen

Change-Id: I8d9121c38853103b07514d8e55c3bf06049283fd
parent fb12717256
commit 0208eb7693

## Abstract

This design doc discusses how we can achieve a large connection setup
performance improvement by dialing TCP connections with and without
TCP_FASTOPEN enabled, selectively choosing the connection that established
the fastest.

## Background/context
out TCP_FASTOPEN, especially as the IANA's allocated option number differed
from the experimental number. But it's 2023, and as far as I can tell
everything uses the IANA-allocated number now, so this is fine.
* Tracking concerns: We use TLS connections with certificates (or Noise
  connections with stable public keys), even with Uplinks. Any tracking that
  TCP_FASTOPEN enables is likely already possible. This is a concern we should

[...]
trying to use them! But this still affects us significantly, and HTTP/3 and
TLS 1.3 do not help us enough for us to not remain interested in this.

However, the fourth bullet point is a huge sticking point for us:

* Middleboxes: middleboxes drop TCP packets they don't understand, which means
  some connections will appear to die and never work when trying to use
  TCP_FASTOPEN. See "preliminary tests" below.

### Preliminary tests

So, I tried enabling this on a global test network of Uplinks, Satellites, and

[...]

And I added these patchsets:

* https://review.dev.storj.io/c/storj/storj/+/9251 (private/server: support tcp_fastopen)

It worked great! Awesome even! It shaved 150ms off most operations.

My review: A+, would enable again.

So, to get a better understanding of community support, we submitted an
earlier version of this design draft with a [Python script to assist
testing](https://gist.github.com/jtolio/57af177ae1fe5f0f255214b2c2ef90a1),
and failures were widespread. Many storage node operators had difficulty
using TCP packets with the FASTOPEN headers, and a wide variety of
indeterminate issues showed up. Lots of connections timed out and died.
Some operators had success, but many did not.

Worse, storage node operators pointed out that they have no incentive to
enable TCP_FASTOPEN, since it could reduce the number of uploads they would
otherwise receive for any upload that uses a network path that drops
TCP_FASTOPEN packets. We need storage node operators to enable kernel
support, so we need a downside-free option for storage node operators.

So we can't rely on TCP_FASTOPEN working by itself.

## Design and implementation

But we can just try both, like we do with QUIC!

Here's the broad plan: we are going to dial both standard TCP and
TCP_FASTOPEN in parallel for every connection and then pick whichever one
established faster. This will double the number of dials we do, but won't
increase the number of long-lived connections.

It turns out this is actually pretty complicated when Noise is in the
picture as well. With TCP_FASTOPEN and Noise together, the very first packet
needs to carry request data, which means that we need to duplicate the
request down to both sockets. This might not be safe for the application.

We're going to focus on enabling this for both TLS and Noise, but not
for unencrypted connections.

So, the moving parts for this to work are:

* Server-side socket settings and kernel config
* Server-side request debouncing
* Keeping track of what servers support debouncing
* Duplicating dials and outgoing request writes (but not more than that!),
  and selecting a connection
* Client-side socket settings

### Socket settings and kernel config

So, we have to call Setsockopt on storage nodes and Satellites on the
listening socket to enable TCP_FASTOPEN, and we may need to tell the kernel
with `sysctl` to allow servers to use TCP_FASTOPEN.

The first step is accomplished with a small change [...]
an operator step that we will need storage node operators to do and include
in our setup instructions.

The kernel step is done by running:

```
sysctl -w net.ipv4.tcp_fastopen=3
```

This may also need to be persisted in `/etc/sysctl.conf` or `/etc/sysctl.d`.
### Server-side request debouncing

The two types of messages we might see come off the network on a new socket
are a TLS client hello or a first Noise message.

In either case, we can quickly hash that message and check whether we've
already seen that hash. If we have, this duplicate message came in second,
and we can close the socket and throw the message away. Note that this does
not provide any security guarantees or replay-attack safety.

All TCP packets are subject to the IP
[TTL field](https://en.wikipedia.org/wiki/Time_to_live). In IPv4, it is
designed as a maximum time that a packet might live in the network, and it
should not last longer than something like 4 minutes. In practice, however,
the TTL field does not consider time and has usually been implemented as
more of a hop limit; IPv6 has been updated to reflect this use. All that
said, we can deduplicate practically all second messages we receive with a
small cache whose memory is on the order of 10 minutes.

Because it is not provably all packets, we should only use TCP_FASTOPEN
with Noise on replay-safe requests (which we must do anyway with Noise_IK),
and we should only duplicate the TLS client hello with TCP_FASTOPEN and
not duplicate anything beyond it.

In summary: a small cache that considers the first message hashes,
specifically of the Noise first packet or the TLS client hello, rejecting
duplicates, should be all we need here.

Relevant reviews:

* https://review.dev.storj.io/c/storj/storj/+/9763

### Keeping track of what servers support debouncing

Once a server (node or Satellite) supports message debouncing, we need to
keep track of that, so we don't overwhelm nodes that don't know how to
handle duplicate messages efficiently.

This is pretty straightforward: we keep track of debounce support per node
in the Satellite DB.

At some later point, we will simply assume that all Satellites support
debouncing.

Relevant reviews:

* https://review.dev.storj.io/c/storj/common/+/9778
* https://review.dev.storj.io/c/storj/storj/+/9779
* https://review.dev.storj.io/c/storj/uplink/+/9930

### Duplicating dials and outgoing request writes

This is the hardest part of this plan.

Whenever we need a new connection to a peer, we are going to:

* Immediately return a handle to a multidialer. Dialing becomes a no-op,
  and we start dialing in the background.
* In the background, dial one connection with standard TCP.
* Also in the background, dial a separate connection with TCP_FASTOPEN
  enabled. We will do this initially on Linux only, with
  TCP_FASTOPEN_CONNECT, and then figure out how to use TCP_FASTOPEN. See
  the platform-specific issues section about TCP_FASTOPEN_CONNECT.
* Once something tries to write, we copy the requested bytes and write
  them to both sockets, once ready.
* The first time we need to read data is the cutoff point: we stop copying
  writes to both sockets. The first read to return data determines the
  connection we select, and we close the other one.

Relevant reviews:

* https://review.dev.storj.io/c/storj/common/+/9858
* https://review.dev.storj.io/c/storj/uplink/+/9859

### Client side socket settings

Clients are easier, as they evidently don't need the `sysctl` call. They can
be implemented by calling Setsockopt on the sockets before the TCP dialer
calls connect, which Go now has functionality to allow (the Control option
on the net.Dialer). It can be done like this:
https://review.dev.storj.io/c/storj/common/+/9252

(but rolled into other changes)

You can see if TCP_FASTOPEN is working on the client side by running:

```
ip tcp_metrics show|grep cookie
```

### Consideration for clients

Clients may not be inside a network topology that allows for TCP_FASTOPEN.
A previous version of this design required the client to detect this and
disable the feature, but by dialing both ways, this should now work
transparently for the client.

### Consideration for SNOs

SNOs have a strong incentive to enable TCP_FASTOPEN: our upload and
download races prefer nodes that finish faster, and TCP_FASTOPEN eliminates
hundreds of milliseconds of penalty, so nodes that enable it will win many
more upload/download races. We should help node operators set up and
configure TCP_FASTOPEN.

A previous version of this design required SNOs to consider whether
enabling TCP_FASTOPEN was advantageous given their network topology. With
the current design, it always is: TCP_FASTOPEN is only attempted when both
the client and the server signal support, so SNOs who keep it disabled
won't cause complete failures for clients that try.

### Platform specific issues