JT Olio b022f371d2 blueprint: tcp fastopen

Change-Id: I20fc843b916b5ae0b72acfa5a03cf75f211739c0

2023-01-17 20:35:47 +00:00


Noise over TCP (uplink to storage node)

Abstract

This design doc discusses how we can achieve a large connection setup performance improvement by replacing our use of TLS with Noise_IK.

This design doc is scoped to communication between the Uplink and storage nodes only.

Background/context

The problem

When compared to traditional datacenter storage platforms, Storj is significantly more exposed to issues arising from distance. In particular, whenever two peers need to exchange data in a round trip, the speed of light is a serious concern. In a datacenter, a round trip between two nodes may take 500 µs, or 0.5 ms. In the Storj context, a round trip to a node across the ocean may take 150 ms, 300 times slower. There is no way to fix this; it is a fundamental physical law.

Because datacenter round trips are so fast, many protocols have not been designed with an extreme allergy to round trips, but we must be. Every time we wait for a packet round trip, we are adding 150 ms, which, for certain operations, is our entire performance budget.

To do a small download, say 5KB for example, right now the Uplink will:

first establish a TCP connection with the Satellite (one packet round trip, TCP SYN, then TCP SYN-ACK),
then establish a TLS session over that TCP stream (another packet round trip, TLS Client Hello, then TLS Server Hello),
then send the request and receive a satellite response (a third round trip).
Now, the Uplink is finally able to start making requests to nodes! It has to do this all again.
first, establish a TCP connection with each node (packet round trip)
then, establish a TLS session over TCP (packet round trip)
then send the RPC request over DRPC (packet round trip)
THEN send the order (it's not part of the same request currently!!) (strictly speaking, this isn't inherently another round trip, though this behavior does significantly complicate using Noise, which requires a response to handshake packet 1 before sending packet 2).
finally, the node will start returning data.

If the Uplink is near the Satellite, the Satellite operations won't be the worst part (though, considering most Uplinks don't operate in the same datacenter as the Satellite, and the Satellite itself has inter-dc penalties due to Cockroach coordination, it's not cheap). But most node operations do have a large amount of traffic that goes over an ocean, and many node operations are not able to be effectively connection pooled (though we've tried). Because the overall request goes as fast as the slowest of 29 downloads, the likelihood that at least one of the required nodes has a high round trip cost is high, and so this is what is killing us.
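To make the "slowest of 29" point concrete, here is a rough sketch of the arithmetic. It assumes each node is independently "slow" with some probability p; both the independence and the p values are illustrative assumptions, not measurements from our network.

```go
package main

import (
	"fmt"
	"math"
)

// probAnySlow returns the probability that at least one of n nodes is slow,
// assuming each node is slow independently with probability p.
func probAnySlow(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// The p values below are made up for illustration.
	for _, p := range []float64{0.05, 0.10, 0.25} {
		fmt.Printf("p=%.2f: P(at least one slow node of 29) = %.2f\n", p, probAnySlow(p, 29))
	}
}
```

Even if only 5% of nodes were slow under this model, roughly three out of four requests would touch at least one of them.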

QUIC

Our first attempt to fix this was to roll out QUIC. QUIC is a TCP-like protocol built on top of UDP packets (TCP is a protocol built on top of IP packets). QUIC's solution is to combine the TLS and the TCP handshake into one.

Marton has proven that our QUIC project did indeed pay off, most of the time. QUIC makes our overall performance between peers significantly faster due to the elimination of just one handshake.

Unfortunately, having QUIC based on UDP means that slightly more requests fail than with TCP (perhaps bad node operator setup, middleware dropping packets, unclear). This means that our long tail cancelation has to wait for more nodes, which makes us more susceptible to slowness. So overall, QUIC has worse long tail variability, and is slower at higher percentiles.

It is currently not enabled by default. We still intend to use QUIC or UDP more generally, but perhaps we can do something else sooner.

Do we need the TLS handshake?

QUIC eliminates one handshake by combining the TCP and TLS handshakes, but having a TLS handshake at all is a bummer. The reason the TLS handshake exists is for both peers to exchange cryptographic information and confirm that they are talking to the right peer over the right protocol, doing a protocol negotiation.

But we don't need that, certainly for nodes!

Storage nodes are already checking in with Satellites, and this provides an opportunity for storage nodes to provide all of their cryptographic configuration settings (public key, etc) in advance. When the Satellite tells the Uplink which nodes to talk to, it can give the Uplink everything it needs to skip the cryptographic handshake.

So, we should try and eliminate the cryptographic handshake altogether. We don't need it, and we can avoid it, at least between Uplinks and nodes, in all cases.

Do we need any handshakes?

If we don't need the TLS handshake, do we really even need the TCP handshake? Also no. There's no reason the very first SYN packet for connection setup can't include the request data, so that the very second packet back (what would have been the SYN-ACK packet) can include the first bytes of the response. This would take our current situation of 3 round trips between the Uplink and each node to 1, which saves 100-150 ms for each round trip eliminated.

Including data in the SYN packet is precisely what TCP_FASTOPEN does, which we could enable, if we don't end up using something UDP-based.

For the purposes of this design doc though, suffice it to say that if this is our ideal, we need to eliminate the TLS handshake entirely in almost all cases.

What about the Satellite?

This design doc is focusing on the Uplink to storage node communication as a way to get a foot in the door. Solving Uplink to Satellite communication is more challenging, as we don't have a clean way of getting the Uplink the Satellite's current Noise public key in advance. That is of course interesting but saved for a future design doc. Satellite to storage node communications are not in the hotpath and are thus not as high priority.

Luckily, the Uplink to Satellite communication flow benefits the most from connection pooling.

Design

The Noise Framework

The Noise protocol framework is a relatively new entrant in the cryptographic communication space. Using cryptographic design primitives like the ratchet system Signal uses, the Noise protocol framework provides a tight, straightforward set of cryptographic building blocks for building your own TLS. In fact, the Noise protocol is what Wireguard uses.

A great intro to the Noise framework is this presentation by its creator: https://www.youtube.com/watch?v=3gipxdJ22iM. The design doc is also fantastic: https://noiseprotocol.org/noise.html

The Noise framework prioritizes simplicity and security above all else.

One of the ways the Noise framework radically reduces complexity is by eliminating all of the negotiation features that TLS has. In TLS, the client and the server negotiate on which cryptographic primitives can and should be used for the session, and the client and server can choose based on what's available between peers.

Noise does away with this. When you implement Noise, you pin down your algorithms. Since Noise is a framework, here are some example Noise protocols within the framework:

Noise_XX_25519_AESGCM_SHA256
Noise_N_25519_ChaChaPoly_BLAKE2s
Noise_IK_448_ChaChaPoly_BLAKE2b

You choose one of these at connection dial time, and then there are no places in the code that allow a session to negotiate or switch. Both peers must agree on the protocol in advance.
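Since the protocol name fully pins the handshake pattern, DH function, cipher, and hash, it can be split into its parts mechanically. The sketch below does only that; the type and function names are hypothetical, and it does not validate that the individual components are ones the Noise spec defines.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ProtocolName holds the four pinned choices encoded in a Noise protocol name.
type ProtocolName struct {
	Pattern string // handshake pattern, e.g. "IK" or "XX"
	DH      string // Diffie-Hellman function, e.g. "25519"
	Cipher  string // AEAD cipher, e.g. "ChaChaPoly"
	Hash    string // hash function, e.g. "BLAKE2b"
}

// ParseProtocolName splits a name such as Noise_IK_25519_ChaChaPoly_BLAKE2b.
func ParseProtocolName(name string) (ProtocolName, error) {
	parts := strings.Split(name, "_")
	if len(parts) != 5 || parts[0] != "Noise" {
		return ProtocolName{}, errors.New("not a Noise protocol name: " + name)
	}
	for _, p := range parts[1:] {
		if p == "" {
			return ProtocolName{}, errors.New("empty component in: " + name)
		}
	}
	return ProtocolName{Pattern: parts[1], DH: parts[2], Cipher: parts[3], Hash: parts[4]}, nil
}

func main() {
	p, err := ParseProtocolName("Noise_IK_25519_ChaChaPoly_BLAKE2b")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```

Two peers can talk only if every one of these fields matches, which is why there is nothing left to negotiate at connection time.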

The Noise framework also lets you choose between a set list of potential "handshake patterns" which describe what is exchanged in packets, when, and what security properties you get. There are many possible handshake patterns, but for full duplex protocols, the Noise authors recommend XX or IK only. XX is much like TLS in that there is a handshake to exchange keys first. IK is a 0-RTT handshake and requires the keys exchanged in advance, which we can do. Wireguard uses Noise_IK. Noise_IK is the handshake pattern that solves our problems here.

The one (and only?) downside to Noise_IK is that it opens us up to replay attacks. See the Open issues section.

We want to use Noise_IK. We probably want to benchmark and pick between:

Noise_IK_25519_ChaChaPoly_BLAKE2b
Noise_IK_25519_AESGCM_BLAKE2b

I think we're convinced BLAKE2b is faster and better for our cases than BLAKE2s, SHA256, and SHA512, but I'm not convinced (given the existence of accelerated encryption hardware) that ChaChaPoly is better than AESGCM, which we use everywhere else.

TCP vs UDP

We already have a project (QUIC) that seeks to eliminate the TCP handshake, and TCP in general, and prepare the network for using other UDP-based protocols. It has been a mixed bag. We should keep working on it! We have struggled to enable QUIC by default due to these UDP-related issues.

This project is trying to take a parallel approach as a next step - what happens if we keep TCP? Can we sidestep our UDP issues but still eliminate a handshake in a robust way?

So, this project is going to be based on TCP. A later project may be to swap TCP for another UDP based protocol (unencrypted QUIC, UDT, something else) or to try and improve TCP using a technique like TCP_FASTOPEN, but we can do that after getting all the cryptography correct for this one.

Data flow

The broad picture is that when storage nodes check in, they will submit their Noise public key and Noise configuration to the Satellite. When the Uplink asks the Satellite who to talk to, the Satellite can return the Noise information to the Uplink, and the Uplink can establish 0-RTT Noise_IK connections.

In tests for small files, this led to a consistent 1.5x overall speedup, which is huge. With TCP_FASTOPEN in addition, the savings went to over 2.8x.

Noise_IK requests are at risk of replay attacks, so we don't want to enable them by default everywhere. We need to audit each request for idempotency before enabling it, but at least initially, Upload and Download requests from Uplinks to Nodes would be exceptionally high value.

Uploads require that the Uplink is able to validate cryptographically that the peer it is talking to is the node id in question. We get that with TLS but not with Noise DH25519 keys, so the Node will need to send an attestation, signed by its Node key, that the Noise key is indeed its public key at the end of the upload. This attestation can be precomputed and is thus fast.

Rationale

Seems good!

Implementation

Changes to the RPC server code

As a function of our migration from gRPC to DRPC, we already have a server-side demultiplexer system built in - drpcmigrate.ListenMux. Even more luckily, (and, actually, a situation planned with foresight) this multiplexing happens outside of the TLS stream, so we can use this same demultiplexer for differentiating between DRPC over TLS and DRPC over Noise.

The current prefix for DRPC over TLS is 8 bytes - DRPC!!!1. We can do a new prefix, DRPC!N!1, to indicate Noise.

publicNoiseDRPCListener = noiseconn.NewListener(
    publicMux.Route("DRPC!N!1"), p.noiseConf)
go p.public.drpc.Serve(ctx, publicNoiseDRPCListener)

The server will need to generate a static DH key and persist it somewhere, though it is okay if the DH key gets regenerated from time to time (process start is probably fine TBH).

The server will also need to choose the encryption algorithm, hashing algorithm, and DH curve to use.

It's worth mentioning that Noise has some guidance about cryptographic channel binding. If you want a node to attest that it is indeed the peer on the other end of a specific Noise session, the best way is for the node to sign that session's handshake hash rather than anything else. The handshake hash is exposed in this commit: d7ec1a08b0. See https://noiseprotocol.org/noise.html#channel-binding for more.

Potentially useful commits:

Changes to Node Address / NodeURL structures

Oh man, pb.Node, pb.NodeAddress, and pb.NodeTransport have rotted significantly from their original intentions. pb.NodeTransport still having GRPC flags in there gives some sense that we got this all wrong.

In fact, the original intention of pb.Node has been almost entirely superseded by NodeURL, which basically does the exact same thing as pb.Node.

The original intention of pb.Node was to specify a Node, along with the information you'd need to securely dial it. That is now a NodeURL, which is of the form

base58nodeid@host:port

This NodeURL, by virtue of having the node id, allows the client to know whether they are securely talking to the right peer (the node id is the hash of the validated certificate authority that signed the TLS leaf cert).

Unfortunately, the NodeURL as it stands is not enough to talk over Noise_IK, since the RSA keys referenced by the NodeID cannot be used in Noise.

To be able to talk over Noise, the following things are needed:

The peer public key (32 bytes) (this is only needed for Noise_IK, so this could be optional for Noise_XX).
The peer's cipher suite selection (hashing, symmetric encryption, and Diffie-Hellman)
The peer's handshake pattern (we should always use IK for Nodes, but perhaps we'll want to support XX for Satellites).

To prevent a malicious Satellite from replacing Noise public keys with something else, we'll additionally need

A signature, from the node's certificate chain and public key, attesting that the Noise key is correct. Unfortunately, for validation, this requires also carrying around the node's certificate chain, so this is likely too large to include in a Node Address and will need to be something we provide on demand. It's unclear how much of a threat a malicious Satellite could even be, considering how much trust the Uplink puts in the Satellite.

Independent of anything else, pb.Node/pb.NodeAddress/pb.NodeTransport should be refactored so that it is a 1-1 match with NodeURL. A NodeURL should be able to be represented efficiently (not base58) as a pb.Node protobuf, and human-readable as a NodeURL. There should be a lossless conversion routine that converts a pb.Node to a NodeURL and back again.

At some future point, I'd also suggest that we should rename NodeURL to something else since we're abusing URI syntax at best, but for now let's stick with the NodeURL name to reduce churn.

This all implies some cleaned up types. Here's the new protobuf version of a NodeURL:

message NodeURL {
  // the node id, not encoded
  bytes id = 1;

  // the address for communication with the node. This address must support
  // IPv4 TCP connections (and should support IPv4 UDP connections).
  // Address here implies a host/port pair joined via net.JoinHostPort style
  // logic.
  string address = 2;

  // noise settings. If provided, the node may support noise handshakes instead
  // of TLS over TCP or UDP.
  enum NoiseProtocol {
    NOISE_UNSET = 0;
    NOISE_IK_25519_CHACHAPOLY_BLAKE2B = 1;
    NOISE_IK_25519_AESGCM_BLAKE2B = 2;
  }
  NoiseProtocol noise_proto = 3; // this is explicitly not a set.
  bytes noise_pk = 4;
}

// This type is a note signed by a node that this public key is what they are
// using. NoiseSessionAttestation should be used instead where possible.
message NoiseKeyAttestation {
    bytes node_id = 1;
    bytes node_certchain = 2;
    bytes noise_public_key = 3;
    uint64 timestamp = 4;
    bytes signature_of_public_key_and_timestamp = 5;
}

// This type is a note signed by a node that this active Noise session really
// has them on the other end.
message NoiseSessionAttestation {
    bytes node_id = 1;
    bytes node_certchain = 2;
    bytes noise_handshake_hash = 3;
    bytes signature_of_handshake_hash = 4;
}

The intention of NodeTransport was to have a list of transports that the node understood, so that clients could use newer transports on nodes that supported them, but in practice this field has just fallen into complete disuse and we've managed those issues with requiring recent versions on all nodes instead. That said, it may still make sense to have a list of supported protocols in the Node structure.

A pb.NodeURL can be serialized into a string NodeURL as follows:

base58nodeid@address?noise_proto=1&noise_pk=base58_noise_public_key

This should be backwards compatible with existing serialized NodeURLs. We should evaluate if this format is parsable by existing NodeURL parsing code (assuming it throws away query parameters).
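To check the backwards compatibility claim, here is a sketch of a parser and serializer for the string form. It is not the existing NodeURL parsing code; the struct and function names are hypothetical, and the node id and Noise public key are kept as opaque base58 strings rather than decoded.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// nodeURL is a hypothetical in-memory form of the string NodeURL.
type nodeURL struct {
	ID         string // base58 node id, left encoded
	Address    string // host:port
	NoiseProto int    // 0 means unset
	NoisePK    string // base58 Noise public key, left encoded
}

func parseNodeURL(s string) (nodeURL, error) {
	id, rest, ok := strings.Cut(s, "@")
	if !ok || id == "" {
		return nodeURL{}, errors.New("missing node id")
	}
	addr, query, _ := strings.Cut(rest, "?")
	if addr == "" {
		return nodeURL{}, errors.New("missing address")
	}
	u := nodeURL{ID: id, Address: addr}
	values, err := url.ParseQuery(query)
	if err != nil {
		return nodeURL{}, err
	}
	if v := values.Get("noise_proto"); v != "" {
		if u.NoiseProto, err = strconv.Atoi(v); err != nil {
			return nodeURL{}, fmt.Errorf("bad noise_proto: %w", err)
		}
	}
	u.NoisePK = values.Get("noise_pk")
	return u, nil
}

// String omits the query entirely when no Noise information is set, so
// legacy NodeURLs round-trip unchanged.
func (u nodeURL) String() string {
	s := u.ID + "@" + u.Address
	if u.NoiseProto == 0 {
		return s
	}
	return s + "?noise_proto=" + strconv.Itoa(u.NoiseProto) + "&noise_pk=" + url.QueryEscape(u.NoisePK)
}

func main() {
	for _, s := range []string{
		"1AbC@node.example:28967",
		"1AbC@node.example:28967?noise_proto=1&noise_pk=3yZe7d",
	} {
		u, err := parseNodeURL(s)
		fmt.Printf("%+v %v round-trips=%v\n", u, err, u.String() == s)
	}
}
```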

AddressedOrderLimits should return filled in *pb.NodeURL.

Changes to the RPC client code

Dialing to a node should take this new NodeURL structure, along with whether the request is replay-attack safe.

DialNode(ctx context.Context, node *pb.NodeURL, replay_safe bool) (Conn, error)
Validate(node *pb.NodeURL, attestation *pb.NoiseKeyAttestation) (error)
ValidateSession(node *pb.NodeURL, attestation *pb.NoiseSessionAttestation) (error)

(If replay_safe is false, Noise_IK should not be used).

As an aside:

The rpc.Connector situation is a mess. For example:

TCPConnector's DialContextUnencrypted adds the DRPC specific header, which will make changing the header based on different Encryption strategies (TLS vs Noise) challenging.
HybridConnector does too much (premature generalization).
If someone provides their own DialContext, the Connector interface doesn't allow for the net.Dialer.Control style of adding something like TCP_FASTOPEN. Config.DialContext is unfortunately just one step removed from Dialer.Control, which means that if someone provides a Config.DialContext, then our library can no longer call Setsockopt on the socket before dialing. If we want to flexibly add TCP_FASTOPEN to most requests, then we need to be able to call Setsockopt before the connect() syscall happens, and that's only possible if you set Dialer.Control before DialContext is called.

We should get rid of all of the connector flexibility and have a single, unconfigurable dialer type that dials with common/socket's BackgroundDialer.
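The Dialer.Control point is easier to see in code. The sketch below sets the fast open option on the socket before connect() is called. It is Linux-only (syscall.SetsockoptInt has a different signature elsewhere), the constant is the Linux value of TCP_FASTOPEN_CONNECT defined locally because the syscall package does not export it, and the helper name is hypothetical. This is not what common/socket's BackgroundDialer does today.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// tcpFastOpenConnect is the Linux value of TCP_FASTOPEN_CONNECT (linux/tcp.h).
// It is defined here as an assumption for this sketch.
const tcpFastOpenConnect = 30

// newFastOpenDialer returns a dialer whose Control hook runs after the socket
// is created but before connect(), which is the only point where the option
// can still take effect.
func newFastOpenDialer() *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			return c.Control(func(fd uintptr) {
				// Best effort: if the kernel rejects the option, the dial
				// falls back to an ordinary TCP handshake.
				_ = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, tcpFastOpenConnect, 1)
			})
		},
	}
}

func main() {
	d := newFastOpenDialer()
	fmt.Println("control hook installed:", d.Control != nil)
}
```

A caller that supplies its own DialContext never goes through this hook, which is the limitation described above.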

Going forward, you should be able to ask an RPC dialer pool to:

DialNode(ctx context.Context, node *pb.NodeURL, replay_safe bool) (Conn, error)

and get back a valid Conn. The logic inside DialNode should:

Consider the QUIC rollout state from common/rpc/quic_rollout.go. If QUIC is enabled, we should use that. (QUIC is expected to be disabled for now.)
If QUIC is disabled, but replay_safe is true and the pb.NodeURL has Noise information, the dial should happen over TCP over Noise.
Otherwise the dial should happen over TCP over TLS

The RPC pool needs to keep track of QUIC, TLS, and Noise connections separately. In particular, Noise connections should be identified by the Noise public key and Noise protocol from the pb.NodeURL Noise protocol enum.
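The selection rules and the pool keying can be sketched as follows. All type and function names here are hypothetical; the real DialNode would take a *pb.NodeURL and consult the QUIC rollout state rather than plain booleans.

```go
package main

import "fmt"

type transport int

const (
	transportTLS transport = iota
	transportNoise
	transportQUIC
)

func (t transport) String() string {
	return [...]string{"tcp+tls", "tcp+noise", "quic"}[t]
}

// chooseTransport applies the rules in order: QUIC if enabled, then Noise if
// the request is replay safe and the node advertised Noise information,
// otherwise TLS.
func chooseTransport(quicEnabled, replaySafe, hasNoiseInfo bool) transport {
	switch {
	case quicEnabled:
		return transportQUIC
	case replaySafe && hasNoiseInfo:
		return transportNoise
	default:
		return transportTLS
	}
}

// poolKey keeps connections of different transports apart. Noise connections
// are additionally distinguished by protocol and public key.
type poolKey struct {
	transport  transport
	address    string
	noiseProto int32
	noisePK    string // raw key bytes as a string, so the struct is comparable
}

func main() {
	fmt.Println(chooseTransport(false, true, true))
	fmt.Println(chooseTransport(false, false, true))
	pool := map[poolKey]int{}
	pool[poolKey{transportNoise, "node.example:28967", 1, "key-a"}]++
	pool[poolKey{transportNoise, "node.example:28967", 1, "key-b"}]++
	fmt.Println("distinct pool entries:", len(pool))
}
```

Keying on the Noise public key means a node that regenerates its key on restart will not be handed a pooled connection negotiated under the old key.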

Possibly useful commits:

https://review.dev.storj.io/c/storj/common/+/9219

Changes to DRPC

DRPC should gain a feature that allows corking outgoing sends (forcing all writes into a local buffer), and then uncorking, which tells the DRPC stream to send the buffer with the next send. This is important because our existing piecestore Download protocol sends two separate requests before the node can start returning data, and we want both requests to go into the initial Noise packet.

drpcstream.Options.ManualFlush is close, but we only want to change it for a single packet, and we are possibly getting a conn from the connection pool, so a per-conn ManualFlush is hard to use.
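The corking behavior we want can be illustrated with a plain io.Writer wrapper. This is not a DRPC API; corkWriter and countingWriter are hypothetical names, and the sketch is not safe for concurrent use.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// corkWriter buffers writes while corked. After Uncork, the buffered bytes
// are sent together with the next Write in one call to the underlying writer.
type corkWriter struct {
	w      io.Writer
	buf    bytes.Buffer
	corked bool
}

func (c *corkWriter) Cork()   { c.corked = true }
func (c *corkWriter) Uncork() { c.corked = false }

func (c *corkWriter) Write(p []byte) (int, error) {
	if c.corked {
		return c.buf.Write(p)
	}
	if c.buf.Len() == 0 {
		return c.w.Write(p)
	}
	c.buf.Write(p)
	_, err := c.w.Write(c.buf.Bytes())
	c.buf.Reset()
	if err != nil {
		return 0, err
	}
	return len(p), nil
}

// countingWriter records how many Write calls reach the "network".
type countingWriter struct {
	bytes.Buffer
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return c.Buffer.Write(p)
}

func main() {
	network := &countingWriter{}
	stream := &corkWriter{w: network}

	stream.Cork()
	stream.Write([]byte("[download request]"))
	stream.Uncork()
	stream.Write([]byte("[order]"))

	fmt.Println("writes to network:", network.calls)
	fmt.Println("payload:", network.String())
}
```

With both messages leaving in a single write, they have a chance of landing in the same initial Noise packet.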

Possibly useful commits:

Changes to the storage node

The storage node, using the new RPC server side code, should have Noise configuration generated. It should submit this Noise information as part of its contact checkin. It should provide a NoiseSessionAttestation.

Nodes should send NoiseSessionAttestations at the end of uploads so the Uplinks can do node id piece validation.

Nodes should be extended to check if the initial Download request has an order embedded and use it if so.

Possibly useful commits:

Changes to the Satellite

On contact checkin, the Satellite should check for Noise information and validate a NoiseSessionAttestation. If the NoiseSessionAttestation is valid, the Satellite should persist the *pb.NodeURL information.

Note that a Node may submit a DNS hostname as opposed to a specific IP address. Because we don't want uplinks to stress their local DNS resolution, the Satellite should perform recursive A and AAAA lookups for any hostname here and cache the results. Uplinks should not expect DNS resolution for NodeURLs.

The upload and download selection caches should retrieve the *pb.NodeURL information and add them to the AddressedOrderLimit structs that are sent to Uplinks.

Possibly useful commits:

Changes to the Uplink

Uplinks should be extended to use the *pb.NodeURL from the AddressedOrderLimits and use the rpc Dialing that uses those.

Our existing piecestore Download protocol sends two separate requests before the node can start returning data, and we want both requests to go into the initial Noise packet. So, Uplinks should use a new DRPC feature to cork the first Download request RPC send, so that the Download request goes out when the Uplink sends the actual order request and both get written to the first Noise packet.

An alternative strategy would be to update the Uplink to send the first order as part of the first request, but this is not backwards compatible with old storage nodes.

Possibly useful commits:

Other options

TLS session resumption

Instead of Noise, we could still use TLS. TLS 1.3 has a feature called session resumption. Session resumption negotiates a key after a first connection that can be reused for zero roundtrip session setup if both peers remember each other. The downside of this is that we wouldn't get zero roundtrip for the first connection.

Erik asked if perhaps the Satellite could establish these keys in advance and simply hand off the SSL connection session resumption information to the node. This seems possible in theory. Open questions for me: would this work in general for more than one connection? Is it safe to reestablish many connections to a node from multiple Uplinks? Would we have to negotiate session resumption information in advance per Uplink? There are a lot of cryptographic unknowns here for me, but in principle this is essentially forcing TLS into the Noise_IK shape.

Overall, I'm worried about how unusual this is, vs Noise IK, where what we're doing is what it is designed for.

QUIC session resumption

QUIC session resumption is exactly TLS 1.3 session resumption.

Multiaddresses

We might want to consider migrating to https://github.com/multiformats/multiaddr instead of NodeURLs. The challenge I see with multi-addresses is they seem to let the multi-address specify much more about the connection than we want to allow (whether or not TLS is used, what protocols are used, etc). The only parts of the multi-address we want are the parts that get us to having a valid IP packet host, and maybe whether the peer has open TCP or UDP ports. Multi-addresses do much more than that, and so we would be in a position where we are restricting what we use in multi-addresses, though I suppose that's not much different than URLs in general. I don't know what to do here other than that I have NIH.

Wrapup

Related work

github.com/jtolio/noiseconn

To get my proof of concept working, I wrote a library that is a useful net.Conn wrapper that uses Noise. github.com/jtolio/noiseconn has good performance and has been tested as part of the proof of concept for this project.

Tracing

https://review.dev.storj.io/c/storj/storj/+/9229

TCP Fast Open

https://github.com/storj/storj/blob/main/docs/blueprints/tcp-fastopen.md

Separated UDP address support

Cleaning up NodeURL is an opportunity to add more clarity around UDP addressing. Here are some thoughts:

We could have the pb.NodeURL structure maintain a separate UDP address in addition to the TCP address in there, or perhaps just a separate port. There's no strict requirement that a node use the same port for UDP and TCP, though of course that is what we currently require. Changes to pb.NodeURL are the right place to add this support, but it should likely not be part of this blueprint.

IPv6 support

Cleaning up NodeURL is an opportunity to add more clarity around IPv6 support. Here are some thoughts:

Because customers might move from IPv4 to IPv6 networks and back, every storage node must be reachable by every network. At the very least, this means every storage node must be reachable over IPv4 (IPv6 only networks should be able to hit IPv4 nodes over gateways). It's fine if nodes also support IPv6 such that IPv6-supporting clients can reach IPv6 nodes over IPv6 natively, but if data is uploaded to IPv6, we don't want it to be only available if the client is on an IPv6 supporting network. So, unless a client explicitly opts into their data potentially only being available over IPv6, every storage node must support IPv4.
For IPv6 support, we could either have the NodeURL list the IPv6 address in addition to the IPv4 address, much like the potential additional UDP information, or we could require that IPv6 node operators get a DNS entry that has both A and AAAA records, and then the Satellite fills in the appropriate address in returned *pb.NodeURLs included in AddressedOrderLimits, based on what the Uplink requested.

Again, probably not part of this blueprint.

Open issues for future work

We need to double check that uploads and downloads are replay attack safe and make them so if not. Order serial numbers should protect against this.
We should evaluate what other commands are replay attack safe. Exists, RestoreTrash, Retain, and DeletePieces do not have serial checking, but are only made by Satellites. We may want to ensure these methods are not available over Noise_IK. DeletePieces is likely the only performance sensitive call here to consider.
We should have RPC clients keep a cache of Noise public key attestations. We won't have the Satellite public key initially, but perhaps if an RPC client has spoken with a Satellite before, the Satellite could have provided a NoiseKeyAttestation, and thus future connections could be over Noise. This would be especially useful for the Gateway-MT. We would need to audit which Satellite requests are replay attack safe.
This may be more of an issue for TCP_FASTOPEN, but we should double check that 0-RTT connection establishment doesn't open us up to amplification attacks (could an attacker spoof N bytes to us and get us to send >N bytes to a third party?)
How do we improve Uplink to Satellite communication? Access grants have Satellite node IDs, but not Noise public keys. Is it a good idea to put Noise public keys in Access grants? Should we use Noise_XX and deal with the handshake? That doesn't save us much, but perhaps our Noise ciphers are faster than TLS?
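On the first open issue, the way serial numbers defeat a replayed 0-RTT request can be sketched as a "seen before" check. This is a toy model and not the storage node's actual order handling: the names are hypothetical, and a real implementation would also expire serials along with their orders instead of keeping them forever.

```go
package main

import (
	"fmt"
	"sync"
)

// serialTracker remembers which order serial numbers have been used.
type serialTracker struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSerialTracker() *serialTracker {
	return &serialTracker{seen: make(map[string]struct{})}
}

// Use reports whether the serial is fresh and marks it as used. A replayed
// request carries a serial that was already marked, so it returns false.
func (s *serialTracker) Use(serial []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := string(serial)
	if _, replayed := s.seen[key]; replayed {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

func main() {
	tracker := newSerialTracker()
	serial := []byte("order-serial-0001")
	fmt.Println("first use accepted:", tracker.Use(serial))
	fmt.Println("replay accepted:", tracker.Use(serial))
}
```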
