Unfortunately, because QUIC runs over UDP, slightly more requests fail than
with TCP (the cause is unclear; perhaps bad node operator setups or
middleboxes dropping UDP packets). This means that our long tail cancelation
has to wait for more nodes, which makes us more susceptible to slowness. So
overall, QUIC has worse long tail variability, and is slower at higher
percentiles.
Would it make sense to have uplink start 35+n QUIC connections, then drop the n slowest to finish the TLS handshake, leaving the usual 35 to actually race for downloads? In essence, this stops the race early for some contestants before investing actual I/O and bandwidth resources in the data transfer.
This idea assumes that if there will be problems during the actual download, they're likely to show up during the connection initiation phase as well. The connection initiation would then act as an estimate of how well a given connection will perform later.
So maybe QUIC is salvageable this way without introducing a bigger change to the code? Not that I dislike the Noise protocol from the document: being able to get rid of both handshakes does sound promising! But I'm assuming Noise would take much more engineering time.
Great idea about letting the race happen with the handshake only. We’ve actually talked about that for a while, and yeah as you point out, the main downside is you can’t do that anymore if you eliminate handshakes entirely.
I still think QUIC is salvageable, but perhaps by using it off-label, without TLS.
To your point about engineering time, the Noise blueprint actually references a number of Gerrit commits, which constitute a full implementation. We actually already have Noise working in a test environment! So, instead of having to build this all for the first time, we will only need to make it match what ends up being accepted from a blueprint perspective.
In the long term, we also have an ongoing effort to allow for dynamic long tail generation/cancelation. Even without handshakes, we still would like to avoid wasting resources that will ultimately get canceled, so we're working on a technique that keeps track of which connections in a set we're still waiting on, along with heuristics for whether we should add more. That's actually also fairly far along; we'll have something to show there soon too.
Thanks for this feedback! I expect SNOs in particular will have much more feedback on the TCP_FASTOPEN design doc once that’s in, so please watch out for that too.
Uploads require being able to validate cryptographically that the peer being
talked to matches the node id in question. We get that with TLS, but we
won't get it with Noise 25519 DH keys, so the node will need to send, at the
end of the upload, an attestation signed by its Node key that the Noise key
is indeed its public key. This attestation can be precomputed and is thus
fast.
If I understand correctly, this means the uplink is sending data before definitive verification that it is talking to the intended node. Probably a limited concern, since all the data is just an encrypted blob, but that makes me feel a little icky (probably not rational, though).
Noise_IK requests are at risk of replay attacks, so we don't want to enable
them by default everywhere. We need to audit each request for idempotency
before enabling it, but at least initially, enabling it for Upload and
Download requests from Uplinks to Nodes would be exceptionally high value.
When an uplink requests a transfer from the satellite, it could include a validity timestamp signed by the satellite, after which nodes won't accept the transfer anymore. This could at least shrink the vulnerability window for replay attacks. The downside is that the client itself also has limited time to perform the request.
You're right! But also, as you say, the data is encrypted to the storage node's key, so an attacker would need the storage node's key to decrypt it, which in some sense provides the secrecy you'd expect if you only wanted to send the data to that node.
This is a good idea, and assuming there isn't too much clock drift in the network, it's probably worth adding to some of the commands that aren't already protected by Orders/OrderLimits, which already provide essentially this functionality.