Notes on storage node performance optimization on ZFS

With the current traffic, IMO caching is not really necessary, especially L2ARC or similar (though my server has a lot of RAM, so maybe that’s why). However, if the traffic were a lot higher, the hard drives might not be able to keep up, and then caching would help keep the bandwidth up.


Regardless of traffic, there is a minimum latency due to HDD seek time: 5–10 ms. Compared to a network latency of 10–20 ms this is not an insignificant amount; essentially, removing the seek time cuts the time to first byte roughly in half. How significantly this affects race winning needs to be measured.
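A back-of-the-envelope sketch of that halving claim, taking seek at the top of its range and network at the bottom (where the halving is exact); all numbers are illustrative, not measured:

```python
seek_ms = 10.0     # HDD seek time, upper end of the 5-10 ms range
network_ms = 10.0  # network latency, lower end of the 10-20 ms range

# Time to first byte: the first byte must wait out the seek, unless cached.
ttfb_hdd = network_ms + seek_ms
ttfb_cached = network_ms  # cache hit: no seek

print(f"TTFB with seek:  {ttfb_hdd:.0f} ms")    # 20 ms
print(f"TTFB from cache: {ttfb_cached:.0f} ms") # 10 ms, i.e. half
```

With a 20 ms network and 5 ms seek the saving shrinks to 20%, so where you sit in both ranges matters.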

I agree that with more traffic, when IO queues are not empty, this time will multiply, and therefore caching will become more impactful. On shared machines, however, this may already be the case due to other activity.


So the client downloading the data adds a not-insignificant overhead, and so does the network. Interesting.

True, but if bad peering is limiting bandwidth: let's say a 2 MB chunk, but bandwidth of only 160 Mbit/s; that takes 100 ms to download.
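The arithmetic behind the 100 ms figure (a sketch; the chunk size is taken as decimal megabytes, which is how the numbers work out):

```python
chunk_bytes = 2_000_000      # 2 MB chunk
bandwidth_bps = 160_000_000  # 160 Mbit/s link

transfer_ms = chunk_bytes * 8 * 1000 / bandwidth_bps
print(f"transfer time: {transfer_ms:.0f} ms")  # 100 ms
```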

My guess is that if HDD latency actually mattered that much, copper users would basically be unable to win any races.

That’s not just bad peering: most folks have a heavily asymmetric connection with a much smaller upstream. Mine is 800 Mbps down and 25 Mbps up, so that’s the actual limit. However, I have never seen Storj come close to saturating it, which points to some latency-induced inefficiency.

On the other hand, in your example it’s still a 10% head start. And that is the worst case: the largest chunk size according to the histogram above. The majority of chunks are smaller, with a huge number of them under 4 KB; the transfer time of those is insignificant, and hence only the seek time along with the network latency matters. Especially if folks have a fiber connection, which does not have the ~10 ms of additional overhead that cable modems exhibit.
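To put numbers on "the transfer time of small chunks is insignificant" (same assumed 160 Mbit/s link as above; the 7.5 ms seek midpoint is illustrative):

```python
bandwidth_bps = 160_000_000  # 160 Mbit/s, as in the example above
seek_ms = 7.5                # midpoint of the 5-10 ms HDD seek range

for chunk_bytes in (4 * 1024, 2_000_000):  # 4 KiB chunk vs 2 MB chunk
    transfer_ms = chunk_bytes * 8 * 1000 / bandwidth_bps
    seek_share = seek_ms / (seek_ms + transfer_ms)
    print(f"{chunk_bytes:>9} B: transfer {transfer_ms:7.2f} ms, "
          f"seek is {seek_share:.0%} of seek+transfer time")
```

For the 4 KiB chunk the transfer takes about 0.2 ms, so the seek is ~97% of the disk-side time; for the 2 MB chunk it drops to ~7%.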

Think about it this way: lowering the average latency helps deliver data to the customer sooner, and as a result increases, on average, the probability that these chunks are among the first 29 (or whatever is needed) to reconstruct the file.

How a reduction in latency translates into winning races on average: no idea. There are too many moving parts to do meaningful A/B testing. Maybe have two equally old, equally sized nodes on the same pool in different datasets, enable caching on one, and observe the rate of cancellations. Then disable it and enable it on the other, to confirm that the reduction in cancellations (if any) follows the enablement of caching, and by how much.
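A sketch of how such an experiment could be evaluated once the counts exist. The counts below are entirely hypothetical, and the two-proportion z-test is just one reasonable way to check whether a difference in cancellation rates exceeds sampling noise:

```python
import math

# Hypothetical counts: (started transfers, cancelled transfers)
# for the caching-enabled and caching-disabled dataset/period.
cached = (10_000, 1_800)
uncached = (10_000, 2_100)

p1 = cached[1] / cached[0]      # cancellation rate with caching
p2 = uncached[1] / uncached[0]  # cancellation rate without caching

# Two-proportion z-test against the pooled rate: |z| > 1.96 means the
# difference is unlikely to be noise at the usual 5% level.
n1, n2 = cached[0], uncached[0]
p_pool = (cached[1] + uncached[1]) / (n1 + n2)
z = (p2 - p1) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(f"cached: {p1:.1%}, uncached: {p2:.1%}, z = {z:.2f}")
```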

It’s not black and white; there is some distribution. Some nodes may have higher latency but be closer, and still win. It’s about when you are in between: when, due to being far away and having high seek latency, your transfer would get canceled, but reducing the seek latency lowers the total latency enough to make the cut.
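That "in between" region can be illustrated with a small Monte Carlo sketch. Everything here is an assumption, not Storj's actual mechanics: 39 nodes race, the fastest 29 win, and latencies are drawn uniformly from the ranges mentioned earlier; a real node's latency distribution will differ.

```python
import random

random.seed(42)

# Assumed race parameters: 39 nodes are asked, the fastest 29 "win".
NODES, NEEDED, TRIALS = 39, 29, 10_000

def latency(cached):
    """One node's time to first byte: network latency plus optional HDD seek."""
    network = random.uniform(10, 20)                 # ms, network latency
    seek = 0.0 if cached else random.uniform(5, 10)  # ms, skipped on cache hit
    return network + seek

def win_rate(our_cached):
    wins = 0
    for _ in range(TRIALS):
        ours = latency(our_cached)
        faster = sum(latency(False) < ours for _ in range(NODES - 1))
        if faster < NEEDED:  # fewer than 29 competitors beat us: we made the cut
            wins += 1
    return wins / TRIALS

rate_uncached = win_rate(False)
rate_cached = win_rate(True)
print(f"win rate without cache: {rate_uncached:.1%}")
print(f"win rate with cache:    {rate_cached:.1%}")
```

With all nodes identical, the win rate is just 29/39 ≈ 74%; removing only our own seek pushes it close to 100% in this toy model, which overstates the real effect precisely because everything else was held equal.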

All the latencies here are of the same order of magnitude: around 10 ms.
