Huge ingress improvement HDD vs. SSD?

I started a node for testing purposes on my ZFS over a year ago (ZFS performance, hashstore migration test). Now I needed the space, and moved the node to a 4TB SATA SSD (for faster migration to my storj rig). Just for fun, I ran it on the SSD for a few days, and saw almost double the ingress. I knew latency is a thing, and HDDs are way slower than SSDs in I/O speeds, but I didn’t know storj already “hammers” the HDD with just around 40-60GB a day. Upgrading all nodes to SSD is way too expensive. Would I still get a significant boost from migrating just the hashstore hashtable file to an SSD?

TiA

Storj node writes are async. So… all data coming in from the Internet is dumped to memory first (and the OS copies data to storage as it has time). 40-60GB of data spread out over a day isn’t going to stress even an old HDD (less than 1MB/sec?).
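As a quick sanity check on that "less than 1MB/sec" figure, here is the arithmetic as a small Python sketch. It assumes decimal units (1 GB = 1000 MB) and perfectly even traffic, which real uploads are not, so it only gives the daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def avg_mb_per_sec(gb_per_day: float) -> float:
    """Average write rate in MB/s for a given daily ingress in GB."""
    return gb_per_day * 1000 / SECONDS_PER_DAY

for gb in (40, 60):
    print(f"{gb} GB/day is about {avg_mb_per_sec(gb):.2f} MB/s on average")
```

That works out to roughly 0.46 to 0.69 MB/s. Bursts will be higher than the average, but the average itself is far below what a single HDD can write sequentially.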

Even though uploads are lumpy, and you don’t get nice smooth transfers spread out over a day… I’d be surprised if an SSD gave much of an advantage. It would definitely win more races sending back to customers… but egress traffic doesn’t make huge $$$.

(I’m just going to let hashstore do-its-thing on regular HDDs.)

I have been running multiple nodes on SSD for years now. Never saw a performance improvement over HDD. :thinking:

I haven’t either… but if it was faster it could make for some interesting setups. Like have some primary nodes on SSDs (filling quick)… then when they were full copy them to HDD (to idle and hopefully stay full)… and you start a new node on SSD again.

If that worked I’m sure someone like @Th3Van would be doing it. (Actually it looks like he really reduced the number of /24s he’s using, and his growth may have plateaued?)

Hm, I would like to know what could cause this. In my case there is a significant gain. Is there a gain for HDD nodes in general from moving the hashtbl files to an SSD for faster access?

I’ve been using the same amount of /24s for like 2-3 years :slight_smile:

I have just rearranged them (this happens every 2 hours) so that the first 36 nodes with the lowest amount of received data each get a dedicated /24, because (in my theory) nodes with less data (fewer pieces) are selected for uploads more often than nodes with more data.

The rest of the nodes are sharing the same /24.

root@server030:/disk101/storj/logs# ../scripts/sort-nodes-by-total-usaged-size.sh

RANK   TOTAL_USED_BYTES STATUS     NODE       WAN_IP             PORT   ERROR
----   ---------------- ------     ----       ------             ----   -----
1      3860054959056    OK         server125  86.52.115.40       32125  
2      3960780592740    OK         server122  37.35.97.30        32122  
3      4989085829914    OK         server123  37.35.98.254       32123  
4      5285139422910    OK         server127  37.35.99.254       32127  
5      5289109243500    OK         server126  37.35.100.142      32126  
6      5296047067498    OK         server115  37.35.101.254      32115  
7      5296274424044    OK         server094  37.35.102.10       32094  
8      5298591692744    OK         server120  37.35.103.254      32120  
9      5299299193730    OK         server131  185.18.0.254       32131  
10     5299647359836    OK         server044  185.18.1.254       32044  
11     5299960444574    OK         server005  185.18.2.254       32005  
12     5303460945158    OK         server095  185.18.3.254       32095  
13     5303830237260    OK         server106  213.5.32.194       32106  
14     5305989521836    OK         server087  213.5.33.254       32087  
15     5306390824530    OK         server058  213.5.34.198       32058  
16     5306974383970    OK         server092  213.5.35.246       32092  
17     5307584960236    OK         server130  213.5.36.254       32130  
18     5307646733804    OK         server078  213.5.37.254       32078  
19     5307679625646    OK         server129  213.5.38.254       32129  
20     5307738577882    OK         server117  213.5.39.38        32117  
21     5307924548772    OK         server086  109.232.77.30      32086  
22     5308097186604    OK         server004  62.243.142.226     32004  
23     5308280275250    OK         server093  80.63.4.2          32093  
24     5308309333512    OK         server113  80.164.100.18      32113  
25     5308312816750    OK         server100  83.89.240.122      32100  
26     5308519342958    OK         server022  87.61.119.126      32022  
27     5308676040124    OK         server013  87.63.103.158      32013  
28     5308858343504    OK         server128  95.166.142.126     32128  
29     5309073000236    OK         server054  194.239.137.126    32054  
30     5309287208450    OK         server121  178.21.89.30       32121  
31     5309513056088    OK         server014  217.198.209.158    32014  
32     5309847762220    OK         server015  217.198.210.58     32015  
33     5310353341266    OK         server103  217.198.212.34     32103  
34     5310446015404    OK         server043  217.198.213.94     32043  
35     5310510560740    OK         server104  217.198.216.130    32104  
36     5310736259992    OK         server046  217.198.217.194    32046  

37     5311003909410    OK         server091  37.35.96.30        32091  
38     5311090612144    OK         server042  37.35.96.30        32042  
39     5311093628922    OK         server111  37.35.96.30        32111  
40     5311322237746    OK         server083  37.35.96.30        32083  
etc...

Th3Van.dk
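For anyone wanting to try a similar rotation, the ranking step could be sketched like this in Python. This is not the script used above (that one is not shown); the function name `assign_subnets` and the "dedicated"/"shared" labels are made up for illustration, and the three rows are copied from the listing.

```python
# (node name, total used bytes); three rows copied from the listing above.
nodes = [
    ("server091", 5311003909410),
    ("server125", 3860054959056),
    ("server122", 3960780592740),
]

def assign_subnets(nodes, dedicated_slots):
    """Rank nodes by used bytes (ascending); the emptiest get a dedicated /24."""
    ranked = sorted(nodes, key=lambda n: n[1])
    return [
        (rank, name, used, "dedicated" if rank <= dedicated_slots else "shared")
        for rank, (name, used) in enumerate(ranked, start=1)
    ]

# The real setup uses 36 dedicated slots; 2 keeps the demo small.
for row in assign_subnets(nodes, dedicated_slots=2):
    print(*row)
```

Actually moving a node to a different WAN IP is the hard part and depends entirely on the routing setup, so it is left out here.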

I must be remembering an old setup: where each /24 had 3 or 4 nodes on it? I don’t have enough experience to know if small-nodes-grow-faster is a thing - but clever configuration!

You are in process of reinventing tiered storage. Carry on :slight_smile:

But as mentioned above — it is already working this way: download to memory, and lazy flush to disk later, unless you for some reason ensured synchronous IO.

With piecestore, further optimization was possible, where the entire metadata could be in RAM or on SSD. With hashstore, half the metadata is obscured, which makes it difficult to optimize at the piece level, but this still will not affect responsiveness; the append writes hashstore does are very cheap.

It would also be weird if SSD affected performance vastly, because HDD latency might be like 10ms, but the latency from the internet connection would be much larger.

But it’s not about internet latency. It’s about your latency vs. your competitors’. Both you and your competitors are affected by “internet latency” in (I’m assuming, likely) the same way, hence you can remove it from both sides of the equation. And then you start competing on hardware.

Isn’t Internet performance the important part? The time it takes to dump uploads into RAM is instant by comparison. Most paid space is in the US, and an enormous amount of those uploads are using Storj’s S3 Gateway. A node in the US on a 10-year-old HDD “close” to that gateway will outgrow and outpay a node in Asia on ramdisk.

Winning races means having low-latency (to begin transfers) and high-throughput (to complete them quickly).

But things would be different… if SNOs were paid more for egress than simply storing cold data. Then an SSD would definitely help win-more/earn-more :money_mouth_face:

No.

Winning races means having latency lower than your competitors, and throughput higher than the competitors. The absolute value does not matter actually. If there are two tortoises racing, one of them will still win.

That’s… an interesting hair to split. It’s a race. Of course it’s against competitors. :winking_face_with_tongue:

Edit: Now that you made me think about it a bit: back when they were performance-testing the network didn’t they add a short/medium-term memory to satellites: to remember “fast” nodes (to try to steer new customer uploads to them)?

If I’m not hallucinating… then node performance is even more important: not only could you lose a current race… but that could mean you aren’t even selected for an upcoming race. :turtle:

Indeed, the “power of two choices” algorithm. And you are right, in pure form it exacerbates the differences. We don’t really know how it is configured right now though, so we can only speculate how much impact it would have here.
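To make the "pure form" concrete, here is a minimal Python sketch of power-of-two-choices selection. It is not the satellite's actual code or configuration (which, as said, we don't know); the node names and success rates are invented.

```python
import random
from collections import Counter

# Hypothetical success rates a satellite might remember for each node.
success_rate = {"fast": 0.9, "mid": 0.8, "slow": 0.7}

def pick_node(rates, rng):
    """Draw two distinct candidates at random and keep the better one."""
    a, b = rng.sample(sorted(rates), 2)
    return a if rates[a] >= rates[b] else b

rng = random.Random(0)
counts = Counter(pick_node(success_rate, rng) for _ in range(3000))
print(counts)
```

With only three nodes, "slow" loses every pairing it appears in and is never selected, while "fast" is selected every time it is drawn (about two thirds of the picks). A small difference in recorded success rate turns into a large difference in selection share, which is the exacerbation mentioned above. A real deployment would presumably soften this.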

And, I don’t think it’s hair-splitting. If you have an SSD and your 50 opponents have HDDs, then all else being equal you will win much more frequently.
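A toy simulation of that claim, in Python. Everything here is an assumption for illustration: the uniform 20-60 ms network latency, the 0.1 ms and 10 ms storage latencies, ten nodes with the first seven finishers "winning", and above all the premise that storage latency sits on the hot path at all.

```python
import random

def win_rates(storage_ms, winners, trials=20000, seed=1):
    """Fraction of races each node finishes among the first `winners`."""
    rng = random.Random(seed)
    wins = [0] * len(storage_ms)
    for _ in range(trials):
        # Every node draws its network latency from the same distribution.
        finish = [rng.uniform(20.0, 60.0) + s for s in storage_ms]
        order = sorted(range(len(finish)), key=finish.__getitem__)
        for node in order[:winners]:
            wins[node] += 1
    return [w / trials for w in wins]

# Node 0: 0.1 ms storage latency (SSD-like); nodes 1-9: 10 ms (HDD-like).
rates = win_rates([0.1] + [10.0] * 9, winners=7)
print(f"node 0: {rates[0]:.2f}, node 1: {rates[1]:.2f}")
```

Under these assumptions the low-latency node wins noticeably more often than any of the others, even though the network latency is identical in distribution for everyone. If writes are fully async and the storage term is effectively zero for all nodes, the advantage disappears.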

Uploads to a node with an HDD… and uploads to a node with an SSD… have gone to memory for the OS to deal with “whenever” for a couple years now (since 1.104.1?). If customer uploads aren’t waiting for a drive to sync… then an SSD can’t deliver a huge win.

Most people don’t even have Internet connections fast enough for an HDD to struggle to keep up with those async writes. And now, with hashstore… even a firehose of tiny uploads gets merged into nice clean sequential writes to 1GB data logs: exactly the type of transfers HDDs prefer (instead of the millions of .sj1 files SSDs are awesome at).

So that’s the “all else equal” part (except, yeah, customer downloads are faster from SSD: but egress is only a small part of payouts).

That leaves the unequal part… being Internet latency and throughput between the node and where the uploads are coming from. Which is usually the US. And usually the us-central Storj S3 gateway.

Piecestore still has synchronous elements in its hot path. Granted, less of them than in the past. Hashstore, as far as I know, doesn’t, but we’re not fully migrated as a network to assume it.

On top of that, the background processes like garbage collection will compete for I/O with live traffic—whether piecestore or hashstore. And obviously again, SSDs will have far more capacity to reduce the impact of background processes on live traffic.

I think we’re looking at the same thing from different angles and agreeing. SSDs are indeed awesome. And yes only new uploads are going into hashstore (except for SNOs who have forced it)… so SSDs are absolutely a win on piecestore (though that should be mostly read-only now?)

I guess I’m saying a fresh node in us-central, on a vanilla HDD… will outgrow/out-race/out-earn an SSD-based node in any other geo: simply due to its better Internet connection to where most-of-the-paying-data-comes-from… most-of-the-time.

I think the only thing that has a stronger correlation to earnings than that Internet advantage… is just the raw number of /24s you have in the US. HDD vs. SSD doesn’t even matter.

So now we are reinventing ZFS :stuck_out_tongue:

The special device on ZFS allowed an amazing arrangement that was killed by hashstore.

Think about it.

For small pieces — latency matters. Throughput does not. Small pieces can reside on SSD. All of them. Because they are small, a lot of them fit into moderately small SSD.

For large pieces — throughput prevails. Latency is not that important. It does not matter if time to first byte is 10ms or 100 if transfer time is 1000ms. Those can sit on HDD.

It’s a win-win.
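The numbers in that argument can be checked with a few lines of Python. The 10 ms / 100 ms time to first byte and the 1000 ms transfer come from the text above; the 5 ms transfer time for a small piece is a made-up value for comparison.

```python
def slowdown(transfer_ms: float, fast_ttfb_ms: float = 10, slow_ttfb_ms: float = 100) -> float:
    """How many times longer the request takes with the slower time to first byte."""
    return (slow_ttfb_ms + transfer_ms) / (fast_ttfb_ms + transfer_ms)

print(f"large piece, 1000 ms transfer: {slowdown(1000):.2f}x")
print(f"small piece, 5 ms transfer (hypothetical): {slowdown(5):.2f}x")
```

For the large piece the slower storage costs about 9% (1100 ms vs. 1010 ms). For the small piece the same 90 ms difference makes the request 7 times slower (105 ms vs. 15 ms).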

You can’t do this with hashstore. Overall node responsiveness went down. Migrating my nodes to hashstore was a mistake. Now I have to deal with spiky compactions and my special device is empty. What a waste. I would go back to piecestore in a blink if I could.

Actually… I’ll ask my robot friend to create a converter. It was suggested to me to try hashstore. I did. I want piecestore back. I’ll get it back.

On the internet latency — some folks have sub-3ms latency from home to the interwebs:

~ % ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=52 time=2.352 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=52 time=2.238 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=52 time=2.192 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=52 time=1.985 ms
^C

Under this circumstance an HDD seek is a whole order of magnitude larger.

Do you mean, ZFS is unable to put smaller chunks of a large file on SSD?

Pieces are written into logs in batches, in append mode. There are no small writes that zfs could optimize and send to special device. The fact that there are 100 small pieces in one large write is hidden from the filesystem.
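A deliberately simplified Python model of why that happens. It assumes the special device takes blocks at or below a size threshold and that a file larger than the recordsize is stored in full-size records; the threshold, recordsize, and piece sizes are all hypothetical, and real ZFS allocation has more detail than this.

```python
KIB = 1024
RECORDSIZE = 1024 * KIB           # hypothetical dataset recordsize
SPECIAL_SMALL_BLOCKS = 64 * KIB   # hypothetical small-block threshold

def blocks_on_special(file_size: int) -> bool:
    """Simplified model: a file's blocks go to the special device only if
    the block size the filesystem sees is at or below the threshold."""
    block_size = min(file_size, RECORDSIZE)
    return block_size <= SPECIAL_SMALL_BLOCKS

pieces = [2 * KIB, 4 * KIB, 16 * KIB, 2048 * KIB]  # made-up piece sizes

# Piecestore-style: one file per piece, so small pieces are visible as small files.
per_file = sum(blocks_on_special(size) for size in pieces)

# Hashstore-style: the same pieces appended to one log, seen as one large file.
in_log = blocks_on_special(sum(pieces))

print(per_file, in_log)
```

In this model three of the four pieces would land on the special device when stored as separate files, but none do once they are appended into a single log, because the filesystem only ever sees large blocks.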