I started a node for testing purposes on my zfs over one year ago (zfs performance, hashstore migration test). Now I needed the space, and moved the node to a 4TB SATA SSD (for faster migration to my storj rig). Just for fun, I ran it on the SSD for a few days, and saw almost double the ingress. I knew latency is a thing, and HDDs are way slower than SSDs in I/O speeds, but I didn’t know storj already “hammers” the HDD with just around 40-60GB a day. Upgrading all nodes to SSD is way too expensive. Would I still get a significant boost from migrating just the hashstore hashtable file to an SSD?
Storj node writes are async. So… all data coming in from the Internet is dumped to memory first (and the OS copies data to storage as it has time). 40-60GB of data spread out over a day isn’t going to stress even an old HDD (less than 1MB/sec?).
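A quick back-of-the-envelope check of that figure, assuming decimal gigabytes and ingress spread perfectly evenly over 24 hours (real traffic is burstier, so this is only the average):

```python
# Average write rate implied by a given amount of daily ingress.
# Assumptions: 1 GB = 1e9 bytes, traffic spread evenly over the whole day.
SECONDS_PER_DAY = 24 * 60 * 60


def avg_mb_per_sec(gb_per_day: float) -> float:
    """Return the mean write rate in MB/s for `gb_per_day` of ingress."""
    return gb_per_day * 1e9 / SECONDS_PER_DAY / 1e6


for gb in (40, 60):
    print(f"{gb} GB/day ~ {avg_mb_per_sec(gb):.2f} MB/s")
```

Both ends of the 40-60GB range come out under 1 MB/s on average.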
Even though uploads are lumpy, and you don’t get nice smooth transfers spread out over a day… I’d be surprised if an SSD gave much of an advantage. It would definitely win more races sending back to customers… but egress traffic doesn’t make huge $$$.
(I’m just going to let hashstore do-its-thing on regular HDDs.)
I haven’t either… but if it was faster it could make for some interesting setups. Like have some primary nodes on SSDs (filling quick)… then when they were full copy them to HDD (to idle and hopefully stay full)… and you start a new node on SSD again.
If that worked I’m sure someone like @Th3Van would be doing it. (Actually it looks like he really reduced the number of /24s he’s using, and his growth may have plateaued?)
Hm, I would like to know what could cause this. In my case there is a significant gain. Is there a gain for HDD nodes in general from moving the hashtable files to an SSD for faster access?
I’ve been using the same amount of /24s for like 2-3 years
I have just rearranged them (happens every 2 hours), so the first 36 nodes with the lowest amount of received data are getting a dedicated /24, because (in my theory) nodes with less data (pieces) are selected more for uploads than nodes with more data.
The rest of the nodes are sharing the same /24.
root@server030:/disk101/storj/logs# ../scripts/sort-nodes-by-total-usaged-size.sh
RANK TOTAL_USED_BYTES STATUS NODE WAN_IP PORT ERROR
---- ---------------- ------ ---- ------ ---- -----
1 3860054959056 OK server125 86.52.115.40 32125
2 3960780592740 OK server122 37.35.97.30 32122
3 4989085829914 OK server123 37.35.98.254 32123
4 5285139422910 OK server127 37.35.99.254 32127
5 5289109243500 OK server126 37.35.100.142 32126
6 5296047067498 OK server115 37.35.101.254 32115
7 5296274424044 OK server094 37.35.102.10 32094
8 5298591692744 OK server120 37.35.103.254 32120
9 5299299193730 OK server131 185.18.0.254 32131
10 5299647359836 OK server044 185.18.1.254 32044
11 5299960444574 OK server005 185.18.2.254 32005
12 5303460945158 OK server095 185.18.3.254 32095
13 5303830237260 OK server106 213.5.32.194 32106
14 5305989521836 OK server087 213.5.33.254 32087
15 5306390824530 OK server058 213.5.34.198 32058
16 5306974383970 OK server092 213.5.35.246 32092
17 5307584960236 OK server130 213.5.36.254 32130
18 5307646733804 OK server078 213.5.37.254 32078
19 5307679625646 OK server129 213.5.38.254 32129
20 5307738577882 OK server117 213.5.39.38 32117
21 5307924548772 OK server086 109.232.77.30 32086
22 5308097186604 OK server004 62.243.142.226 32004
23 5308280275250 OK server093 80.63.4.2 32093
24 5308309333512 OK server113 80.164.100.18 32113
25 5308312816750 OK server100 83.89.240.122 32100
26 5308519342958 OK server022 87.61.119.126 32022
27 5308676040124 OK server013 87.63.103.158 32013
28 5308858343504 OK server128 95.166.142.126 32128
29 5309073000236 OK server054 194.239.137.126 32054
30 5309287208450 OK server121 178.21.89.30 32121
31 5309513056088 OK server014 217.198.209.158 32014
32 5309847762220 OK server015 217.198.210.58 32015
33 5310353341266 OK server103 217.198.212.34 32103
34 5310446015404 OK server043 217.198.213.94 32043
35 5310510560740 OK server104 217.198.216.130 32104
36 5310736259992 OK server046 217.198.217.194 32046
37 5311003909410 OK server091 37.35.96.30 32091
38 5311090612144 OK server042 37.35.96.30 32042
39 5311093628922 OK server111 37.35.96.30 32111
40 5311322237746 OK server083 37.35.96.30 32083
etc...
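The rearrangement logic itself isn't shown in the thread. A minimal sketch of the idea as described (emptiest nodes each get their own address, the rest share one) might look like the following. The function name `assign_ips`, the data layout, and the example addresses (documentation IP ranges) are all made up for illustration and are not taken from the actual scripts.

```python
from typing import Dict, List


def assign_ips(used_bytes: Dict[str, int],
               dedicated_ips: List[str],
               shared_ip: str) -> Dict[str, str]:
    """Give the emptiest nodes one dedicated IP each; all others share one IP."""
    # Rank nodes from least to most stored data.
    ranked = sorted(used_bytes, key=used_bytes.get)
    assignment = {}
    for rank, node in enumerate(ranked):
        if rank < len(dedicated_ips):
            assignment[node] = dedicated_ips[rank]
        else:
            assignment[node] = shared_ip
    return assignment


# Hypothetical example: 4 nodes, 2 dedicated addresses, 1 shared address.
usage = {"nodeA": 5_300, "nodeB": 3_800, "nodeC": 5_310, "nodeD": 4_900}
plan = assign_ips(usage, ["192.0.2.10", "198.51.100.10"], "203.0.113.10")
print(plan)
```

Run on a schedule (every 2 hours in the setup above), this would keep moving the dedicated addresses to whichever nodes currently hold the least data.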
I must be remembering an old setup: where each /24 had 3 or 4 nodes on it? I don’t have enough experience to know if small-nodes-grow-faster is a thing - but clever configuration!
You are in the process of reinventing tiered storage. Carry on.
But as mentioned above — it is already working this way: download to memory, and lazy flush to disk later, unless you for some reason ensured synchronous IO.
With piecestore, further optimization was possible, where the entire metadata could be in RAM or on SSD. With hashstore, half the metadata is obscured, which makes it difficult to optimize at the piece level, but this still will not affect responsiveness; the append writes hashstore does are very cheap.
It would also be weird if an SSD affected performance vastly, because HDD latency might be like 10ms, but the latency from the internet connection would be much larger.
But it’s not about internet latency. It’s about your latency vs. your competitors’. Both you and your competitors are affected by “internet latency” in (most likely, I’m assuming) the same way, hence you can remove it from the equation on both sides. And then you start competing on hardware.
Isn’t Internet performance the important part? The time it takes to dump uploads into RAM is instant by comparison. Most paid space is in the US, and an enormous amount of those uploads are using Storj’s S3 Gateway. A node in the US on a 10-year-old HDD “close” to that gateway will outgrow and outpay a node in Asia on ramdisk.
Winning races means having low-latency (to begin transfers) and high-throughput (to complete them quickly).
But things would be different… if SNOs were paid more for egress than for simply storing cold data. Then an SSD would definitely help win-more/earn-more.
Winning races means having latency lower than your competitors, and throughput higher than the competitors. The absolute value does not matter actually. If there are two tortoises racing, one of them will still win.
That’s… an interesting hair to split. It’s a race. Of course it’s against competitors.
Edit: Now that you made me think about it a bit: back when they were performance-testing the network didn’t they add a short/medium-term memory to satellites: to remember “fast” nodes (to try to steer new customer uploads to them)?
If I’m not hallucinating… then node performance is even more important: not only could you lose a current race… but that could mean you aren’t even selected for an upcoming race.
Indeed, the “power of two choices” algorithm. And you are right, in pure form it exacerbates the differences. We don’t really know how it is configured right now though, so we can only speculate how much impact it would have here.
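For anyone unfamiliar with it, here is a minimal sketch of the textbook power-of-two-choices idea: sample two candidates at random and keep the better one. This is not Storj's actual selection code; the node names and scores are hypothetical, and how the satellite really scores or configures this is not known in this thread.

```python
import random
from collections import Counter


def pick_node(scores, rng):
    """Textbook power-of-two-choices: sample two distinct nodes, keep the better-scored one."""
    a, b = rng.sample(list(scores), 2)
    return a if scores[a] >= scores[b] else b


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical "recent success" scores, purely for illustration.
scores = {"fast": 0.95, "medium": 0.80, "slow": 0.60}
picks = Counter(pick_node(scores, rng) for _ in range(3000))
print(picks)
```

In this pure form the lowest-scored node is never picked, since it loses every pairwise comparison, which is the "exacerbates the differences" effect. A real deployment could soften that in ways this sketch does not model.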
And, I don’t think it’s hair-splitting. If you have an SSD and your 50 opponents have HDDs, then all else equal, you will win much more frequently.
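How much more frequently depends on how the storage gap compares with the spread in network latency between nodes, which is really what the two sides here disagree about. A toy Monte Carlo sketch, not a model of the real network: the 10 ms HDD figure comes from the post above, while the 0.1 ms SSD figure, the uniform network latency, and "the 10 fastest of 51 nodes win" are all assumptions picked for illustration.

```python
import random


def ssd_win_rate(net_spread_ms: float, trials: int = 2000, rivals: int = 50,
                 winners: int = 10, seed: int = 1) -> float:
    """Fraction of races where the single SSD node is among the `winners` fastest."""
    rng = random.Random(seed)
    hdd_ms, ssd_ms = 10.0, 0.1  # assumed storage latencies
    wins = 0
    for _ in range(trials):
        # Every node draws its network latency from the same distribution.
        ssd_time = rng.uniform(0, net_spread_ms) + ssd_ms
        rival_times = [rng.uniform(0, net_spread_ms) + hdd_ms for _ in range(rivals)]
        faster_rivals = sum(t < ssd_time for t in rival_times)
        if faster_rivals < winners:
            wins += 1
    return wins / trials


for spread in (1.0, 200.0):
    print(f"network spread {spread:>5.0f} ms -> SSD win rate {ssd_win_rate(spread):.2f}")
```

With only 1 ms of network spread the SSD node wins every race, because the roughly 10 ms storage gap is larger than any network difference. With 200 ms of spread the edge shrinks a lot, since network luck dominates.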
Uploads to a node with an HDD… and uploads to a node with an SSD… have gone to memory for the OS to deal with “whenever” for a couple years now (since 1.104.1?). If customer uploads aren’t waiting for a drive to sync… then an SSD can’t deliver a huge win.
Most people don’t even have Internet connections fast enough for an HDD to struggle to keep up with those async writes. And now, with hashstore… even a firehose of tiny uploads gets merged into nice clean sequential writes to 1GB data logs: exactly the type of transfers HDDs prefer (instead of the millions of .sj1 files SSDs are awesome at).
So that’s the “all else equal” part (except, yeah, customer downloads are faster from SSD: but egress is only a small part of payouts).
That leaves the unequal part… being Internet latency and throughput between the node and where the uploads are coming from. Which is usually the US. And usually the us-central Storj S3 gateway.
Piecestore still has synchronous elements in its hot path. Granted, fewer of them than in the past. Hashstore, as far as I know, doesn’t, but the network isn’t fully migrated yet, so we can’t assume it.
On top of that, the background processes like garbage collection will compete for I/O with live traffic—whether piecestore or hashstore. And obviously again, SSDs will have far more capacity to reduce the impact of background processes on live traffic.
I think we’re looking at the same thing from different angles and agreeing. SSDs are indeed awesome. And yes only new uploads are going into hashstore (except for SNOs who have forced it)… so SSDs are absolutely a win on piecestore (though that should be mostly read-only now?)
I guess I’m saying a fresh node in us-central, on a vanilla HDD… will outgrow/out-race/out-earn an SSD-based node in any other geo: simply due to its better Internet connection to where most-of-the-paying-data-comes-from… most-of-the-time.
I think the only thing that has a stronger correlation to earnings than that Internet advantage… is just the raw number of /24s you have in the US. HDD vs. SSD doesn’t even matter.
The special device on zfs allowed an amazing arrangement that was killed by hash store.
Think about it.
For small pieces — latency matters. Throughput does not. Small pieces can reside on SSD. All of them. Because they are small, a lot of them fit into moderately small SSD.
For large pieces — throughput prevails. Latency is not that important. It does not matter if time to first byte is 10ms or 100 if transfer time is 1000ms. Those can sit on HDD.
It’s a win-win.
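To put numbers on that: the 10 ms vs 100 ms time to first byte and the 1000 ms large-piece transfer are the figures from the post, while the 1 ms transfer time for a small piece is an assumption added for comparison.

```python
def slowdown(transfer_ms: float,
             fast_ttfb_ms: float = 10.0,
             slow_ttfb_ms: float = 100.0) -> float:
    """How many times longer the whole request takes with the slower time to first byte."""
    return (slow_ttfb_ms + transfer_ms) / (fast_ttfb_ms + transfer_ms)


# 1 ms transfer = assumed small piece, 1000 ms transfer = large piece from the post.
for transfer_ms in (1.0, 1000.0):
    print(f"transfer {transfer_ms:>6.0f} ms -> {slowdown(transfer_ms):.2f}x slower")
```

The large piece takes about 9% longer with the slow first byte, while the small piece takes roughly 9 times longer, which is why only the small pieces need the low-latency tier.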
You can’t do this with hashstore. Overall node responsiveness went down. Migrating my nodes to hashstore was a mistake. Now I have to deal with spiky compactions and my special device is empty. What a waste. I would go back to piecestore in a blink if I could.
Actually… I’ll ask a robot friend to create a converter. It was suggested to me to try hashstore. I did. I want piecestore back. I’ll get it back.
On the internet latency — some folks have sub-3ms latency from home to the interwebs:
~ % ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=52 time=2.352 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=52 time=2.238 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=52 time=2.192 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=52 time=1.985 ms
^C
Under this circumstance HDD seek is a whole order of magnitude larger.
Pieces are written into logs in batches, in append mode. There are no small writes that zfs could optimize and send to special device. The fact that there are 100 small pieces in one large write is hidden from the filesystem.
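A minimal sketch of why that is, assuming nothing about hashstore's real on-disk format: the helper name `append_batch`, the log file name, and the 4 KiB piece size are hypothetical. The point is only that the application keeps its own (offset, length) index, while the filesystem receives a single large append.

```python
import os
import tempfile


def append_batch(log_path, pieces):
    """Append all pieces with one write; return an (offset, length) entry per piece."""
    base = os.path.getsize(log_path) if os.path.exists(log_path) else 0
    index, offset = [], base
    for piece in pieces:
        index.append((offset, len(piece)))  # piece boundaries known only to the app
        offset += len(piece)
    with open(log_path, "ab") as log:
        log.write(b"".join(pieces))  # the filesystem sees one large append
    return index


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log-0001")  # hypothetical log file name
    pieces = [bytes([i]) * 4096 for i in range(100)]  # 100 small 4 KiB pieces
    index = append_batch(path, pieces)
    size = os.path.getsize(path)

print(len(index), "pieces stored in one", size, "byte write")
```

From the filesystem's side there is one 409600-byte append and no 4 KiB writes at all, so a size-based rule for routing small blocks to a special device has nothing to match on.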