Maybe @Th3Van could switch some of his nodes to hashstore and report back?
And still… somehow both nodes had more traffic and a higher completed count in total (could be variations in traffic on the network, of course).
Wed Feb 19 21:29:23 CEST 2025 : All 132 nodes has successfully migrated to hashstore in 525.34 hours (649.316.427.445.212 Bytes in 3.530.760.132 pieces)
Yeah, it's kind of a bug, which should be solved. All four nodes are running on SSDs which are basically idling, and yet there is still a 2% to 3% cancel rate.
Ah, great, thanks. I would have assumed he would wait for the official rollout, and didn't check.
So now we will know which store performs better on his nodes?
I just had a look at one of my recently converted nodes (~8TB stored):
                        PIECESTORE  HASHSTORE
========== AUDIT ==============
Critically failed:      0           0
Critical Fail Rate:     0.000%      0.000%
Recoverable failed:     0           0
Recoverable Fail Rate:  0.000%      0.000%
Successful:             13397       26182
Success Rate:           100.000%    100.000%
========== DOWNLOAD ===========
Failed:                 576         4
Fail Rate:              0.097%      0.001%
Canceled:               3039        2419
Cancel Rate:            0.510%      0.391%
Successful:             591809      617004
Success Rate:           99.393%     99.609%
========== UPLOAD =============
Rejected:               0           0
Acceptance Rate:        100.000%    100.000%
---------- accepted -----------
Failed:                 82          0
Fail Rate:              0.074%      0.000%
Canceled:               1461        1576
Cancel Rate:            1.323%      1.072%
Successful:             108921      145477
Success Rate:           98.603%     98.928%
========== REPAIR DOWNLOAD ====
Failed:                 0           2
Fail Rate:              0.000%      0.003%
Canceled:               0           3
Cancel Rate:            0.000%      0.004%
Successful:             57359       76860
Success Rate:           100.000%    99.994%
========== REPAIR UPLOAD ======
Failed:                 0           0
Fail Rate:              0.000%      0.000%
Canceled:               69          74
Cancel Rate:            1.254%      0.975%
Successful:             5434        7516
Success Rate:           98.746%     99.025%
========== DELETE =============
Failed:                 0           0
Fail Rate:              0.000%      0.000%
Successful:             0           0
Success Rate:           0.000%      0.000%
This is a host similar to @arrogantrabbit's setup, with ZFS (2 raidz1 vdevs, each with 4 spinning-rust disks, plus a dedicated SSD special mirror), plenty of RAM, and performance headroom.
One difference, however: it's connected to fibre with a public IP, so no VPN in the loop.
My conclusion is that, for my setup, there's an improvement in success rates:
+0.2% DOWNLOAD
+0.3% UPLOAD
Perhaps we need to factor in the geographical location of these nodes as well? This host is located in the Nordic part of the EU.
Can anyone provide me with some documentation on how to use/run the success rate script?
Cheers in advance.
You must have logging enabled at the info level.
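For reference, here is a minimal sketch of how it is typically run (assuming the community successrate.sh script is already downloaded and executable; the log path and container name are placeholders):
# Run the success-rate script against a plain-text storagenode log file
./successrate.sh /var/log/storagenode.log
# For a Docker node, dump the container logs to a file first, then run the script on it
docker logs storagenode > /tmp/node.log 2>&1
./successrate.sh /tmp/node.log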
I have two nodes here at the same IP, with similar data sizes; one is using memtbl (but the old storage is not migrated) and the other is piecestore.
Download success rate is slightly better on memtbl, 99.5% vs 98.2%.
Upload rates are similar, 99.6% for both.
Another, much smaller node on the same IP, entirely migrated to memtbl, has a download rate of 99.8% and an upload rate of 99.9%.
Another node is on a different IP, has good RAM and CPU, but slow NFS network-mounted storage. It's on memtbl, but most data is still on piecestore. Download is 97.5% and upload is 99.1%.
Then I have two potato nodes: only 1 core, only 1 GB RAM, slow NFS-mounted storage. They are using hashtbl for new storage, but most data is still on piecestore.
Potato 1 has a download success of 90.8% and an upload success of 91.5%.
Potato 2 has a download success of 98.5% and an upload success of 97.8%.
Potato 1 has less data migrated to hashstore, but I'm not sure of the reason for its higher cancel rate.
How did you activate memtbl, then? I asked around but didn't receive any concrete answer so far.
A "month" turned out to be an exaggeration. I do have 20 days' worth of logs, though.
Node in WA:
root@storagenode-seven:~ # zsh -c 'for f in /var/log/storagenode.*.bz2(n); do echo $f: $({bzcat $f > /tmp/1.txt && ./successrate.sh /tmp/1.txt} | grep -A 5 accepted | grep "Cancel Rate"); done'
/var/log/storagenode.log.0.bz2: Cancel Rate: 2.529%
/var/log/storagenode.log.1.bz2: Cancel Rate: 2.593%
/var/log/storagenode.log.2.bz2: Cancel Rate: 1.916%
/var/log/storagenode.log.3.bz2: Cancel Rate: 2.186%
/var/log/storagenode.log.4.bz2: Cancel Rate: 3.060%
/var/log/storagenode.log.5.bz2: Cancel Rate: 1.345% <--- switch to hashstore
/var/log/storagenode.log.6.bz2: Cancel Rate: 0.033%
/var/log/storagenode.log.7.bz2: Cancel Rate: 0.035%
/var/log/storagenode.log.8.bz2: Cancel Rate: 0.069%
/var/log/storagenode.log.9.bz2: Cancel Rate: 0.085%
/var/log/storagenode.log.10.bz2: Cancel Rate: 0.178%
/var/log/storagenode.log.11.bz2: Cancel Rate: 0.124%
/var/log/storagenode.log.12.bz2: Cancel Rate: 0.110%
/var/log/storagenode.log.13.bz2: Cancel Rate: 0.160%
/var/log/storagenode.log.14.bz2: Cancel Rate: 0.133%
/var/log/storagenode.log.15.bz2: Cancel Rate: 0.073%
/var/log/storagenode.log.16.bz2: Cancel Rate: 0.034%
/var/log/storagenode.log.17.bz2: Cancel Rate: 0.060%
/var/log/storagenode.log.18.bz2: Cancel Rate: 0.095%
/var/log/storagenode.log.19.bz2: Cancel Rate: 0.023%
Node in CA:
storj-eight# zsh -c 'for f in /var/log/storagenode.*.bz2(n); do echo $f: $({bzcat $f > /tmp/1.txt && ./successrate.sh /tmp/1.txt} | grep -A 5 accepted | grep "Cancel Rate"); done'
/var/log/storagenode.log.0.bz2: Cancel Rate: 1.421%
/var/log/storagenode.log.1.bz2: Cancel Rate: 1.968%
/var/log/storagenode.log.2.bz2: Cancel Rate: 1.284%
/var/log/storagenode.log.3.bz2: Cancel Rate: 1.617%
/var/log/storagenode.log.4.bz2: Cancel Rate: 1.686%
/var/log/storagenode.log.5.bz2: Cancel Rate: 0.124% <-- switch to hashstore
/var/log/storagenode.log.6.bz2: Cancel Rate: 0.057%
/var/log/storagenode.log.7.bz2: Cancel Rate: 0.037%
/var/log/storagenode.log.8.bz2: Cancel Rate: 0.049%
/var/log/storagenode.log.9.bz2: Cancel Rate: 0.058%
/var/log/storagenode.log.10.bz2: Cancel Rate: 0.100%
/var/log/storagenode.log.11.bz2: Cancel Rate: 0.075%
/var/log/storagenode.log.12.bz2: Cancel Rate: 0.077%
/var/log/storagenode.log.13.bz2: Cancel Rate: 0.108%
/var/log/storagenode.log.14.bz2: Cancel Rate: 0.072%
/var/log/storagenode.log.15.bz2: Cancel Rate: 0.048%
/var/log/storagenode.log.16.bz2: Cancel Rate: 0.033%
/var/log/storagenode.log.17.bz2: Cancel Rate: 0.052%
/var/log/storagenode.log.18.bz2: Cancel Rate: 0.052%
/var/log/storagenode.log.19.bz2: Cancel Rate: 0.022%
So… I don't think it's a fluke. It's on two different servers (albeit configured by the same dude).
To add a broader perspective, I've done the same on my node (from my previous post).
2025-09-10 Cancel Rate: 0.868%
2025-09-09 Cancel Rate: 1.163%
2025-09-08 Cancel Rate: 1.072%
2025-09-07 Cancel Rate: 4.122% <-- IGNORE | FAILED HDD
2025-09-06 Cancel Rate: 1.905% | RESILVER ON POOL
2025-09-05 Cancel Rate: 0.986%
2025-09-04 Cancel Rate: 1.396%
2025-09-03 Cancel Rate: 1.106%
2025-09-02 Cancel Rate: 1.033%
2025-09-01 Cancel Rate: 0.877%
2025-08-31 Cancel Rate: 1.242%
2025-08-30 Cancel Rate: 0.898% <-- HASHSTORE MIG COMPLETE
2025-08-29 Cancel Rate: 0.925%
2025-08-28 Cancel Rate: 1.235%
2025-08-27 Cancel Rate: 1.024%
2025-08-26 Cancel Rate: 1.135%
2025-08-25 Cancel Rate: 1.040%
2025-08-24 Cancel Rate: 1.052%
2025-08-23 Cancel Rate: 1.235%
2025-08-22 Cancel Rate: 1.391%
2025-08-21 Cancel Rate: 1.280%
2025-08-20 Cancel Rate: 1.306%
2025-08-19 Cancel Rate: 1.203%
2025-08-18 Cancel Rate: 1.067%
2025-08-17 Cancel Rate: 1.146%
2025-08-16 Cancel Rate: 1.323%
2025-08-15 Cancel Rate: 1.391%
With this more detailed historical outlook, it's not so clear to me anymore.
Average cancel rates before and after migration:
Piecestore: 1.184%
Hashstore:  1.141%
The difference is a mere 0.043 percentage points.
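For anyone who wants to reproduce a per-day breakdown like the one above, here is a rough sketch (assuming a single combined node.log whose lines start with an ISO timestamp, and successrate.sh in the current directory; the file names are placeholders):
# Split the log per calendar day and print the accepted-upload cancel rate for each date
for day in $(awk '{ print substr($1, 1, 10) }' node.log | sort -u); do
  grep "^$day" node.log > /tmp/day.log
  echo "$day $(./successrate.sh /tmp/day.log | grep -A 5 accepted | grep 'Cancel Rate')"
done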
Has anyone tried defragmentation after migrating? On NTFS it's nearly all (99%) fragmented.
Has anyone tried defragmentation after migrating?
My ZFS pool fragmentation is steady at ~27% before and after migration.
Hmm… interesting. So your cancel rate was high to begin with and did not change. Mine was very small, and the change is noticeable.
I just realized that since the storagenode does small appends to huge files, I shall reduce the recordsize from 128k to, say, 32k or 16k, to better align with the usage pattern. Maybe this will help? I shall try that.
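For what it's worth, a minimal sketch of what that change looks like on ZFS (the dataset name tank/storagenode is a placeholder; note that recordsize only applies to blocks written after the change, so existing data keeps its current record size until rewritten):
# Check the current record size of the dataset (placeholder name)
zfs get recordsize tank/storagenode
# Lower it to 32K; only newly written blocks will use the smaller record size
zfs set recordsize=32K tank/storagenode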
Truth be told, the storage node isn't in a position to know if an upload is successful or not. The peer that knows if an upload to a storagenode was successful is actually the Satellite (and the Uplink temporarily by proxy, but it has no consistent memory). A storagenode may think an upload to it was successful, but only the Satellite keeps track of which nodes were actually part of the fastest set. Even if a node never gets a cancelation and by all appearances the upload looks successful, unless the Satellite agrees, that data is considered garbage. The uplink attempts to alert the storage node if it was unsuccessful in a variety of ways, and whenever the Uplink can tell the storage node it lost the race, that is good, because the storage node then can preemptively clean up that data instead of leaving it around and waiting for garbage collection.
So! We have a possible hypothesis (though we still need to collect data to figure out if it's right). Hashstore has a much smaller critical section that waits for disk activity than piecestore does. Perhaps hashstore is better at receiving cancelation requests than piecestore is? Perhaps piecestore has an unreasonably high success rate because a higher percentage of uploads are false successes? In this scenario, a lower success rate may actually simply mean a more accurate success rate.
So here is what we need to gather and check (and perhaps forum readers can help):
- What is the percent of unsuccessful uploads we actually expect in practice due to long tail cancelation? In theory, this is as high as 30% (!), because uploads to the network upload 70 pieces per segment and only wait for the fastest 49. So, 21/70 pieces are theoretically canceled, but do we sometimes do better than 49? How often?
- How often is very-recently-uploaded data immediately eligible for garbage collection? For hashstore? For piecestore?
- What percent of pieces are considered successful across all nodes? Hashstore nodes, piecestore nodes? It should match the Satellite, right? If it's higher than what the Satellite tracks as successful, then we have "false" successes.
Basically, a network-wide average success rate of 100% is actually bad. We always upload more than we need to be successful, and, theoretically, across all nodes (some nodes will be āsuccessfulā less often than others, i.e. win less races) we should be seeing about 70% piece success across all pieces (49/70). If the network as a whole is reporting average scores higher than 70%, then that means the network is gaining garbage at a faster rate than we expect. 30% of uploads should be considered failed by the nodes, so the nodes can clean that data up instead of letting it sit around unpaid until garbage collection.
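For a quick local sanity check, one could compare a node's accepted-upload success rate from successrate.sh against the theoretical network-wide average above (a rough sketch; node.log is a placeholder path):
# This node's accepted-upload success rate, as reported by successrate.sh
./successrate.sh node.log | grep -A 6 accepted | grep "Success Rate"
# Theoretical network-wide average if exactly 49 of the 70 uploaded pieces win each race
echo "expected network-wide average: 49/70 = 70%"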
Suffice it to say, one thing we've been noticing on the hashstore-based Select network is less garbage. Perhaps this is why, and perhaps we shouldn't be afraid of lower-than-piecestore success rates? But this is just a hypothesis; we'll try to disconfirm it.
Edited to add: an individual storage node can certainly be more successful and try to target a 100% success rate. Any success rate above the network average means that it is more likely to win races than the average node. Any success rate below the network average means that it is less likely to win races. But the network average should very definitely not be 100%, or something has gone wrong and all the nodes are storing way more data than they are getting paid for (garbage). So I would expect the average node even here on the forum is losing races in double-digit percentages. This should be showing in both hashstore and piecestore. If hashstore says this and piecestore doesn't, then I'm actually more suspicious of piecestore.
Perhaps piecestore has an unreasonably high success rate because a higher percentage of uploads are false successes? In this scenario, a lower success rate may actually simply mean a more accurate success rate.
This makes a lot of sense, unless the satellite is very good with node selection and my nodes are significantly faster than everyone else's (they aren't; if anything, my cable internet has quite high latency compared to the fiber that literally everyone around me is able to get).
This hypothesis (that the piecestore cancellation rate as seen by the node is inaccurate) shall be quite easy to validate: can you please check on the satellite side what the actual success rate on this node was historically? Was it actually 99.9%, and/or are the currently reported values (97-98%) closer to actual, or is everything BS?
12nRLLozTqKdD5KRjLu23C8Xz6ZKxkzngpRhoKUZtfruDcCMpar
If historical data is not available, here is my other node that still runs piecestore, which also has an unrealistically low cancel rate:
1zWMZAUsyxpi1v9J5Me9ukcdKiaCDEpeftiTSdcrXJh7RvHEyf
========== AUDIT ==============
Critically failed: 0
Critical Fail Rate: 0.000%
Recoverable failed: 0
Recoverable Fail Rate: 0.000%
Successful: 36427
Success Rate: 100.000%
========== DOWNLOAD ===========
Failed: 1736
Fail Rate: 0.345%
Canceled: 2990
Cancel Rate: 0.593%
Successful: 499111
Success Rate: 99.062%
========== UPLOAD =============
Rejected: 0
Acceptance Rate: 100.000%
---------- accepted -----------
Failed: 1
Fail Rate: 0.001%
Canceled: 38
Cancel Rate: 0.024%
Successful: 155155
Success Rate: 99.975%
========== REPAIR DOWNLOAD ====
Failed: 0
Fail Rate: 0.000%
Canceled: 2
Cancel Rate: 0.003%
Successful: 71758
Success Rate: 99.997%
========== REPAIR UPLOAD ======
Failed: 0
Fail Rate: 0.000%
Canceled: 43
Cancel Rate: 0.695%
Successful: 6147
Success Rate: 99.305%
========== DELETE =============
Failed: 0
Fail Rate: 0.000%
Successful: 0
Success Rate: 0.000%
Hashstore has a much smaller critical section that waits for disk activity than piecestore does. Perhaps hashstore is better at receiving cancelation requests than piecestore is? Perhaps piecestore has an unreasonably high success rate because a higher percentage of uploads are false successes?
This does make a lot of sense. But why the stark difference between normal uploads and "repair uploads"? Do repairs have an even smaller critical section? Or are they better at receiving cancellation notifications?
I just realized that since the storagenode does small appends to huge files, I shall reduce the recordsize from 128k to, say, 32k or 16k, to better align with the usage pattern. Maybe this will help? I shall try that.
Small appends, yes, but they're not synced (unless you set STORJ_HASHSTORE_STORE_SYNC_WRITES, which is not the default), so the OS is free to wait until there's an actually sizable chunk of data to write. With some luck, the OS might accumulate dozens of megabytes before doing actual I/O.
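As a minimal illustration (not a recommendation), that variable would simply be set in the node process's environment before it starts; leaving it unset keeps the default, unsynced behavior described above:
# Opt in to synced hashstore writes (placeholder invocation; all other
# storagenode flags, identity, and config are omitted here)
export STORJ_HASHSTORE_STORE_SYNC_WRITES=true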