Poor Success rate on hashstore

Since the conversion to hashstore, success rates on repair uploads have plummeted on all of my hashstore nodes. Nodes still on piecestore still show a 97% repair upload success rate.

All other upload/download stats have remained consistent.

Before Conversion:

========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              199
Cancel Rate:           2.604%
Successful:            7443
Success Rate:          97.396%

After Conversion:

========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              1925
Cancel Rate:           38.834%
Successful:            3032
Success Rate:          61.166%

Great that you have started this thread, as I am seeing very bad success rates as well since moving to hashstore.

However, I cannot confirm that it affects repair uploads only.

docker logs storagenode 2>&1 | grep uploaded | grep -v REPAIR | wc -l
169400
docker logs storagenode 2>&1 | grep uploaded | grep REPAIR | wc -l
16426
docker logs storagenode 2>&1 | grep "upload canceled" | grep -v REPAIR | wc -l
66168
docker logs storagenode 2>&1 | grep "upload canceled" | grep REPAIR | wc -l
5430

This puts the success rate for uploads at around 70%, which is worse than it was on piecestore:

Guess when the migration started. I am still investigating, though. It could be that the numbers will recover, as the migration has completed and the node has not been restarted yet. But currently we see around 70% for uploads and 93% for downloads on this node.

Is active migration still running on your nodes? If yes, what happens to the success rates when you disable active migration for a day or so and only run passive migration?

1 Like

No, migration is complete.

What I think is happening:
larger nodes were bottlenecked by the sheer number of pieces they were storing. This made them slow to respond, which is why smaller nodes got high success rates. When I say larger nodes, I don't mean 24TB drives connected to a Raspberry Pi.
Now that larger nodes are able to store/retrieve as fast as they can, the smaller (and by extension slower) nodes can’t compete with them, which is why there is a drop in success rates.

TL;DR: You are scanning 1000 files to get the data and send it out; I'm scanning 1,000,000. You are faster = higher success rate for you. Now that I'm scanning 1000 files as well (and you are scanning 10), I'm faster, since my network pipe isn't waiting for the node to retrieve data = higher success rate for me.

1 Like

What you say makes sense, but this is only happening on repair uploads, not uploads, not downloads, not repair downloads.

And the node is far from a Raspberry Pi.

I'm getting these values on a quite new node. Migration is all done.

docker logs STORJ-1 2>&1 | grep uploaded | grep -v REPAIR | wc -l
1770
docker logs STORJ-1 2>&1 | grep uploaded | grep REPAIR | wc -l
118
docker logs STORJ-1 2>&1 | grep "upload canceled" | grep -v REPAIR | wc -l
36
docker logs STORJ-1 2>&1 | grep "upload canceled" | grep REPAIR | wc -l
1
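
As an aside, the four greps can be collapsed into a single pass that prints both rates directly. A rough sketch, assuming the same "uploaded" / "upload canceled" messages and REPAIR marker the commands above grep for:

docker logs STORJ-1 2>&1 | awk '
  /uploaded/        { if (/REPAIR/) rs++; else ns++ }   # successful uploads
  /upload canceled/ { if (/REPAIR/) rc++; else nc++ }   # canceled uploads
  END {
    if (ns + nc) printf "normal uploads: %.1f%%\n", 100*ns/(ns+nc)
    if (rs + rc) printf "repair uploads: %.1f%%\n", 100*rs/(rs+rc)
  }'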

These are my success rates for one node:


If I remember correctly, they got way better than before. I can try to find old screenshots or info, but I believe I had fail rates of about 1% before. Now it's just 0.1-0.2%.

1 Like

There is an easy explanation. The repair workers are mostly running in a single location, and you might have bad latency to that location. It would not impact repair downloads, since there is no long-tail cancellation on repair downloads. And normal uploads and downloads are coming from a different location. In fact, on my nodes I even see different success rates depending on which customer is dominating the activity. In the middle of the night a customer close to me kicks off a backup job, and I see high activity and great success rates. The next day the same happens, but this time with a customer far away, and my success rate goes down.

4 Likes

But ONLY on nodes CONVERTED to HASHSTORE. A node that is not converted to hashstore runs at 97% on repair uploads.

1 Like

After a node restart, the upload success rate has slightly improved to currently 74%. But it is still much lower than it was before on piecestore.

I noticed a small drop in success rates while the migration was running, but it returned to normal once the migration finished. Even during migration it never dropped below 90%.

I also observe success rate degradation after the migration to hashstore completed.

While my results are not as dramatic as the ones reported by the OP, there is indeed some regression, and it needs to be looked at. If anything, the issue seems to be amplified: the worse the success rate was to begin with, the more it regresses.

Biggish node in Washington:

        PIECESTORE                              HASHSTORE
========== AUDIT ============== 	========== AUDIT ==============
Critically failed:     0        	Critically failed:     0
Critical Fail Rate:    0.000%   	Critical Fail Rate:    0.000%
Recoverable failed:    0        	Recoverable failed:    0
Recoverable Fail Rate: 0.000%   	Recoverable Fail Rate: 0.000%
Successful:            3238     	Successful:            3748
Success Rate:          100.000% 	Success Rate:          100.000%
========== DOWNLOAD =========== 	========== DOWNLOAD ===========
Failed:                438      	Failed:                710
Fail Rate:             0.287%   	Fail Rate:             0.393%
Canceled:              1714     	Canceled:              1539
Cancel Rate:           1.123%   	Cancel Rate:           0.852%
Successful:            150527   	Successful:            178486
Success Rate:          98.591%  	Success Rate:          98.756%
========== UPLOAD ============= 	========== UPLOAD =============
Rejected:              0        	Rejected:              0
Acceptance Rate:       100.000% 	Acceptance Rate:       100.000%
---------- accepted ----------- 	---------- accepted -----------
Failed:                0        	Failed:                0
Fail Rate:             0.000%   	Fail Rate:             0.000%
Canceled:              33       	Canceled:              1450
Cancel Rate:           0.069%   	Cancel Rate:           2.678%
Successful:            47952    	Successful:            52693
Success Rate:          99.931%  	Success Rate:          97.322%
========== REPAIR DOWNLOAD ==== 	========== REPAIR DOWNLOAD ====
Failed:                0        	Failed:                1
Fail Rate:             0.000%   	Fail Rate:             0.004%
Canceled:              1        	Canceled:              1
Cancel Rate:           0.029%   	Cancel Rate:           0.004%
Successful:            3500     	Successful:            24399
Success Rate:          99.971%  	Success Rate:          99.992%
========== REPAIR UPLOAD ====== 	========== REPAIR UPLOAD ======
Failed:                0        	Failed:                0
Fail Rate:             0.000%   	Fail Rate:             0.000%
Canceled:              16       	Canceled:              432
Cancel Rate:           2.381%   	Cancel Rate:           14.343%
Successful:            656      	Successful:            2580
Success Rate:          97.619%  	Success Rate:          85.657%
========== DELETE ============= 	========== DELETE =============
Failed:                0        	Failed:                0
Fail Rate:             0.000%   	Fail Rate:             0.000%
Successful:            0        	Successful:            0
Success Rate:          0.000%   	Success Rate:          0.000%

Small-ish node in California:

        PIECESTORE                               HASHSTORE
========== AUDIT ==============      ========== AUDIT ==============
Critically failed:     0             Critically failed:     0
Critical Fail Rate:    0.000%        Critical Fail Rate:    0.000%
Recoverable failed:    0             Recoverable failed:    0
Recoverable Fail Rate: 0.000%        Recoverable Fail Rate: 0.000%
Successful:            985           Successful:            1234
Success Rate:          100.000%      Success Rate:          100.000%
========== DOWNLOAD ===========      ========== DOWNLOAD ===========
Failed:                168           Failed:                125
Fail Rate:             0.115%        Fail Rate:             0.077%
Canceled:              680           Canceled:              600
Cancel Rate:           0.464%        Cancel Rate:           0.371%
Successful:            145549        Successful:            160842
Success Rate:          99.421%       Success Rate:          99.551%
========== UPLOAD =============      ========== UPLOAD =============
Rejected:              0             Rejected:              0
Acceptance Rate:       100.000%      Acceptance Rate:       100.000%
---------- accepted -----------      ---------- accepted -----------
Failed:                0             Failed:                7
Fail Rate:             0.000%        Fail Rate:             0.006%
Canceled:              82            Canceled:              1580
Cancel Rate:           0.056%        Cancel Rate:           1.385%
Successful:            145809        Successful:            112525
Success Rate:          99.944%       Success Rate:          98.609%
========== REPAIR DOWNLOAD ====      ========== REPAIR DOWNLOAD ====
Failed:                0             Failed:                0
Fail Rate:             0.000%        Fail Rate:             0.000%
Canceled:              0             Canceled:              0
Cancel Rate:           0.000%        Cancel Rate:           0.000%
Successful:            1643          Successful:            5592
Success Rate:          100.000%      Success Rate:          100.000%
========== REPAIR UPLOAD ======      ========== REPAIR UPLOAD ======
Failed:                0             Failed:                0
Fail Rate:             0.000%        Fail Rate:             0.000%
Canceled:              31            Canceled:              416
Cancel Rate:           2.034%        Cancel Rate:           11.082%
Successful:            1493          Successful:            3338
Success Rate:          97.966%       Success Rate:          88.918%
========== DELETE =============      ========== DELETE =============
Failed:                0             Failed:                0
Fail Rate:             0.000%        Fail Rate:             0.000%
Successful:            0             Successful:            0
Success Rate:          0.000%        Success Rate:          0.000%

It appears the upload success rate took the hit in both cases. I'm not sure why: sync writes are disabled on the dataset. Maybe there is some delay in acknowledging completion, because hashstore needs to do more maintenance on each write? Both nodes are on ZFS arrays with a special device and plenty of RAM.

In the meantime I'll set the storage2migration.suppress-central-migration: true flag on my other nodes until this is triaged and addressed.
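
Roughly, for a docker node, something like this (the config path and container name are just placeholders, adjust to your setup):

# add the flag to config.yaml, then restart the node so it takes effect
echo 'storage2migration.suppress-central-migration: true' >> /mnt/storj/config/config.yaml
docker restart storagenode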

Perhaps an interesting workaround would be to have data uploaded to piecestore first and then migrated to hashstore. But if piecestore is to be retired, the regression needs to be fixed: once everyone migrates to hashstore, the apparent success rates will recover on their own as competing nodes become just as slow, but that's not good for the network.

3 Likes

Best of both worlds? That's an interesting idea if you put piecestore on an SSD just for uploads. It would be the easiest thing for an SNO to implement.

So everything is slower then? But Hashstore was supposed to be faster.

With ZFS it's better: data goes to RAM and is written to disk periodically as one transaction.

The way I understand it, it is faster in the sense that it is supposed to scale better. Perhaps it has a fixed latency that is higher than that of piecestore, but under very heavy load, when that latency becomes irrelevant, it manages fine (linearly?). And piecestore can be very efficient at relatively low load, but chokes under high load (logarithmically?).

In the same way, people say ext4 is much faster than ZFS for a small number of files on weak hardware, but ZFS can shovel trillions of files without batting an eye, while ext4 would have long since choked. That is, ZFS can be slower than ext4 on a 2GB Raspberry Pi, but it's still faster on every imaginable metric under real load.

I feel that since it was developed for the Select network, that's what mattered there.

Everything above is 100% speculation.

My next experiment would be to configure that same node as:

{"PassiveMigrate":true,"WriteToNew":false,"ReadNewFirst":false,"TTLToNew":false}

and keep the migrate chore on, to see if the success rate recovers.

1 Like

Isn’t this what was intended with the filestore.write-buffer-size setting?

If not, then yes, writing to RAM would be fastest. But you'll need the RAM. Otherwise you could put piecestore on a (cheap but fast) NVMe drive; it would still be very fast and get faster as the technology evolves. For SNOs this would be the easiest thing to do and would not require a change in setup other than installing an SSD.

I feel the same. But the Global network is nowhere near (and maybe never will be) what the Select network is handling. According to the initial post:

That could indeed yield interesting results.

Not really: a node with piecestore on a conventional filesystem still needs to write multiple files, and that is inherently slow and IO-bound.

ZFS batches writes into periodic transactions, thus avoiding the random IO.

Speaking of transactions, I've just noticed I had a very small record size on those datasets, appropriate for piecestore but needlessly small for hashstore. I'm going to set it to 16M, do a zfs send/receive, and then monitor for a day to see if that has any impact. It should not, but for the purity of the experiment.
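
Roughly (the pool/dataset name is a placeholder; 16M records also need an OpenZFS version or zfs_max_recordsize setting that allows blocks that large):

# recordsize only applies to data written after the change;
# existing data keeps its old block size until it is rewritten
zfs set recordsize=16M tank/storagenode
zfs get recordsize tank/storagenode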

3 Likes

Normal upload stats are almost the same. It's only a large drop in repair uploads, which to me doesn't indicate a filesystem or memory bottleneck, but rather a difference in code paths between upload and repair upload.

1 Like

I also observe success rate degradation after the migration to hashstore completed.

Thanks for the report. Unfortunately it's quite hard to investigate without more data, especially as we see different data:

  • It's not clear how you collect the data (from logs? from Prometheus metrics?)
  • It's not clear what's happening in the background: is there any IO load on the node? Compaction times?

Do you have Prometheus or similar monitoring (Thanos/Victoria/…) with historical data?

Did you enable telemetry on your nodes? That would give me a chance to see the data, even if it's hard to convert it into something usable, at least if you share the node_id.

If you share the node_id, I can also check the satellite-level cancellation rate (in the code it's called success_rate/failure_rate, and it's very similar).

I am also open to joining a call and checking the monitoring together.

In case you use Prometheus monitoring, let me share some example metrics that we use:

Active compaction for all the servers.

sum by (...) (function{name="__Store__Compact",scope="storj_io_storj_storagenode_hashstore",field="current"})

I also like the compaction details, but this can be used only for one instance:

hashstore{scope="storj_io_storj_storagenode",field="Compaction_TotalRecords"}

(Compaction is executed in multiple phases. Here the last compaction took ~7 hours. The first compaction decreased the number of records (the Y axis), and the next 5 re-compressed log files.)

The rate of these compactions can also be interesting:

rate(hashstore{scope="storj_io_storj_storagenode",field="Compaction_ProcessedRecords"}[2m])

One randomly picked server shows me 3-4k for longer compactions.

I would rather turn off the migration: migration generates more IO load. You wrote that it has finished; on ext4 even a finished migration can cause problems, but you are on ZFS. So I'm not sure, but I would give it a try :thinking:

2 Likes

I'm just feeding the log file, which rotates daily, to GitHub - ReneSmeekes/storj_success_rate. So it's an aggregate of the past 21-22 hours of node operation.
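
Roughly, something like this against the rotated log (the log path below is just an example; if I recall, the script accepts either a docker container name or a log file path):

git clone https://github.com/ReneSmeekes/storj_success_rate.git
./storj_success_rate/successrate.sh /var/log/storagenode/storagenode.log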

This server hosts a bunch of other stuff, but the IO on the disks rarely exceeds 60 IOPS. This is a ZFS pool of two raidz1 vdevs, each comprising 4 disks, plus a special device that is a mirror of two SATA SSDs.

Where can I look it up?

No, but I have daily rotated logs for the past month.

There are a bunch of settings. Which ones do I need to uncomment?

# address(es) to send telemetry to (comma-separated)
# metrics.addr: collectora.storj.io:9000

# application name for telemetry identification. Ignored for certain applications.
# metrics.app: storagenode

# application suffix. Ignored for certain applications.
# metrics.app-suffix: -release

# address(es) to send telemetry to (comma-separated IP:port or complex BQ definition, like bigquery:app=...,project=...,dataset=..., depends on the config/usage)
# metrics.event-addr: eventkitd.datasci.storj.io:9002

# size of the internal eventkit queue for UDP sending
# metrics.event-queue: 10000

# instance id prefix
# metrics.instance-prefix: ""

# how frequently to send up telemetry. Ignored for certain applications.
# metrics.interval: 1m0s

# maximum duration to wait before requesting data
# nodestats.max-sleep: 5m0s

# how often to sync reputation
# nodestats.reputation-sync: 4h0m0s

# how often to sync storage
# nodestats.storage-sync: 12h0m0s

# operator email address

12nRLLozTqKdD5KRjLu23C8Xz6ZKxkzngpRhoKUZtfruDcCMpar

It's connected via the nearest AirVPN endpoint (Vancouver); I don't have a public IP on that server.

Let's exhaust what we can do in the forum first; it's 1 AM here…

I can try setting up Prometheus, but it will take some time.

Right, I meant to have new data go to piecestore, and then have the migration chore move it to hashstore periodically, suspecting that with hashstore the node may be taking longer to acknowledge receipt of the data to the client, for whatever reason. Or do I misunderstand the process?

Another experiment I could try is to configure passive migration on the other node and see if its success rate drops. That should be pretty easy and safe to do for a day.

1 Like