Hashstore rollout commencing!

If you want to use variables, any option can be converted to an environment variable by adding the STORJ_ prefix and replacing all dashes and dots with underscores.

This is answered in the first post:

You may also just add this option to the config.yaml file, e.g.:

storage2migration.suppress-central-migration: true

Save the config and restart the node.
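
For example, applying the conversion rule above to this option (STORJ_ prefix, dashes and dots replaced with underscores, uppercased), the same setting passed as an environment variable in a docker run command would look roughly like this:

-e STORJ_STORAGE2MIGRATION_SUPPRESS_CENTRAL_MIGRATION=true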

That’s an interesting observation. I don’t know if that’s the cause on my nodes. However, I am seeing strange filewalker behavior after a restart. It seems that it does not complete:

docker logs storagenode | grep -E "filewalker|used|reaped"
2025-08-28T02:55:49Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

And that was 2 hours ago, whereas it would normally complete within minutes thanks to the badger cache.
The debug log shows that the migration is running and that size data is being removed from the badger cache.

I cannot yet tell whether this is a coincidence, but maybe something prevents the filewalker from finishing after a restart, causing an issue with the used-space calculation.

What would happen if you removed the prefix database and restarted? Would it finish?

Do you mean this one: used_space_per_prefix.db?

Can I remove it and run the node without recreating it first?

I am just seeing another log entry:

2025-08-28T05:17:05Z    DEBUG   filewalker      resetting progress in database     {"Process": "storagenode"}

I don’t know if this is expected.

I’ve had similar issues with double space counting on all nodes I’ve migrated. When they’ve become artificially filled, my fix has been to delete that file (used_space_per_prefix.db) and restart the node for a fresh filewalker run. It has always fixed it, and I’ve never seen an issue.

You could also wait 1-2 weeks; it will fix itself once the grace period since the most recent completed run is reached. But… I’m not patient enough to wait for ingress :rofl:
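
For reference, on a docker node the procedure is roughly this (a sketch; by default the file sits next to the node’s other databases in the storage location, so adjust the path to your setup):

docker stop -t 300 storagenode
rm /path/to/storage/used_space_per_prefix.db
docker start storagenode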

My log has just updated and now shows a completed filewalker run for satellite 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE. So it does not seem to be an issue with the filewalker hanging. But it obviously took much longer than usual, possibly due to the ongoing migration.

I think I will let the filewalker complete on all satellites before removing the prefix database, and first check whether the completed run has any effect and corrects the used space back to what is really in use.

I will try that if the current filewalker run does not correct it.

Really weird. The other filewalkers have already finished at their usual speeds:

"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "3h44m56.353029932s"}
"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Duration": "5m38.895295216s"
"12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Duration": "219.279928ms"}
"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "50.982391303s"

There is no correction in used space at this time. I don’t know how long this might take.
Given the speed at which the remaining filewalkers were executed, their results must be coming from the badger cache.
So I guess I could try two things: disable the badger cache and restart, or remove the prefix database. As per @mike, deleting used_space_per_prefix.db will probably fix it, so maybe I’ll try disabling badger first…

Badger has been disabled with --pieces.file-stat-cache="" in the run command.
Surprisingly, the filewalker has already finished.

Used space did not correct itself; it got even worse. The “free” space went from 5 GB to -33 GB.

So the next step is to delete used_space_per_prefix.db, or maybe rename it.
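
(Renaming keeps a fallback copy; same stop/start procedure as above, just replacing the delete with something like:)

mv used_space_per_prefix.db used_space_per_prefix.db.bak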

If my memory serves me well, the filewalker doesn’t start again within an hour… or was it more?

It uses used_space_per_prefix.db as a cache; if it sees it and it’s not corrupted or otherwise unusable, it will trust it blindly. So in theory it shouldn’t ever fix itself, but we know that isn’t true either, as most reports indicate that within 1-2 weeks of a completed migration the numbers (the cache file) are updated with the correct sizes. Maybe it’s related to the garbage collector and compactions?

Also, usage numbers double during active migration, because the cache just gets the “new” files being stored in the hashtable appended to it, but it doesn’t pick up on the corresponding piecestore data being deleted.

And because used_space_per_prefix.db is a cache by nature, it should be completely safe to remove it and let the node rebuild its actual, true stored usage. The heavy I/O is in calculating the piecestore usage, not the hashstore part.

Which is weird, because in the logs we see a message like DEBUG blobscache deleted piece {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "disk space freed in bytes": 1792} right before the log line saying that exactly this piece, with that size, has been migrated to the hashstore.
So there should be no reason the used-space data could not be updated accordingly.

However, the whole sequence looks like this in the logs:

2025-08-29T03:17:15Z    DEBUG   db      file not found; nothing to remove       {"Process": "storagenode", "path": "storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/o2/zmr3e3kqvvowlchh3553nwowbdv32sgsdvkxiq6watm2edt6fa"}
2025-08-29T03:17:15Z    DEBUG   blobscache      deleted piece   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "disk space freed in bytes": 18944}
2025-08-29T03:17:15Z    DEBUG   piecemigrate:chore      migrated a piece        {"Process": "storagenode", "sat": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "id": "O2ZMR3E3KQVVOWLCHH3553NWOWBDV32SGSDVKXIQ6WATM2EDT6FA", "size": 18944, "took": "238.359119ms"}

And maybe the part where it says it does not find the file in the database is the reason the data does not get updated correctly? I don’t know.

Should I be concerned about error messages like these:

couldn't migrate        {"Process": "storagenode", "error": "getting the piece header: pieces error: PieceHeader framing field claims impossible size of 64091 bytes"
couldn't migrate        {"Process": "storagenode", "error": "hash mismatch:
couldn't migrate        {"Process": "storagenode", "error": "opening the old reader: pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)"

On an 8-10 TB node I see 10-50 such errors, mostly caused by files in the piecestore being 0 bytes.
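
(For reference, a rough way to count them and to spot the 0-byte blobs; the path is just an example and needs to match your storage location:)

docker logs storagenode | grep -c "couldn't migrate"
find /path/to/storage/blobs -type f -size 0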

These are corrupted pieces, and as @mike already answered, if their number is not significant, it probably will not affect the audit score too much.

I wonder why so many operators report 0-byte pieces? It can’t be that everyone has bad drives and corrupted pieces. Maybe they are corrupted pieces, but not because the hard drive is in bad shape; maybe they are corrupted by other causes. What could those be?
Could the storagenode have some bug that records 0-byte pieces under some special circumstances?

I recall there were some cases that would leave temporary files on disk if a node lost a race while shutting down or doing some other chores. Harmless to the network, and since the piece was not registered by the satellite as stored on the node, it would at some point be garbage-collected anyway, so harmless to the node as well. The alternative would be to do some additional I/O on each upload, so an objectively worse behavior.

Deleting that file and letting the filewalkers finish has freed up some space.

Has the rollout started? Or is it coming in ver. 136?

The migration has started.
