Cleanup before Hashstore and migration to it

Yes, it should be. But others have reported that after a restart the node may double-count the used storage, so it could report itself to the satellites as full, which would stop ingress.
The workaround is known - delete the prefixes database and restart.
It’s not fully confirmed though.

If we can confirm it, we can submit a bug.

Maybe some just don’t notice the issue because they have so much free and assigned space that temporary overusage does not stop ingress completely.

Maybe. I do not know exactly, and you likely don’t either. We need exact steps to reproduce it. Then it will be fixed immediately.

For me:
Node slowly grew in size whilst conversion was running.
A restart doubled the reported size. A subsequent restart added the original size of the node again. Ingress stopped when the node size exceeded the max node size.
Stop node. Delete used_by_prefix database. Start node. Node size returned to its physical size.
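
For anyone who wants to try the same workaround, here is a minimal sketch of the "move the database aside" step. It assumes the file is called used_space_per_prefix.db (the name mentioned further down in this thread) and that it sits directly in the storage directory; both may differ on your setup. The function name backup_prefix_db and the backup path are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: move the per-prefix used-space database out of the way
# instead of deleting it, so it can be restored if needed.
# Assumption: the database lives in the storage directory under this name.

backup_prefix_db() {
    local storage_dir="$1"   # e.g. /mnt/STORJ-1/storage
    local backup_dir="$2"    # any folder outside the storage directory
    local db="$storage_dir/used_space_per_prefix.db"

    if [ ! -f "$db" ]; then
        echo "not found: $db" >&2
        return 1
    fi
    mkdir -p "$backup_dir" || return 1
    # Moving keeps a copy; the node recreates the database on the next start.
    mv "$db" "$backup_dir/"
}

# Usage (the node must be stopped while the file is moved):
#   docker stop -t 300 STORJ-1
#   backup_prefix_db /mnt/STORJ-1/storage /mnt/STORJ-1/db-backup
#   docker start STORJ-1
```

Moving the file rather than deleting it is just a precaution; the result for the node should be the same.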

After Migration do you still recommend:

  1. The lazy filewalker: # run garbage collection and used-space calculation filewalkers as a separate subprocess with lower IO priority
    pieces.enable-lazy-filewalker: true
  2. The old startup-scan: # if set to true, all pieces disk usage is recalculated on startup
    storage2.piece-scan-on-startup: true

Should both be set to true? Does the startup scan also go through all the hashstore logs and restore the databases from them?
And does the lazy filewalker already work with the hashstore logs?

On another Storj node without migration I unfortunately also see a lot of trash. I have now deleted used_space_per_prefix.db (after backing it up in another folder, of course), and it has been recreated. The startup scan and the lazy filewalker are also running on it.

Both can be in any state or commented out to have their defaults.

What do you call trash? The value on the dashboard or the folder on the disk?
Could you please compare the size of the trash folder with the value on the dashboard?

The trash on the dashboard. I am now checking the folder sizes on the disk with
du -h --max-depth=1
but it takes some time…will come back soon.

Remember that after migration, trash is not stored in the trash folder but is part of the hashstore files. So counting that path will likely give you only part of the complete used space for trash.

So the trash folder would no longer be needed either. Do you know if piece_expirations is still being used? It should also be obsolete, right?

I assume it would become obsolete once migration is completed. Hashstore handles expiration by grouping pieces with similar expiration dates together.

Another one who noticed that can be investigated:

This new hashstore is super great. I noticed it right from the start.

You can even track the progress of the active migration by running "du -h --max-depth=1 /mnt/STORJ-1/storage/hashstore/", which totals the size of the files there.

I have now switched all my nodes to passive hashstore migration. Some are both passive and active and are rewriting the files. Unfortunately it takes months… but it is what it is, and there is no other way.

A practical tip for the migration: don’t delete the blobs folder itself, only its contents. Otherwise you currently get an error message:

[screenshot of the error message]

But the solution is also easy, just manually recreate the blobs folder in node/storage.

Yes.. you did go a little overboard on the spring-cleaning :wink:

Unfortunately I’m also seeing a higher cancel-rate on the hashstore nodes.

Here are two new nodes with piecestore:

docker logs STORJ-1 2>&1 | grep uploaded | grep -v REPAIR | wc -l
docker logs STORJ-1 2>&1 | grep uploaded | grep REPAIR | wc -l
docker logs STORJ-1 2>&1 | grep "upload canceled" | grep -v REPAIR | wc -l
docker logs STORJ-1 2>&1 | grep "upload canceled" | grep REPAIR | wc -l

docker logs STORJ-2 2>&1 | grep uploaded | grep -v REPAIR | wc -l
docker logs STORJ-2 2>&1 | grep uploaded | grep REPAIR | wc -l
docker logs STORJ-2 2>&1 | grep "upload canceled" | grep -v REPAIR | wc -l
docker logs STORJ-2 2>&1 | grep "upload canceled" | grep REPAIR | wc -l

Here the results:

2367
572
2
3

5125
3616
1
6

Basically a 0% cancel rate. Now two newer nodes, fully on hashstore:

3008
742
71
23

2179
1188
64
33

As you can see, it’s 2% to 3%. Most of my nodes are at 2% to 3%; I haven’t seen 10% to 30% yet.
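
To make the percentages reproducible, here is a small sketch of how the rate can be computed from the counts above. It assumes the cancel rate is defined as canceled / (uploaded + canceled); the post does not state the exact formula, and the helper name cancel_rate is made up for illustration.

```shell
#!/usr/bin/env bash
# cancel_rate CANCELED UPLOADED
# Prints the share of canceled uploads in percent, with two decimals.
# Assumption: rate = canceled / (uploaded + canceled).
cancel_rate() {
    awk -v c="$1" -v u="$2" 'BEGIN {
        total = c + u
        if (total == 0) { print "0.00"; exit }   # avoid division by zero
        printf "%.2f\n", 100 * c / total
    }'
}

# Counts taken from the results listed above:
cancel_rate 2 2367     # first piecestore node, normal uploads
cancel_rate 71 3008    # first hashstore node, normal uploads
cancel_rate 23 742     # first hashstore node, repair uploads
```

With these inputs the piecestore node comes out below 0.1%, and the hashstore node at roughly 2.3% (normal) and 3.0% (repair), which matches the 2% to 3% estimate.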

All four new nodes are on SSDs, so it’s not related to possibly poor HDD I/O performance.

Now I want to activate the memtbl.

Just put this into the config file:
hashstore.table-default-kind: memtbl
hashstore.memtbl.mmap: true
hashstore.memtbl.max-size: 128MiB
hashstore.compaction.rewrite-multiple: 10
hashstore.compaction.probability-power: 2

Can someone explain the parameters? It should take 1.3 GB of RAM per TB.
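
As a rough sizing aid, the rule of thumb above can be turned into a quick calculation. This only multiplies the stored data by the 1.3 GB per TB figure quoted in this post; that figure is not verified here, and the helper name memtbl_ram_gb is made up for illustration.

```shell
#!/usr/bin/env bash
# memtbl_ram_gb STORED_TB
# Prints the estimated RAM in GB for memtbl, one decimal.
# Assumption: 1.3 GB of RAM per TB of stored data, as quoted in the thread.
memtbl_ram_gb() {
    awk -v tb="$1" 'BEGIN { printf "%.1f\n", tb * 1.3 }'
}

memtbl_ram_gb 8    # estimate for a node holding 8 TB
```

If the rule of thumb holds, a node with 8 TB stored would need about 10.4 GB of RAM, so it is worth checking this against the available memory before switching.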

Perhaps memtbl will fix the cancel rate. But I doubt it, as the SSD I/O performance is very high and the SSD is mostly idle in terms of I/O.

Somehow I’m getting READ FPDMA QUEUED error messages in the journal:

Should I disable NCQ? According to ChatGPT, it’s likely related to that.

That’s why you don’t rely on AI for actual troubleshooting.

Your disk returned an IO error (second to last line). Your disk and/or cable is failing to return proper data. Post your complete SMART so we can troubleshoot it.

AI is super great for gaining knowledge and troubleshooting. Hopefully it stays and advances, even though its maintenance cost for the maintainer must be very high.

The SMART values look fine:

smartctl -x /dev/sdb