Updates on Test Data

Yes, that sounds like the outcome of the trash cleanup bug. The bug is fixed, but it requires a full used space filewalker run to correct the numbers. The same is needed for the new TTL cleanup bug, so maybe postpone the used space filewalker runs a bit.

As a workaround I will just tell my nodes that they have a 50TB drive each, so that they don’t stop. The free space check should still limit the node to the actual HDD size, because it ignores the size the node believes it has and uses the free space on disk instead.
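A rough sketch of what that looks like, assuming the allocation is set in a config.yaml (the path and container name are just examples):

# claim far more space than the disk actually has; the free space check still applies
sed -i 's/storage\.allocated-disk-space: .*/storage.allocated-disk-space: 50.00 TB/' /path/to/config.yaml
docker restart -t 300 storagenode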


So if we all do this, it will mess up your capacity planning a lot, I guess :smiley:

I doubt that all operators of those 3000 nodes will read this workaround.


No, it doesn’t. The node will report the amount of free space on disk. If you set the allocated space to some extremely high value, you would see this in the logs:

2024-07-01T18:43:18+02:00       WARN    piecestore:monitor      Disk space is less than requested. Allocated space is   {"Process": "storagenode", "bytes": 16782805938716}
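If you want to check whether a node already logs this, something like the following should find it (assuming a docker node named storagenode):

docker logs storagenode 2>&1 | grep "Disk space is less than requested"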

So how sure are you we won’t destroy our nodes? I thought about this workaround too, but I didn’t risk it. What if there is again some bug that will ignore the 5GB limit?

How reliable is this?
Because I had nodes crashing just recently with the drive out of space, and I had to delete the trash manually to free up enough space to be able to restart the node.
The databases were on SSD and the space was set correctly; it was, however, not accounted correctly, I assume because of some bug.
These are running on LVM, and in some cases not all of the drive is allocated to the node - might this be the reason?

I was talking about my nodes. They have plenty of free space available and will not get full in the next few weeks.

If you want some extra protection you can use a code snippet like this:

diskspace=$(curl -s 127.0.0.1:1500$i/api/sno/ | jq .diskSpace)
used=$(echo "$diskspace" | jq .used)
free=[run some df command to find out free space here]
trash=$(echo "$diskspace" | jq .trash)
sed -i "s/storage\.allocated-disk-space: .*/storage.allocated-disk-space: $(($used+$trash+$free)) B/g" [path to your storagenode config]

Maybe subtract something like 100GB from the free space, and then you have your safety margin.
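For the free space line, something like this should work on Linux with GNU df (the mount point is just an example), with the 100GB margin already taken off:

free=$(df -B1 --output=avail /mnt/storagenode | tail -1)   # actual free bytes on the disk
margin=$((100*1000*1000*1000))                             # ~100GB safety margin
# then use $(($used+$trash+$free-$margin)) in the sed line above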


I don’t understand. The workaround I was talking about was for nodes that believe they are full while they still have a few TB of free space on disk. If your node is crashing because it is full, then you might want to reduce your allocation. Why would you want to do the opposite? That would make your situation worse, right?

If there is a free space check that determines the actual free space on the drive, then the nodes shouldn’t be crashing because the drive is full.
My understanding is that this was changed to 5GB recently, so the node should stop accepting uploads as the actual free space on the drive approaches 5GB.
And I’m saying I had nodes that had less space allocated than the actual drive size, and yet they crashed because the drive ran out of space.

Your filesystem might return incorrect data to the storagenode. In that case it would continue accepting more pieces than it should have.

Don’t write other data on the same drive. This sounds more like a topic that was discussed in a few other threads already. Just follow the advice you can read up there. @Alexey might even have some links for you.

Did you have the databases on the same HDD, or only the data?

The databases are on separate SSDs, the LVM caches are on separate SSDs, and the drive was used only for a single storagenode and nothing else, using ext4 with default settings except for -m 0. As a rule of thumb I leave around 5% of the drive (more specifically, of the LV) unused.
So the conclusion would be that this free space check isn’t reliable, just as the trash accounting, the BFs and so on haven’t been. Or maybe it is just me, and in that case please accept my apology.
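For reference, one way to set up roughly that layout, leaving about 5% unallocated at the LVM level (volume group and LV names are just placeholders):

lvcreate -n storj -l 95%FREE vg_data    # leave ~5% of the space unallocated
mkfs.ext4 -m 0 /dev/vg_data/storj       # no reserved blocks for root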

The autostop at 5GB works with the space tracked by the storagenode. All my full nodes show 4.46GB of free space. Regarding the space reported by the OS, I haven’t gotten there yet to test it.
I also have ext4 drives, but they have Synology DSM on them, so the drives are used by other software too. Because of this, I could never allocate the full drive.
Maybe on my Ubuntu machine I could test this, but to risk 22TB of data? :thinking:
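A quick way to compare the two numbers, assuming the default dashboard port and an example mount point:

curl -s 127.0.0.1:14002/api/sno/ | jq .diskSpace    # what the node believes
df -h /volume1/storj                                # what the OS reports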

What’s the risk? Data can either be written to the volume or not.

Wait for this: "storagenode/blobstore: blobstore with caching file stat information (…" · storj/storj@2fceb6c · GitHub


Thanks, but it doesn’t help me:

filestat cache is incompatible with lazy file walker. Please use --pieces.enable-lazy-filewalker=false

If the performance tests are real, then everyone will switch to the "non-lazy" filewalker.
If the (cache) database sits on an SSD, you’ll save a massive amount of I/O.
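If you do switch, the config.yaml equivalent of the flag from the error message above should be:

pieces.enable-lazy-filewalker: false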

That was for the trash filewalker. This one is something new: the TTL data removal doesn’t update the usage, just as it was with the trash filewalker.

Wait, so there is free space from the deleted TTL data, but it’s not reflected on the node itself? So it’s still not accepting data? Will it fix itself then, or do we have to do something? Is there a workaround available if it doesn’t fix itself?

So even version 107 won’t fix this bug? :man_facepalming:t2:
