Yes, that sounds like the outcome of the trash cleanup bug. The bug is fixed, but it requires a full used space filewalker run to correct the numbers. And the same is needed for the new TTL cleanup bug. So maybe postpone the used space filewalker runs a bit.
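In case anyone wants to control when that run happens: as far as I know the startup used space scan can be toggled with the storage2.piece-scan-on-startup option (name taken from my config.yaml, please double check against yours), for example as an extra flag at the end of the docker run command:

    # sketch: postpone the scan for now ("..." stands for your usual docker options)
    docker run ... storjlabs/storagenode:latest --storage2.piece-scan-on-startup=false
    # later, set it back to true and restart once so the used space numbers get corrected
    docker run ... storjlabs/storagenode:latest --storage2.piece-scan-on-startup=true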
As a workaround I will just tell my nodes that they have a 50TB drive each. That way they don’t stop. There should still be a free space check that limits usage to the actual HDD size; it would ignore the size the node believes it has and use the actual free space on disk instead.
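For reference, this is roughly how I set the oversized allocation on a docker setup (the value is just my example, adjust to your own run command or config.yaml):

    # either via the environment variable in the docker run command
    -e STORAGE="50TB"
    # or via the equivalent config option / command line flag
    --storage.allocated-disk-space="50TB"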
No, it doesn’t. The node will report the amount of free space on disk. If you set the allocated space to some extremely high value, you would see this in the logs:
2024-07-01T18:43:18+02:00 WARN piecestore:monitor Disk space is less than requested. Allocated space is {"Process": "storagenode", "bytes": 16782805938716}
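You can check whether your node is already hitting that check by searching the logs for this warning (the container name below is just the common default, adjust to yours):

    docker logs storagenode 2>&1 | grep "Disk space is less than requested"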
So how sure are you that we won’t destroy our nodes? I thought about this workaround too, but I didn’t risk it. What if there is again some bug that ignores the 5GB limit?
How reliable is this?
Because I had nodes crashing just recently with the drive out of space, and I had to delete the trash manually to free up enough space to be able to restart the node.
Databases were on SSD and space was set correctly; however, it was not accounted correctly, I assume because of some bug.
These are running on LVM and in some cases not all of the drive is allocated for the node - might this be the reason?
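For anyone in the same situation, it can help to first see how much space the trash is actually holding before deleting anything; a rough check under the usual layout would be (path is just an example, adjust to your mount point):

    du -sh /mnt/storagenode/storage/trash/*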
I don’t understand. The workaround I was talking about was for nodes that believe they are full when they still have a few TB of free space on disk. If your node is crashing because it is full, then you might want to reduce your allocation. Why would you want to do the opposite? That would make your situation worse, right?
If there is a free space check to determine the actual free space on the drive, then the nodes shouldn’t be crashing because of the drive being full.
My understanding is this was changed to 5GB recently, so the node should stop accepting uploads as the actual free space on the drive approaches 5GB.
And I’m saying I had nodes with less space allocated than the actual drive size, and yet they crashed because the drive ran out of space.
Your filesystem might return incorrect data to the storagenode. In that case it would continue accepting more pieces than it should have.
Don’t write other data on the same drive. This sounds more like a topic that was discussed in a few other threads already. Just follow the advice you can read up there. @Alexey might even have some links for you.
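If you want to see what the filesystem actually reports, you can compare the OS numbers with what the node shows; the mount point below is just an example:

    df -B1 /mnt/storagenode    # total/used/available in bytes as the OS sees them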
Databases are on separate SSDs, LVM caches are on separate SSDs, and the drive was used only for a single storagenode and nothing else, with ext4 at default settings except for -m 0. As a rule of thumb I leave around 5% of the drive (more specifically, of the LV) unused.
So the conclusion would be that this free space check isn’t reliable, just as the trash accounting, the BFs and so on haven’t been. Or maybe it is just me, and in that case please accept my apology.
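For context, the layout is roughly equivalent to this (VG/LV names are placeholders, not my real ones):

    # leave ~5% of the space unallocated, and no reserved blocks for root
    lvcreate -n storj1 -l 95%FREE vg_storj
    mkfs.ext4 -m 0 /dev/vg_storj/storj1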
The autostop at 5GB works with the space tracked by the storagenode. All my full nodes show 4.46GB of free space. Regarding the space reported by the OS, I haven’t gotten around to testing that yet.
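For anyone who wants to compare the node-tracked numbers themselves, the local dashboard API shows what the node believes (default dashboard port, and I’m quoting the field name from memory, so double check):

    curl -s http://localhost:14002/api/sno | jq .diskSpace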
I also have ext4 drives, but they have Synology DSM on them, so the drives are used by other software too. Because of this, I could never allocate the full drive.
Maybe on my Ubuntu machine I could test this, but to risk 22TB of data?
If the performance tests are real, then everyone will switch to the “non-lazy” filewalker.
If the (cache) database sits on an SSD, you’ll save a massive amount of I/Os.
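The relevant options, as far as I know (names taken from my config.yaml, and the database path is just an example):

    # run the filewalkers at normal priority instead of the lazy low-priority mode
    --pieces.enable-lazy-filewalker=false
    # keep the databases on a separate (SSD) path
    --storage2.database-dir=/mnt/ssd/storagenode-dbs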
Wait, so there is free space from the deleted TTL data, but it’s not updated on the node itself? So it’s still not accepting data? Will it fix itself, or do we have to do something? Is there a workaround available if it doesn’t fix itself?