Updates on Test Data

Yes, that sounds like the outcome of the trash cleanup bug. The bug is fixed, but it requires a full used space filewalker run to correct the numbers. The same is needed for the new TTL cleanup bug, so maybe postpone the used space filewalker runs a bit.

As a workaround I will just tell my nodes that they have a 50TB drive each, so that they don’t stop. The free space check should still limit the node to the actual HDD size, because it ignores the size the node believes it has and uses the free space on disk instead.
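A rough sketch of what that looks like, assuming the allocation is set in a config.yaml (the path and container name are just examples):

# claim far more space than the disk actually has; the free space check still applies
sed -i 's/storage\.allocated-disk-space: .*/storage.allocated-disk-space: 50.00 TB/' /path/to/config.yaml
docker restart -t 300 storagenode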


So if we all do this, it will mess up your capacity planning a lot, I guess :smiley:

I doubt that all operators of those 3000 nodes will read this workaround.


No, it doesn’t. The node will report the amount of free space on disk. If you set the allocated space to some extremely high value, you would see this in the logs:

2024-07-01T18:43:18+02:00       WARN    piecestore:monitor      Disk space is less than requested. Allocated space is   {"Process": "storagenode", "bytes": 16782805938716}
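If you want to check whether a node already logs this, something like the following should find it (assuming a docker node named storagenode):

docker logs storagenode 2>&1 | grep "Disk space is less than requested"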

So how sure are you we won’t destroy our nodes? I thought about this workaround too, but I didn’t risk it. What if there is again some bug that will ignore the 5GB limit?

How reliable is this?
Because I had nodes crashing just recently with the drive out of space, and I had to delete the trash manually to free up enough space to be able to restart the node.
The databases were on SSD and the space was set correctly; it was, however, not accounted correctly, I assume because of some bug.
These are running on LVM, and in some cases not all of the drive is allocated to the node - might this be the reason?

I was talking about my nodes. They have plenty of free space available and will not get full in the next few weeks.

If you want some extra protection you can use a code snippet like this:

diskspace=$(curl -s 127.0.0.1:1500$i/api/sno/ | jq .diskSpace)
used=$(echo "$diskspace" | jq .used)
free=[run some df command to find out free space here]
trash=$(echo "$diskspace" | jq .trash)
sed -i "s/storage\.allocated-disk-space: .*/storage.allocated-disk-space: $(($used+$trash+$free)) B/g" [path to your storagenode config]

Maybe subtract something like 100GB from the free space, and then you have your safety margin.
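For the free space line, something like this should work on Linux with GNU df (the mount point is just an example), with the 100GB margin already taken off:

free=$(df -B1 --output=avail /mnt/storagenode | tail -1)   # actual free bytes on the disk
margin=$((100*1000*1000*1000))                             # ~100GB safety margin
# then use $(($used+$trash+$free-$margin)) in the sed line above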


I don’t understand. The workaround I was talking about was for nodes that believe they are full while they still have a few TB of free space on disk. If your node is crashing because it is full, then you might want to reduce your allocation. Why would you want to do the opposite? That would make your situation worse, right?

If there is a free space check that determines the actual free space on the drive, then the nodes shouldn’t be crashing because the drive is full.
My understanding is that this was changed to 5GB recently, so the node should stop accepting uploads as the actual free space on the drive approaches 5GB.
And I’m saying I had nodes that had less space allocated than the actual drive size, and yet they crashed because the drive ran out of space.

Your filesystem might return incorrect data to the storagenode. In that case it would continue accepting more pieces than it should have.

Don’t write other data on the same drive. This sounds more like a topic that was discussed in a few other threads already. Just follow the advice you can read up there. @Alexey might even have some links for you.

Did you have the databases on the same HDD, or only the data?

The databases are on separate SSDs, the LVM caches are on separate SSDs, and the drive was used only for a single storagenode and nothing else, using ext4 with default settings except for -m 0. As a rule of thumb I leave around 5% of the drive (more specifically, of the LV) unused.
So the conclusion would be that this free space check isn’t reliable, just as the trash accounting, the BFs and so on haven’t been. Or maybe it is just me, and in that case please accept my apology.
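For reference, one way to set up roughly that layout, leaving about 5% unallocated at the LVM level (volume group and LV names are just placeholders):

lvcreate -n storj -l 95%FREE vg_data    # leave ~5% of the space unallocated
mkfs.ext4 -m 0 /dev/vg_data/storj       # no reserved blocks for root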

The autostop at 5GB works with the space tracked by the storagenode. All my full nodes show 4.46GB of free space. Regarding the space reported by the OS, I haven’t gotten there yet to test it.
I also have ext4 drives, but they have Synology DSM on them, so the drives are used by other software too. Because of this, I could never allocate the full drive.
Maybe on my Ubuntu machine I could test this, but to risk 22TB of data? :thinking:
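A quick way to compare the two numbers, assuming the default dashboard port and an example mount point:

curl -s 127.0.0.1:14002/api/sno/ | jq .diskSpace    # what the node believes
df -h /volume1/storj                                # what the OS reports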

What’s the risk? Data can either be written to the volume or not.

Wait for this: "storagenode/blobstore: blobstore with caching file stat information (…" · storj/storj@2fceb6c · GitHub


Thanks, but it doesn’t help me:

filestat cache is incompatible with lazy file walker. Please use --pieces.enable-lazy-filewalker=false

If the performance tests are real, then everyone will switch to the "non-lazy" filewalker.
If the (cache) database sits on an SSD, you’ll save a massive amount of I/O.
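If you do switch, the config.yaml equivalent of the flag from the error message above should be:

pieces.enable-lazy-filewalker: false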

That was for the trash filewalker. This one is something new: the TTL data removal doesn’t update the usage, just as it was with the trash filewalker.

Wait, so there is free space from the deleted TTL data, but it’s not reflected on the node itself? So it’s still not accepting data? Will it fix itself then, or do we have to do something? Is there a workaround available if it doesn’t fix itself?

So even version 107 won’t fix this bug? :man_facepalming:t2:
