It seems the storagenode has a hard time staying within limits and interprets the amount of allocated space very loosely, constantly overshooting the set limit. I don’t want to babysit it and dynamically stop ingress by lowering the allocated space whenever I see it using too much.
I’ve read recently that there is a safety buffer of 5 GB of free space that the storagenode will maintain no matter what.
So, what if I set a quota on the dataset to the amount I actually want the storagenode not to exceed, the same amount I already told it in the config file, as a way to enforce the agreed-upon amount of configured space?
I’d just configure it to leave a bit more space, so there’s some room to go over. If your quota prevents it from writing… don’t nodes have a periodic storage-writeability check, and if that fails (or takes longer than a minute) the node shuts down?
You can try a quota, but don’t be surprised if your node periodically stabs itself in the heart.
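For reference, I believe that check is controlled by settings along these lines in `config.yaml` (parameter names and defaults from memory, so treat this as a sketch and verify against your node’s version):

```yaml
# Assumed parameter names/defaults for the storage-directory writeability check;
# verify with your node's config.yaml or `storagenode setup --help`.
# how often the node verifies that the storage directory is writable
storage2.monitor.verify-dir-writable-interval: 5m0s
# how long a single check may take before the node gives up and shuts down
storage2.monitor.verify-dir-writable-timeout: 1m0s
```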
No, it should work; at least, all SNOs with outdated usage in their databases reported the same: the node stops ingress when around 5 GB of free space remains.
And it seems your node has this issue with outdated databases too (the data on the pie chart does not match the disk usage).
Could you please check?
I checked; there are no database-related issues in the logs. There are a few filewalkers that did not complete, but that’s expected, as I rebooted the node after changing the configuration and on update.
In fact, a few days after I artificially lowered the allocated space to stop ingress, it’s lagging in the other direction: the pie chart shows 27.8 TB used, while the data on disk is 26.4 TB. (The limit was never set higher than 27 TB.)
It seems there is some lag: it takes a while for the dashboard to catch up with the actual disk usage, and until it does, the node can overshoot during high ingress.
Perhaps the used-space filewalker needs to run more often to reduce the discrepancy?
Either way, I’ll set quotas, and that should take care of this corner case; see the sketch below.
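Concretely, on ZFS it would look roughly like this (hypothetical pool/dataset name, and leaving the headroom suggested above so the node hits its own limit before the quota does):

```sh
# Hypothetical pool/dataset name; leave some headroom above the node's own
# allocation (e.g. storage.allocated-disk-space: 27.00 TB in config.yaml)
# so the quota acts as a backstop rather than the first limit the node hits.
zfs set quota=28T tank/storagenode

# confirm the quota and current usage
zfs get quota,used,available tank/storagenode
```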
You are correct: the databases are updated regularly, but the dashboard re-reads them less often, so there is always a lag. And I think the overusage problem could be that the node didn’t report to the satellites in time that it was full.
It could be. However, in lazy mode it could run for days, especially when you have high ingress: the filewalker will be de-prioritized by the OS.
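If the scan speed is the concern, one option is to run the used-space scan at normal I/O priority on startup (assuming these are still the right parameter names for your version):

```yaml
# Assumed parameter names; check `storagenode setup --help` for your version.
# run filewalkers at normal I/O priority instead of lazy mode
pieces.enable-lazy-filewalker: false
# scan the stored pieces on every startup to refresh the used-space accounting
storage2.piece-scan-on-startup: true
```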
I think it could help to reduce the interval at which the backend re-reads data from the databases, but I do not think we have that parametrized.