Today I checked my nodes and found that one of them is filled to the brim (according to Debian). The dashboard also shows it as full, and it has even stopped accepting new inbound traffic. But the dashboard shows that the satellites report much less used space. This has been going on for almost a year, since the tests were done. I thought: let it run, it'll fix itself over time. Other nodes deleted their trash flawlessly, but with this node there seems to be an issue. In that time I restarted the node several times due to updates etc. Piece scan is enabled. What can I do to fix this? It's an 18 TB / 16.2 TiB HDD serving not even 6 TB for Storj, since there seems to be undeleted trash keeping it from reaching its full potential. I also checked the trash folder, nothing special, no old undeleted data. Maybe the data is even uncollected? TiA
Almost exactly 5 GB left, which matches the node's built-in safeguard against taking all the disk space. So this part works.
The node believes it stores 16.1 TB, which is also close enough to the actual disk usage, so good.
So indeed it does look like some failure of the garbage collector process. Yep, the next step would be to try looking for any clues of a misbehaving collector. Do you have any files in the config/retain subdirectory of your node? Could you try collecting a week (that’s roughly how often bloom filters are sent) worth of storage node logs and look for any reported problems around collectors?
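For example, something like this (a rough sketch that assumes a Docker node named storagenode and a storage location under /path/to/config; adjust names and paths to your setup):

    # any bloom filters sitting in the retain directory waiting to be processed?
    ls -l /path/to/config/retain

    # scan roughly a week of logs for problems around retain / garbage collection
    docker logs storagenode --since 168h 2>&1 | grep -iE "retain|collector|gc" | grep -iE "error|fail"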
Minor thing: the file system has 5% of its disk space reserved for root use. You likely don't need this, so you can tune it down to, let's say, 1%. You can change this with tune2fs -m 1.
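For example (assuming an ext4 filesystem and that the data partition is /dev/sdb1; replace it with your actual device):

    # lower the reserved-for-root blocks from the default 5% to 1%
    sudo tune2fs -m 1 /dev/sdb1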
So your node does receive bloom filters. Good. If you are not seeing disk space being freed up right now, then for some reason they are not being executed. You need to look into the logs to find out the reason.
Start the node and let it finish all piece scans. Watch the retain and trash entries in the logs until they show finished or success.
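Something along these lines (just a sketch; it assumes the node logs to /var/log/storagenode.log, so point it at wherever your node actually writes its log):

    # follow the log and keep only the lines about the scans, retain and trash
    tail -f /var/log/storagenode.log | grep -iE "filewalker|used-space|retain|trash"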
It will clean itself in a week or two… maybe.
Can’t hurt to try it.
I’ve done this on all my nodes, just to get rid of huge useless databases.
The walkers will create new, up-to-date db-es, and the bloom filters will clean the trash. Hopefully.
I have a couple of nodes that are both slow and mounting even slower NFS drives for their data. My "retain" bloom filters would often fail, especially during the high-load August (but I have one that is failing right now). Eventually it finished running the retain process and moving items to garbage.
If retain/garbage collection is the problem, my vote would be (see the config sketch after this list):
use badger cache (not a dramatic help, but maybe a little)
disable lazy (because the lazy filewalkers just plain fail when the storage backend is slow)
check your config.yaml for retain.concurrency: 1 (if it’s more than 1, make it 1 so the system doesn’t try multiple at once)
if your badger cache is already populated, maybe disable the piece scan on startup, because there is probably a retain job waiting to run and the used-space filewalker would just slow it down.
(if you are turning on the badger cache for the first time, then you will need to run a full piece scan / used-space filewalker to populate it)
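A rough config.yaml sketch of those points (retain.concurrency comes straight from the list above; the other key names are from memory for recent node versions, so double-check them against the commented defaults in your own config.yaml or `storagenode setup --help` before relying on them):

    # badger file-stat cache (needs one full piece scan to populate it)
    pieces.file-stat-cache: badger
    # run the filewalkers in the main process instead of the lazy subprocess
    pieces.enable-lazy-filewalker: false
    # only one retain (garbage collection) job at a time
    retain.concurrency: 1
    # once the badger cache is populated, the startup scan can be skipped
    storage2.piece-scan-on-startup: false

The node needs a restart after changing config.yaml for any of this to take effect.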
It's set to "nothing", so everything gets logged. I set the log file size to limit the file size. How do I change that so I only get the messages I need, and not every up- and download? That would maybe make it easier to debug and find the problem.
I forgot to mention, and I'll edit my post (again):
when deleting db-es to recreate them with a new scan, you should also delete those 2 extra directories: the filestore cache (if badger was already on) and the piece expiration store.
A successful piece scan will recreate the badger dir, the expirations dir and all db-es.
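Roughly like this (only a sketch; the exact file and directory names depend on the node version and your storage layout, so check what actually exists under your storage location before deleting anything, and only do it with the node stopped):

    # stop the node gracefully first
    docker stop -t 300 storagenode

    # assumption: adjust this to your node's storage location
    STORAGE=/mnt/storj/storagenode/storage

    # the databases, the badger file-stat cache and the piece-expiration store
    rm "$STORAGE"/*.db
    rm -r "$STORAGE"/filestatcache
    rm -r "$STORAGE"/piece_expirations

    # start the node again; the piece scan rebuilds everything that was removed
    docker start storagenode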
If you can't run a successful piece scan, or a successful retain after removing all that, your setup has a problem… too slow, USB connections, controllers, I don't know… maybe the disk is dying.
I could run a successful piece scan on a 6 TB node, on a machine with 1 GB RAM and 2 nodes running. So compute power is not to blame.