Issue with trash not being deleted for months

Hello,

Today I checked my nodes and found that one of them is filled to the brim (according to Debian). The dashboard also shows it as full, and it has even stopped accepting new inbound traffic. But the dashboard also shows that the satellites report much less used space. This has been going on for almost a year, since the tests were done. I thought I'd let it run and it would fix itself over time. Other nodes deleted their trash flawlessly, but with this node there seems to be an issue. During this time I restarted the node several times due to updates etc.; piece scan is enabled. What can I do to fix this? It's an 18 TB / 16.2 TiB HDD serving not even 6 TB for Storj, since there seems to be undeleted trash, so it isn't reaching its full potential. I also checked the trash folder: nothing special, no old undeleted data. Maybe the data is even uncollected? TiA



Any error messages? File system checked?

Please show the output of df /path/to/storage (exactly, no other flags like -h), and tune2fs -l /dev/sdf1.

1 Like

df /srv/dev-disk-by-uuid-859f8a89-643c-4af4-8420-f0c38afd7415/storj_1
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdf1 17301331420 16417225560 5173280 100% /srv/dev-disk-by-uuid-859f8a89-643c-4af4-8420-f0c38afd7415

tune2fs -l /dev/sdf1
tune2fs 1.46.6 (1-Feb-2023)
Filesystem volume name: <none>
Last mounted on: /app/config
Filesystem UUID: 859f8a89-643c-4af4-8420-f0c38afd7415
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1098645504
Block count: 4394580992
Reserved block count: 219729049
Overhead clusters: 69248137
Free blocks: 221166500
Free inodes: 1000111386
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Sat Apr 23 17:59:25 2022
Last mount time: Tue Jan 28 00:03:21 2025
Last write time: Tue Jan 28 00:03:21 2025
Mount count: 1
Maximum mount count: -1
Last checked: Mon Jan 27 21:41:38 2025
Check interval: 0 (<none>)
Lifetime writes: 28 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: a51ab373-bff1-42e1-802f-7298324b69c8
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0xc7bdd27f

There are no errors, fsck did some repairing, but the issue stays the same.

Almost exactly 5 GB left, which matches the node’s safeguard code not to take all disk space. So this part works.

The node believes it stores 16.1 TB, which also is close enough to actual disk usage, so good.

So indeed it does look like some failure of the garbage collector process. Yep, the next step would be to look for any clues of a misbehaving collector. Do you have any files in the config/retain subdirectory of your node? Could you try collecting a week's worth of storage node logs (that's roughly how often bloom filters are sent) and looking for any reported problems around collectors?
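
For example, something along these lines (the container name "storagenode", the config path and the log file name are placeholders, not the exact values of this node):

 ls -la /path/to/config/retain
 # collect roughly a week of logs into a file for searching
 docker logs --since 168h storagenode > /tmp/storagenode-week.log 2>&1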

Minor thing: the file system has 5% of disk space reserved for root use. You likely don't need this, so you can tune it down to, let's say, 1%. You can change this with tune2fs -m 1.
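
For example, for the filesystem from the output above (this only changes the reservation, no data is touched):

 sudo tune2fs -m 1 /dev/sdf1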

1 Like

If it did some repairs, you need to run it again and repeat until it no longer reports that something was changed.
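
For example (the filesystem must not be mounted while checking, so stop the node and unmount first; the device name is taken from your output):

 sudo umount /dev/sdf1
 sudo fsck -f /dev/sdf1   # repeat until it no longer reports any fixes
 sudo mount /dev/sdf1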

The retain folder is showing the following:

Strangely, it is just this node.

The second run gives me this:

sudo fsck -y /dev/sdf1
fsck from util-linux 2.36.1
e2fsck 1.46.6 (1-Feb-2023)
/dev/sdf1: clean, 98445506/1098645504 files, 4173480084/4394580992 blocks

seems to be fixed by fsck

1 Like

So your node does receive bloom filters. Good. If you are not observing disk space freeing up right now, then for some reason they are not being executed. You need to look into the logs to find out the reason.
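
For example, a rough filter over the log file collected earlier (/tmp/storagenode-week.log is the placeholder name from above; the keywords are guesses at what the relevant messages contain and may differ between node versions):

 grep -iE "retain|collector|filewalker" /tmp/storagenode-week.log | grep -iE "error|fail"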

1 Like

Maybe:

  • stop and rm the storagenode,
  • delete all databases,
  • activate badger cache,
  • activate startup piece scan,
  • disable lazy mode.

Start it and let it finish all piece scans. Watch the retain and trash log entries until they show finished or success.
It will clean itself in a week or two… maybe.
Can’t hurt to try it.
I’ve done this on all my nodes, just to get rid of huge useless databases.
The walkers will recreate the databases up to date, and the bloom filters will clean the trash. Hopefully.
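
A rough sketch of the steps above for a typical docker setup (the container name and paths are placeholders, and it assumes the databases are still in their default location under the storage directory; the exact flags are quoted in a later reply):

 docker stop -t 300 storagenode
 docker rm storagenode
 # with the node stopped, remove the databases so the next piece scan recreates them
 rm /path/to/config/storage/*.db
 # then recreate the container with badger cache on, startup piece scan on and lazy mode off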

2 Likes

Can you tell me how to achieve this running as a docker stack?

 --pieces.file-stat-cache=badger \
 --storage2.piece-scan-on-startup=true \
 --pieces.enable-lazy-filewalker=false
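
In a compose/stack file these usually go under command:, roughly like this (a sketch assuming the usual storjlabs/storagenode image, where extra flags are passed as command arguments; keep the rest of your service definition as it is):

 services:
   storagenode:
     image: storjlabs/storagenode:latest
     # ... your existing environment, volumes and ports ...
     command:
       - --pieces.file-stat-cache=badger
       - --storage2.piece-scan-on-startup=true
       - --pieces.enable-lazy-filewalker=false
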
1 Like

Please don’t do this until snorkel provides an explanation of why this would work.

1 Like

snorkel, are you me?

I have a couple of nodes that are both slow and mounting even slower NFS drives for their data. My "retain" bloom filters would often fail, especially during the high-load August (and I have one that's failing right now). Eventually it finished running the retain process and moving items to garbage.

If the retain/garbage collection is the problem, my vote would be:

  • use badger cache (not a dramatic help, but maybe a little)
  • disable lazy (because lazy filewalkers just plain fail when the storage backend is slow)
  • check your config.yaml for retain.concurrency: 1 (if it's more than 1, make it 1 so the system doesn't try to run multiple retains at once; see the example after this list)
  • if your badger cache is already populated, maybe disable piece scan on startup, because there is probably a retain job waiting to run and the used-space filewalker would just slow it down
  • (if you are turning on badger cache for the first time, you will need to run a full used-space filewalker piece scan to populate it)
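
For reference, the corresponding config.yaml lines would look roughly like this (the option names match the flags quoted elsewhere in this thread; set the piece-scan option according to the badger-cache point above):

 retain.concurrency: 1
 pieces.file-stat-cache: badger
 pieces.enable-lazy-filewalker: false
 storage2.piece-scan-on-startup: false
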
1 Like

If a filewalker fails, there should be some error message. Is the log level set to info or higher?

It’s set to “nothing”, so everything gets logged. I set the log file size to limit how big it grows. How do I change that so I only get the messages I need, and not every up- and download? That would maybe make it easier to debug and find the problem.

I’ve never changed my log level so I cannot confirm this, but try: How to switch log levels - #4 by Erikvv

I would probably try setting it to: “WARN”
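
For example (assuming the standard log.level option; restart the node after changing it):

 # in config.yaml
 log.level: warn

 # or as an extra run flag
 --log.level=warn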

1 Like

Check my post and maybe the entire thread:
https://forum.storj.io/t/log-custom-level/25839/18?u=snorkel

1 Like

I forgot to mention, and I’ll edit my post (again):
when deleting the databases to recreate them with a new scan, you should also delete those 2 extra directories: the file-stat (badger) cache, if badger was already on, and the piece expiration store.
A successful piece scan will recreate the badger dir, the expiration dir and all databases.
If you can’t run a successful piece scan, or a successful retain, after removing all that, your setup has a problem… too slow, USB connections, controllers, I don’t know… maybe a dying disk.
I could run a successful piece scan on a 6 TB node on a machine with 1 GB RAM and 2 nodes running, so compute power is not to blame.
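
For example, with the node stopped (the directory names below are how they appear on my nodes and may differ between versions, so verify them before deleting anything; the config path is a placeholder):

 rm -r /path/to/config/storage/filestatcache
 rm -r /path/to/config/storage/piece_expiration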

1 Like