Debugging space usage discrepancies

I do not have this problem. It really sounds to me like a bunch of people who can’t debug their own problems complain loudly, whereas a silent majority just doesn’t have any problems. :person_shrugging: Maybe you shouldn’t be operating a node?

So, your node has failed to finish any filewalker, and the reason is

This means that your disk is too slow to respond. How is this disk connected to this PC? Is it SMR? What filesystem is on this drive?

Recommendations for how to reduce the response time:

  1. Stop the service
  2. Check the disk for errors and fix them
  3. Perform a defragmentation
  4. Enable automatic defragmentation if it was disabled (it’s enabled by default)
  5. Start the service
  6. Monitor for errors related to the filewalker.
  7. If you still see filewalker errors, disable the lazy mode and enable the scan on startup (if you had disabled it) in your config.yaml, for example:

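The two settings involved typically look like this in config.yaml (option names as commonly discussed for the storagenode; please verify them against your own config file and node version):

    # run the filewalkers with normal I/O priority instead of the lazy mode
    pieces.enable-lazy-filewalker: false
    # scan the used space on startup (this is the default behaviour)
    storage2.piece-scan-on-startup: true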
Save the config and restart the service. It will consume more IOPS than the lazy mode, but it should successfully finish the scan. You need to keep it like this for at least two weeks to allow your node to process two bloom filters for each satellite and move most of the garbage to the trash.
The trash will be cleaned automatically after a week.

1 Like

Not many, only those who did it wrong, I’m sorry. My nodes are working normally.
At the moment these nodes are usually the ones affected:

  • VM
  • FS: exFAT, zfs/BTRFS without a caching device or enough RAM, network filesystems, NTFS under Linux
  • some RAID configurations with parity without proper tuning
  • running multiple nodes on one disk/pool
  • Windows: disabled/never performed defragmentation for NTFS
  • FATAL errors in the log
  • failing disk (cable, power supply, bad blocks)

In all these cases, disabling the lazy mode usually helps (except for FATAL errors or disk errors: those need to be fixed first).

3 Likes

Indeed, I think the problem is that gc-filewalker is being interrupted by other processes (like updating the node).

Add SMR drives to the list, and you’re probably quite complete.

So, indeed:

  • saving the bloom filter to disk
  • iterating the subfolders of every satellite in an ordered fashion and saving the last processed folder, so you don’t have to start from scratch every time (that way the node is able to finish one bloom filter before the next one arrives); see the sketch below.
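To make the second idea concrete, here is a rough sketch of such a checkpointed, ordered walk. This is not the storagenode implementation; the names (walkFrom, the checkpoint file) are hypothetical and only illustrate the resume-from-the-last-folder idea:

    package gcwalk

    import (
        "os"
        "path/filepath"
        "sort"
    )

    // walkFrom visits the piece subfolders of a satellite's blobs directory in
    // sorted order. It skips every folder up to and including the one recorded
    // in the checkpoint file, so an interrupted run can resume instead of
    // starting over from scratch.
    func walkFrom(satelliteDir, checkpointFile string, visit func(folder string) error) error {
        last, _ := os.ReadFile(checkpointFile) // empty on the first run
        entries, err := os.ReadDir(satelliteDir)
        if err != nil {
            return err
        }
        var folders []string
        for _, e := range entries {
            if e.IsDir() {
                folders = append(folders, e.Name())
            }
        }
        sort.Strings(folders)
        for _, folder := range folders {
            if folder <= string(last) {
                continue // already processed before the interruption
            }
            if err := visit(filepath.Join(satelliteDir, folder)); err != nil {
                return err
            }
            // remember the last finished folder so the next run can resume here
            if err := os.WriteFile(checkpointFile, []byte(folder), 0o644); err != nil {
                return err
            }
        }
        _ = os.Remove(checkpointFile) // walk completed; start fresh next time
        return nil
    }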

Besides, I’m wondering whether files are being deleted right away during the gc-filewalk run.

SMR is actually not an issue: these drives are slow on writes, but not on reads, so they should be OK with a lazy filewalker (at least I cannot find evidence of an issue with an SMR drive).

They can, but I cannot confirm that: on my nodes these filewalkers didn’t overlap, except maybe the used-space-filewalker.

1 Like

Well, I can tell you I only have this problem on nodes using an SMR drive.

Besides, deleting a file or moving it to the trash also requires writing (and often also reordering) metadata.

I see. Then I would add SMR disks to the list of setups suspected of a space usage discrepancy.
And the filesystem is zfs, a single drive and no special device?
I would like to know whether SMR is affected too when it’s ext4-formatted.

Yea bros, get going on this!
Story time!

  1. A node of mine, an 8TB disk set for 7TB in config.yaml and stuffed to the brim, got about 1.2TB of free space (I guess some natural deletes occurred), but the ingress didn’t start. I guess it was waiting for the filewalker to finish walking; it finished after ~96h, finally discovered that free space, and started to hoover up some more data, hurray!

  2. Another node of mine has had 174h of uninterrupted online time, BUT it was still in the walking process. It’s a 10TB disk, fully dedicated to STORJ, with only ~320GB of free space left.
    174h ago, I set it to 4TB to stop ingress, because the dashboard thought it had 1TB or more of free space, but it did not!
    So I’m waiting for it to finish the walk, and it was just around the corner, and BAM, at 3:00 am the ISP restarted the router, the VPN had to reconnect and changed the open ports, so I need to update config.yaml. The storagenode process is still running, still walking, but the port is new now and the one it was on is closed. I don’t want to restart storagenode.exe to get back online; the IP is the same. I just wonder whether it needs the port open to report data to the satellites when it finishes the walk, or whether it won’t report … lol. I’m determined to stay offline just to let it finish that walking, God bless it! Lol

It will update databases and report on the next check-in.

1 Like

Can’t you fix those ports so they are remembered on each reconnect? Maybe you can talk to the VPN provider to reserve the ports for you… I don’t know how this works, I don’t use VPNs, so sorry if I sound dumb. :sweat_smile:

For sure, and essentially I’m quite sure the walk just takes too long and is being interrupted by an update/restart of the system or another walk. Systems that are less than 70% full don’t have this particular problem in my case.

1 Like

Beautiful :heart_eyes:

2024-02-16T10:25:04Z    INFO    lazyfilewalker... {... "bloomFilterSize": 4100003}
2 Likes

It depends on the context and where these fields are used. Sorry, I don’t remember exactly. But usually the software differentiates between the raw size of the original file and the size of the encrypted data which is stored.

Usually there is only a difference of a few bytes (the encryption adds a few extra bytes for message authentication; think of it as a very lightweight checksum).

:thinking:

Ah, maybe it’s on the storagenode, where pieces are stored in separate files. The first few bytes contain a header which is included in the total size, but not in the content size.

1 Like

Payment is based on content_size (it includes the encryption overhead, but not the size of the piece header or other technical entries).
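As a rough illustration of that relationship (the 512-byte reserved header area is an assumption used only for this example; the real layout depends on the piece file version):

    // Illustrative sketch, not the storagenode accounting code.
    package sketch

    // assumedPieceHeaderSize is an assumption for this example: the reserved
    // header area at the start of each piece file, counted in the file's total
    // size but not in content_size.
    const assumedPieceHeaderSize = 512

    // contentSize returns the accounted content size for a piece file of the
    // given on-disk size; the encryption overhead stays inside content_size.
    func contentSize(onDiskSize int64) int64 {
        return onDiskSize - assumedPieceHeaderSize
    }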

1 Like