Disk has now run out of space, even though allocated storage was set?

I have log rotation set up; as stated above, the config is:

docker run -d --restart unless-stopped --stop-timeout 300 \
       (...) \
        --log-opt max-size=20m --log-opt max-file=3 \
        --name "storagenode" storjlabs/storagenode:latest \
        --storage2.monitor.minimum-disk-space=100MB \
        --storage2.piece-scan-on-startup=true \
        --pieces.enable-lazy-filewalker=false
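(If useful, the effective log-rotation options can be double-checked on the running container; this is just a generic docker inspect query, nothing Storj-specific, and it should show the max-size/max-file values set above:

docker inspect storagenode --format '{{ .HostConfig.LogConfig }}'
)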

So I enable the scan on startup, but disable the lazy filewalker (in order to fall back to the non-lazy filewalker).

But, this is the output of docker logs:

root@STORJ3:~# docker logs storagenode 2>/dev/null | head -n 55
downloading storagenode-updater
downloading storagenode
2023-11-10 09:51:11,647 INFO Set uid to user 0 succeeded
2023-11-10 09:51:11,661 INFO RPC interface 'supervisor' initialized
2023-11-10 09:51:11,662 INFO supervisord started with pid 1
2023-11-10 09:51:12,664 INFO spawned: 'processes-exit-eventlistener' with pid 48
2023-11-10 09:51:12,669 INFO spawned: 'storagenode' with pid 49
2023-11-10 09:51:12,672 INFO spawned: 'storagenode-updater' with pid 50
2023-11-10T09:51:12Z    INFO    Configuration loaded    {"Process": "storagenode-updater", "Location": "/app/config/config.yaml"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "operator.wallet-features"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "console.address"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "storage.allocated-bandwidth"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "server.address"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "operator.wallet"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "contact.external-address"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "storage2.database-dir"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "storage.allocated-disk-space"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "server.private-address"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "healthcheck.enabled"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "operator.email"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file key  {"Process": "storagenode-updater", "Key": "healthcheck.details"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file value for key                                                                                 {"Process": "storagenode-updater", "Key": "log.development"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file value for key                                                                                 {"Process": "storagenode-updater", "Key": "log.level"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file value for key                                                                                 {"Process": "storagenode-updater", "Key": "log.output"}
2023-11-10T09:51:12Z    INFO    Invalid configuration file value for key                                                                                 {"Process": "storagenode-updater", "Key": "log.caller"}
2023-11-10T09:51:12Z    INFO    Anonymized tracing enabled      {"Process": "storagenode-updater"}
2023-11-10T09:51:12Z    INFO    Running on version      {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.90.2"}
2023-11-10T09:51:12Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2023-11-10T09:51:13Z    INFO    Configuration loaded    {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-11-10T09:51:13Z    INFO    Anonymized tracing enabled      {"process": "storagenode"}
2023-11-10T09:51:13Z    INFO    Operator email  {"process": "storagenode", "Address": "jwvdveer@gmail.com"}
2023-11-10T09:51:13Z    INFO    Operator wallet {"process": "storagenode", "Address": "0x4A93997a2a8aCd9F95A8362c130D4Dbee8D2Dfe3"}
2023-11-10T09:51:13Z    INFO    server  existing kernel support for server-side tcp fast open detected                                                   {"process": "storagenode"}
2023-11-10T09:51:13Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.90.2"}
2023-11-10T09:51:13Z    INFO    New version is being rolled out but hasn't made it to this node yet                                                      {"Process": "storagenode-updater", "Service": "storagenode"}
2023-11-10T09:51:13Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.90.2"}
2023-11-10T09:51:13Z    INFO    New version is being rolled out but hasn't made it to this node yet                                                      {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2023-11-10T09:51:13Z    INFO    Telemetry enabled       {"process": "storagenode", "instance ID": "127HsBPUCHDawobg1LMDaLNDWDtcHYjov4MYBYBrm88ePm5Tegn"}
2023-11-10T09:51:13Z    INFO    Event collection enabled        {"process": "storagenode", "instance ID": "127HsBPUCHDawobg1LMDaLNDWDtcHYjov4MYBYBrm88ePm5Tegn"}
2023-11-10 09:51:13,957 INFO success: processes-exit-eventlistener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-10 09:51:13,957 INFO success: storagenode entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-10 09:51:13,958 INFO success: storagenode-updater entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-10T09:51:14Z    INFO    db.migration    Database Version        {"process": "storagenode", "version": 54}
2023-11-10T09:51:14Z    INFO    preflight:localtime     start checking local system clock with trusted satellites' system clock.                         {"process": "storagenode"}
2023-11-10T09:51:15Z    INFO    preflight:localtime     local system clock is in sync with trusted satellites' system clock.                             {"process": "storagenode"}
2023-11-10T09:51:15Z    INFO    Node 127HsBPUCHDawobg1LMDaLNDWDtcHYjov4MYBYBrm88ePm5Tegn started                                                         {"process": "storagenode"}
2023-11-10T09:51:15Z    INFO    Public server started on [::]:28967     {"process": "storagenode"}
2023-11-10T09:51:15Z    INFO    bandwidth       Performing bandwidth usage rollups                                                                       {"process": "storagenode"}
2023-11-10T09:51:15Z    INFO    Private server started on 127.0.0.1:7778                                                                                 {"process": "storagenode"}
2023-11-10T09:51:15Z    INFO    trust   Scheduling next refresh {"process": "storagenode", "after": "2h31m41.322289225s"}
2023-11-10T09:51:15Z    INFO    pieces:trash    emptying trash started  {"process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-11-10T09:51:15Z    WARN    piecestore:monitor      Disk space is less than requested. Allocated space is                                            {"process": "storagenode", "bytes": 2173001064320}

root@STORJ3:~# docker logs storagenode 2>&1 | grep -i walk

root@STORJ3:~#

root@STORJ3:~# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.6 165616 10752 ?        Ss   09:45   0:00 /sbin/init
root         116  0.0  0.8  31328 13952 ?        S<s  09:45   0:00 /lib/systemd/systemd-journald
systemd+     148  0.0  0.5  16120  8320 ?        Ss   09:45   0:00 /lib/systemd/systemd-networkd
systemd+     152  0.0  0.8  25532 13556 ?        Ss   09:45   0:00 /lib/systemd/systemd-resolved
root         154  0.0  0.1   9492  2816 ?        Ss   09:45   0:00 /usr/sbin/cron -f -P
message+     155  0.0  0.3   8556  4736 ?        Ss   09:45   0:00 @dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --sy
root         158  0.0  1.0  34300 16128 ?        Ss   09:45   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
syslog       161  0.0  0.3 222400  5248 ?        Ssl  09:45   0:00 /usr/sbin/rsyslogd -n -iNONE
root         171  0.0  0.4  14904  7040 ?        Ss   09:45   0:00 /lib/systemd/systemd-logind
root         242  0.0  2.1 1356508 33528 ?       Ssl  09:45   0:12 /usr/bin/containerd
root         245  0.0  0.1   8396  2432 pts/0    Ss+  09:45   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
root         247  0.0  0.1   8396  2304 pts/1    Ss+  09:45   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
root         249  0.0  0.1   8396  2176 pts/2    Ss+  09:45   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
root         252  0.0  0.1   8396  2176 pts/3    Ss+  09:45   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
root         258  0.0  0.1   8396  2432 pts/4    Ss+  09:45   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud pts/4 115200,38400,9600 vt220
root         260  0.0  0.5  15424  9216 ?        Ss   09:45   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root         291  0.0  4.8 1536856 76540 ?       Ssl  09:45   0:08 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root         380  0.0  0.2  10204  3328 pts/5    Ss   09:45   0:00 /bin/bash
root         618  0.0  0.6 719596 10816 ?        Sl   09:51   0:04 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 23b70e4a2c4bac8ed278e9ba4212fef848
root         639  0.0  0.1  24724  2604 ?        Ss   09:51   0:13 /usr/bin/python2 /usr/bin/supervisord -c /etc/supervisor/supervisord.conf
root         693  0.0  0.0   3736   384 ?        S    09:51   0:00 /bin/bash /bin/stop-supervisor
root         694  6.2  1.6 1256496 26360 ?       Sl   09:51  21:58 /app/storagenode run --config-dir config --identity-dir identity --metrics.app-suffix=-
root         695  0.0  0.2 1242760 3956 ?        Sl   09:51   0:17 /app/storagenode-updater run --binary-location /app/storagenode --config-dir config --i
root        1541  0.0  0.2  12896  3328 pts/5    R+   16:45   0:00 ps -aux

root@STORJ3:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            1536         193        1136           0         206        1342
Swap:              0           0           0

root@STORJ3:~# uptime
 17:11:50 up  7:26,  0 users,  load average: 11.53, 13.34, 13.76

So there's a full log spanning 7+ hours that contains nothing about a filewalker, nor is there any process listed that looks like one, or any significant memory use.

So it seems that disabling the lazy filewalker also disables the fallback (non-lazy) filewalker. But using the lazy filewalker fills the disk to the brim, killing the node.


I have set storage.allocated-disk-space: 500GB as someone proposed, and slowly I am regaining disk space.
Question: when is it safe to set the allocated storage back to 3 TB on my 4 TB disk?
Is it when there is 1 TB of free space?
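For reference, on a Docker node the allocation change is just another run-time option on the container, passed the same way as the other flags near the top of the thread; a sketch only, with example values:

docker run -d (...) storjlabs/storagenode:latest \
        --storage.allocated-disk-space=500GB

Later, once enough space has been reclaimed, the container can be recreated with --storage.allocated-disk-space=3TB (or the storage.allocated-disk-space line in config.yaml can be changed instead).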


I think you need to figure it out yourself, since nobody knows.
I would do a full defrag with UltraDefrag when 1 TB is free, then set it back to the desired value.


I am also having this issue on a Windows GUI installation (dedicated to Storj)! The data disk filled up (way over the allocation), then the Storj service started crashing. Finally, after much google-fu and messing around, I got it running again, but then it had the “Disk space is less than requested.” error… even though the disk is dedicated to Storj. I had to change storage2.monitor.minimum-disk-space to 1MiB to get it to even run.
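For anyone else hitting this, that change is a single line in the node's config.yaml; the path below is the default Windows GUI install location, which may differ on your machine, and the storagenode service has to be restarted afterwards (e.g. via services.msc):

# in "C:\Program Files\Storj\Storage Node\config.yaml" (default path; yours may differ)
storage2.monitor.minimum-disk-space: 1MiB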

(removed non-issue question)

ANYWAY… at the moment I have it set to 500 GB (as suggested), in hopes the filewalker will do its job, delete the trash, and regain the space (1.4TB; originally with 1TB allocated). And hopefully that will allow me to go back to 1.4TB.
Still doesn't explain why this brand-new node would overinflate this much?!?! (and never showed full within the dashboard).

I also have that log getting huge (even after setting it to “warn” only). I wish it would auto-clear itself… but that's minor compared to the rest of the stuff happening!

Did you search the forum? This problem has come up so many times… Actually, there are two problems, because the error has nothing to do with the disk-usage problem. Wait, actually it is one problem, because that error is just normal behaviour.

Yes. I remembered the “unexpected EOF” error is normal behavior… the overfilled bloating of the drive is NOT. (I will edit that error question out. Running on little sleep and much frustration.)

Technically two problems… the inflating disk, way above the allocation, and the log file not cleaning itself… I can live with the log file issue, but not the drive one.

It will not clean itself; you may set up logrotate

or use a custom script:
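If the node logs to a file (log.output pointing at a file), a minimal logrotate sketch for a Linux/Docker setup could look like this, saved as e.g. /etc/logrotate.d/storagenode; the log path is just an assumed example and must match your own log.output setting (on Windows a scheduled script would be the equivalent):

/var/log/storagenode.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}

copytruncate lets logrotate truncate the file in place, so the node keeps writing without having to be restarted or signalled.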

This means that the usage in the databases is incorrect. To fix it you need to have the scan on startup enabled (it's enabled by default) and restart the node, then track the progress of the used-space-filewalker:

Make sure that you do not have errors related to the filewalker (search for "error" and "filewalker") and/or the databases (search for "error" and "database").
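On a Docker node those searches can be run straight against the container log, for example (on a Windows GUI node you would search the log file with findstr or Select-String instead):

docker logs storagenode 2>&1 | grep -i error | grep -i filewalker
docker logs storagenode 2>&1 | grep -i error | grep -i database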

I was getting,

"ERROR piecestore:cache error getting current used space: {“error”: “filewalker: context canceled; filewalker: context canceled”

But no DB errors.

I had most of those settings you linked to already implemented in the .yaml (as well as setting the allocation to 500 GB temporarily). I went through and verified everything was correct and restarted the node; I haven't seen the above error yet. I noticed I've lost 40 GB more in the last 5 hours… the disk only has 60 GB left (reported by Windows). I'm hoping it starts deleting files that aren't needed! I'll report back tomorrow with the results.
Thank you for your assistance.

Usually it means one of two things:

  • either the node is in the process of terminating (so it kills all subprocesses and routines, and you would see a mass of context canceled and similar errors); in that case search for the termination messages earlier in the logs;
  • or the disk is too slow to respond and the process/subroutine was terminated due to a timeout. In this case you need to improve the disk subsystem, which unfortunately isn't always possible.

If you checked the usage on the dashboard: it uses the databases, and if they are not updated, well, it's incorrect.

Yes, it will. However, only after the collector has run (it should delete all TTL data, but if you deleted the databases, that data will only be collected by the garbage collector instead, which means two more weeks), and after the trash-filewalker has run (it deletes expired data from the trash that is older than 7 days).

So, no. After 12 hours, the used space in the dashboard does not match what was “used” before the whole crash and burn. Yes, I did remove the DBs whilst trying to get everything going again (they were in the SSD install folder, not on the HDD data drive, and got deleted while I was trying to get the program to install again…). Oddly enough, my audit score is still 99%+… so I'm not sure if it even had the right amount showing before, or whether the filewalker hasn't actually been running properly. I'm now down to 4 GB free (Windows reporting).

I guess there isn't any way to trigger that to happen sooner (since I deleted the DBs)?

What happens when the drive fills to 100%? Does the node just crash again, and therefore the filewalker also stops working on stuff?

The log shows it's downloading new pieces, the disk activity for that HDD is pinned at 100% in Task Manager, and there are reads/writes happening constantly, so I am assuming things are happening??? It's just not ideal that the whole drive is full when I only gave 3/4 of it to Storj and use the rest for local file storage. Since this node is only a few months old (and my first Windows-based node), I am tempted to delete it and start over in hopes it works better… it would only be a few dollars lost.

I definitely had much better stability with the Docker containers on my servers than with this Windows GUI (when I'm not breaking the file system, at least)!

The lazy filewalkers may run for days. To reduce the time to hours you may disable the lazy mode. However, that requires restarting the node, and the filewalker will start over.
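For reference, this is the same option the OP passed in the docker run command at the top of the thread; on a Windows GUI node the equivalent would be a single line in config.yaml (a sketch, followed by a service restart):

pieces.enable-lazy-filewalker: false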

The reinstallation was not needed. But since the databases were deleted, the stats and history were lost. The history will be received from the trusted satellites after a while, but the current stats will be wrong, including the payout estimation. The storage usage should be recovered by the used-space-filewalker; it just needs to finish without errors, and the stats will be updated if you do not have errors related to the databases. The bandwidth stats won't be recovered for the current period.

If your audit score is less than 100%, it means that your node lost some data (or access to it); it's not related to the filewalkers, as audits are conducted by the satellites.

No. This information was in the database; it also exists on the satellites' side, but it will not be resent. The satellite will just send a Bloom Filter (roughly once a week, maybe more often recently), and the Garbage Collector will use it to move the deleted data to the trash; 7 days later the trash-filewalker will delete it permanently.

We have an emergency protection for this case: if there is less than 5GB left in the allocation/on the disk, the node will send a message to all satellites that it's full.

More like yes. But you can track its progress:
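For example, on a Docker node the used-space-filewalker entries mentioned above can be followed with something like this (on the Windows GUI, search the log file for the same string):

docker logs storagenode 2>&1 | grep -i used-space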