Something weird keeps happening

I have my Storj stuff on a dedicated 20TB drive. The host machine runs Ubuntu 22 with an SSD that holds the OS and apps, including the Storj node (Docker). After a while, the SSD gets a low-disk-space warning from Ubuntu. Shortly afterward, the system won’t boot at all.

This same sequence of events happened first on a ZimaBlade (32GB disk) and then on an Optiplex 5040 (250GB disk).

Before it died, the Zima had 0 bytes remaining. I booted Ubuntu 22 from a USB drive and moved my Gridcoin and BOINC stuff from the ZimaBlade to another PC, freeing up 12GB. Within 3 days, the SSD was full again.

The Zima filled, froze and died on June 12. It had been running since May 27. On June 14, I moved the node over to the living room PC (Optiplex 5040). It filled up and died on the morning of June 18. Later that day, I bought a new 500GB SSD and installed it in an Optiplex 3050. This machine has been online since June 18 evening.

(Something else that’s weird: the Storj node was up to 11TB used on the morning of June 18, but after I moved it to the 3050 it only shows 7.38TB used. During its 4-day life on the 5040, the disk was going nuts and the space used climbed rapidly to slightly under 11TB.)

Both SSDs (Zima and Opti 5040) will no longer boot at all; as best I can tell, they freeze on the “clean” check during startup. I don’t know what’s taking up all the space on the disk, so I can’t delete it. Maybe the drives won’t boot because there’s no room left?

Does Storj use the system disk as temp storage? Is there some huge amount of data that gets written somewhere on the host PC? Or to logs?

I haven’t re-installed anything on the Optiplex 5040 or the ZimaBlade. The only way I can see the files on these systems is to boot via USB and look around on the disk. Is there a log file somewhere that might shed light on what’s going on?

The node ID is 1qvbF3RemiGpvKvz2YBaNL5NomCnfv74hCBPzxdG6QztsCgqjg

Any ideas are appreciated.

What is your log level on the node? These days it takes up a lot of space if the log level is set to info.
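
For reference, here is one way to check and change it; the config path is just an example, adjust it to wherever your node’s config.yaml is mounted:

    # Check the current level in the node's config.yaml (example path)
    grep "log.level" /mnt/storj/storagenode/config.yaml

    # Dropping it to "error", either in config.yaml or by passing
    # --log.level=error after the image name in docker run, cuts the log volume a lot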

As @Vadim suggested, it’s almost certainly going to be logs caused by the huge data ingress we’ve been having.
You’ll need to either change your log levels or set up some form of log rotation :slight_smile:
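
One generic way to do the latter (only a sketch, not Storj-specific) is to enable rotation for Docker’s json-file log driver for all containers by putting something like this in /etc/docker/daemon.json and restarting Docker; merge it with any existing daemon.json instead of overwriting:

    {
      "log-driver": "json-file",
      "log-opts": { "max-size": "100m", "max-file": "3" }
    }

The per-container --log-opt flags discussed further down do the same thing for a single container.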

As others said, your logs filled the container. Either lower the log level or redirect the log to the disk with the data.

To get the space back, remove the container, make the change in your config.yaml and start the container again.
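
Roughly like this; the container name and the settings below are just examples, keep your own run command:

    # Stop and remove the container (the node's data and identity are untouched)
    docker stop -t 300 storagenode
    docker rm storagenode
    # Edit config.yaml on the data mount, e.g. set:
    #   log.level: error
    # and/or write the log to the data disk instead of the container:
    #   log.output: "/app/config/node.log"
    # then start the node again with your usual docker run command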

How can you not find where the space is going? Run du on the top-level directories, find the big one, cd into it, run du again, and so on. But yeah, if you haven’t set up log rotation it’s probably your Docker logs. Good luck!
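
Something like this, starting at the root and drilling into whichever directory turns out to be the biggest (plain shell, nothing Storj-specific):

    # Biggest top-level directories on the root file system (-x stays on one fs)
    sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -n 20
    # Then repeat one level down in the largest directory, e.g.:
    sudo du -xh --max-depth=1 /var 2>/dev/null | sort -h | tail -n 20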

An easier way is to run ncdu (if you’re on Linux).
It’ll traverse the whole file system and show you what is taking up space :slight_smile:
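
For example (ncdu is in the standard Ubuntu repos; -x keeps it on one file system):

    sudo apt install ncdu
    sudo ncdu -x /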

I only hope that you backed up your identity to the disk with the data. Otherwise, if the identity is lost and you do not have a backup, your node is lost too, unfortunately.
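
On Linux a backup can be as simple as the following; the source is the default identity location and the destination is just an example path on the data drive:

    cp -r ~/.local/share/storj/identity/storagenode /mnt/storj/identity-backup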

… or dust, to do everything in one go; it’s available pretty much everywhere.
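
A typical invocation looks something like this (the depth is just a convenient choice):

    # Show the biggest directories, three levels deep, sorted by size
    # (often packaged as du-dust, or install with: cargo install du-dust)
    sudo dust -d 3 /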

I have a backup of the identity. I didn’t know about du. I found the log files in /var/lib/docker and blew away a 16GB file. Now the Zima boots up fine. Ditto for the Opti 5040 (28GB log file).
I still have to set the log file config, test that, and then move the 20TB drive back downstairs to the ZimaBlade.

I have added

    --log-opt max-size=500m \
    --log-opt max-file=5 \

to my docker startup command. 500MB should be enough to keep the relevant data, but I might increase it to 1GB.
Probably best to tweak it to however much free space you have on your device.

I tried the log-opt thing before. Just tried it again and I got this in the logs:
Error: unknown flag: --log-opt
Usage:
storagenode run [flags]

Flags:

I added the options to the command line right before the zksync stuff at the end.

Hmmm. I had 8.07TB stored according to the dashboard, then I stopped the node and restarted it to test the log file change. Now it’s back to 7.34TB. Why???

And what does the disk itself show? I had the same problem, but my disk showed that the HDD was full.
On other nodes it shows that I have a lot of trash, but in reality I can see it has already been deleted.

The disk says 3.7TB free. The dashboard says:
Used 7.35TB
Free 11.54TB
Trash 28.97GB
Overused 0B

You need to add these options before the image name, not after.
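
Roughly like this; everything before the image name is interpreted by Docker, everything after it is passed to storagenode, which is why the flag was rejected there. Mounts, env vars and ports are trimmed here for brevity, keep your own:

    docker run -d --restart unless-stopped --stop-timeout 300 \
        --log-opt max-size=500m \
        --log-opt max-file=5 \
        --name storagenode \
        storjlabs/storagenode:latest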

Because the databases were not updated.
Please search your logs for errors related to the databases. It’s also worth checking them:
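
Checking them usually boils down to running SQLite’s integrity check on each database file while the node is stopped; the path below is an assumption and the linked how-to has the full procedure:

    sudo apt install sqlite3
    # With the node stopped, check every storagenode database file
    for db in /mnt/storj/storagenode/storage/*.db; do
        echo "$db"
        sqlite3 "$db" "PRAGMA integrity_check;"
    done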

You need to use the --si option to get the same units as the dashboard.
Also, the dashboard shows the free space within the allocation, not on the disk.
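
For example (path assumed):

    # --si reports decimal TB/GB like the dashboard; -h would use binary units
    sudo du -s --si /mnt/storj/storagenode/storage/blobs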

The database-malformed check found no problems. The log max-size and max-file options seem to work.

Good to know!
Do you have errors related to databases or filewalkers in your logs?
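
A quick way to check, assuming the container is named storagenode:

    # Count "database is locked" errors and look for any filewalker messages
    docker logs storagenode 2>&1 | grep -c "database is locked"
    docker logs storagenode 2>&1 | grep -iE "filewalker|malformed" | tail -n 20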

Since Saturday morning, I have 2640 “piecestore upload failed / database is locked” errors. I don’t see anything about a filewalker. Is that something separate I have to run with Docker, or is it built into storagenode?

2024-06-22T00:16:13Z ERROR piecestore upload failed {“Process”: “storagenode”, “Piece ID”: “2IHKWX3I6MHJH3VVHSFB6XRS6KFM4HJF4FUVF5WM6YX5FR63XJFQ”, “Satellite ID”: “1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE”, “Action”: “PUT”, “Remote Address”: “109.61.92.75:33284”, “Size”: 249856, “error”: “pieceexpirationdb: database is locked”, “errorVerbose”: “pieceexpirationdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).SetExpiration:111\n\tstorj.io/storj/storagenode/pieces.(*Store).SetExpiration:584\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:486\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:544\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:167\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:109\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:157\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35”}

These are uploads of TTL data that were not registered as pieces with a TTL. That means those pieces will not be auto-deleted when they expire; instead they will be collected later by the garbage collector, moved to trash, and deleted 7 days after that, so up to two weeks in total instead of right when they expire.

So yes, this is still an issue. If you cannot add more RAM to that system, then probably the only solution is to move the databases to a less loaded disk or SSD if you have one. Even a fast, durable USB stick would do, if you don’t mind losing the statistics and history data should it die (it will not affect payouts, though).
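
On a Docker node the move roughly comes down to the following; the paths are examples and the forum how-to on moving databases has the exact steps:

    # Stop and remove the container first (data and identity stay in place)
    docker stop -t 300 storagenode && docker rm storagenode
    # Prepare a directory on the SSD and copy the existing database files over
    mkdir -p /ssd/storj-dbs
    cp /mnt/storj/storagenode/storage/*.db /ssd/storj-dbs/
    # Point storage2.database-dir in config.yaml at the new location, add a
    # matching --mount for /ssd/storj-dbs to the docker run command, and then
    # start the node again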