Something weird keeps happening

I have my Storj stuff on a dedicated 20TB drive. The host machine runs Ubuntu 22 with an SSD that holds the OS and apps, including the Storj node (Docker). After a while, the SSD gets a low-disk-space warning from Ubuntu. Shortly afterward, the system won’t boot at all.

This same sequence of events happened first on a ZimaBlade (32GB disk) and then on an Optiplex 5040 (250GB disk).

Before it died, the Zima had 0 bytes remaining. I booted Ubuntu 22 from a USB drive and moved my Gridcoin and BOINC stuff from the ZimaBlade to another PC, freeing up 12GB. Within 3 days, the SSD was full again.

The Zima filled, froze and died on June 12. It had been running since May 27. On June 14, I moved the node over to the living room PC (Optiplex 5040). It filled up and died on the morning of June 18. Later that day, I bought a new 500GB SSD and installed it in an Optiplex 3050. This machine has been online since June 18 evening.

(Something else that’s weird: the Storj node was up to 11TB used on the morning of June 18, but after I moved it to the 3050 it only shows 7.38TB used. During its 4-day life on the 5040, the disk was going nuts and the space used climbed rapidly to slightly under 11TB.)

Both SSDs (Zima and Opti 5040) will no longer boot at all; as best I can tell, they freeze on the “clean” check during startup. I don’t know what’s taking up all the space on the disk, so I can’t delete it. Maybe the drives won’t boot because there’s no room left?

Does Storj use the system disk as temp storage? Is there some huge amount of data that gets written somewhere on the host PC? Or to logs?

I haven’t re-installed anything on the Optiplex 5040 or the ZimaBlade. The only way I can see the files on these systems is to boot via USB and look around on the disk. Is there a log file somewhere that might shed light on what’s going on?

The node ID is 1qvbF3RemiGpvKvz2YBaNL5NomCnfv74hCBPzxdG6QztsCgqjg

Any ideas are appreciated.

What is your log level on the node? These days it takes up a lot of space if the log level is set to info.
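
For reference, here is one way to check and change it; the config path is just an example, adjust it to wherever your node’s config.yaml is mounted:

    # Check the current level in the node's config.yaml (example path)
    grep "log.level" /mnt/storj/storagenode/config.yaml

    # Dropping it to "error", either in config.yaml or by passing
    # --log.level=error after the image name in docker run, cuts the log volume a lot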

As @Vadim suggested, it’s almost certainly going to be logs caused by the huge data ingress we’ve been having.
You’ll need to either change your log levels or set up some form of log rotation :slight_smile:
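
One generic way to do the latter (only a sketch, not Storj-specific) is to enable rotation for Docker’s json-file log driver for all containers by putting something like this in /etc/docker/daemon.json and restarting Docker; merge it with any existing daemon.json instead of overwriting:

    {
      "log-driver": "json-file",
      "log-opts": { "max-size": "100m", "max-file": "3" }
    }

The per-container --log-opt flags discussed further down do the same thing for a single container.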

As others said, your logs filled the container. Either lower the log level or redirect the log to the disk with the data.

To get the space back, remove the container, make the change in your config.yaml and start the container again.
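
Roughly like this; the container name and the settings below are just examples, keep your own run command:

    # Stop and remove the container (the node's data and identity are untouched)
    docker stop -t 300 storagenode
    docker rm storagenode
    # Edit config.yaml on the data mount, e.g. set:
    #   log.level: error
    # and/or write the log to the data disk instead of the container:
    #   log.output: "/app/config/node.log"
    # then start the node again with your usual docker run command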

How can you not find where the space is going? Run du on the top-level directories, find the big one, cd into it, run du again, and so on. But yeah, if you haven’t set up log rotation it’s probably your Docker logs. Good luck!
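
Something like this, starting at the root and drilling into whichever directory turns out to be the biggest (plain shell, nothing Storj-specific):

    # Biggest top-level directories on the root file system (-x stays on one fs)
    sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -n 20
    # Then repeat one level down in the largest directory, e.g.:
    sudo du -xh --max-depth=1 /var 2>/dev/null | sort -h | tail -n 20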

An easier way is to run ncdu (if you’re on Linux).
It’ll traverse the whole file system and show you what is taking up space :slight_smile:
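
For example (ncdu is in the standard Ubuntu repos; -x keeps it on one file system):

    sudo apt install ncdu
    sudo ncdu -x /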

I only hope that you backed up your identity to the disk with the data. Otherwise, if the identity is lost and you do not have a backup, your node is lost too, unfortunately.
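
On Linux a backup can be as simple as the following; the source is the default identity location and the destination is just an example path on the data drive:

    cp -r ~/.local/share/storj/identity/storagenode /mnt/storj/identity-backup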

… or dust, to do everything in one go; it’s available pretty much everywhere.
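
A typical invocation looks something like this (the depth is just a convenient choice):

    # Show the biggest directories, three levels deep, sorted by size
    # (often packaged as du-dust, or install with: cargo install du-dust)
    sudo dust -d 3 /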

I have a backup of the identity. I didn’t know about du. I found the log files in /var/lib/docker and blew away a 16GB file. Now the Zima boots up fine. Ditto for the Opti 5040 (28GB log file).
I still have to set the log file config, test that, and then move the 20TB drive back downstairs to the ZimaBlade.

I have added

    --log-opt max-size=500m \
    --log-opt max-file=5 \

to my docker startup command. 500MB should be enough to keep the relevant data, but I might increase it to 1GB.
Probably best to tweak it to however much free space you have on your device.

I tried the log-opt thing before. Just tried it again and I got this in the logs:
Error: unknown flag: --log-opt
Usage:
storagenode run [flags]

Flags:

I added the options to the command line right before the zksync stuff at the end.

Hmmm. I had 8.07TB stored according to the dashboard, then I stopped the node and restarted it to test the log file change. Now it’s back to 7.34TB. Why???

And what does the disk itself show? I had the same problem, but my disk showed that the HDD was full.
On other nodes it shows that I have a lot of trash, but in reality I can see it has already been deleted.

The disk says 3.7TB free. The dashboard says:
Used 7.35TB
Free 11.54TB
Trash 28.97GB
Overused 0B

You need to add these options before the image name, not after.
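
Roughly like this; everything before the image name is interpreted by Docker, everything after it is passed to storagenode, which is why the flag was rejected there. Mounts, env vars and ports are trimmed here for brevity, keep your own:

    docker run -d --restart unless-stopped --stop-timeout 300 \
        --log-opt max-size=500m \
        --log-opt max-file=5 \
        --name storagenode \
        storjlabs/storagenode:latest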

Because the databases were not updated.
Please search your logs for errors related to the databases. It’s also worth checking them:
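
Checking them usually boils down to running SQLite’s integrity check on each database file while the node is stopped; the path below is an assumption and the linked how-to has the full procedure:

    sudo apt install sqlite3
    # With the node stopped, check every storagenode database file
    for db in /mnt/storj/storagenode/storage/*.db; do
        echo "$db"
        sqlite3 "$db" "PRAGMA integrity_check;"
    done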

You need to use the --si option to get the same units as the dashboard.
Also, the dashboard shows the free space within the allocation, not on the disk.
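
For example (path assumed):

    # --si reports decimal TB/GB like the dashboard; -h would use binary units
    sudo du -s --si /mnt/storj/storagenode/storage/blobs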

The database-malformed check found no problems. The log max-size and max-file options seem to work.

Good to know!
Do you have errors related to databases or filewalkers in your logs?
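
A quick way to check, assuming the container is named storagenode:

    # Count "database is locked" errors and look for any filewalker messages
    docker logs storagenode 2>&1 | grep -c "database is locked"
    docker logs storagenode 2>&1 | grep -iE "filewalker|malformed" | tail -n 20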

Since Saturday morning, I have 2640 “piecestore upload failed / database is locked” errors. I don’t see anything about a filewalker. Is that something separate I have to run with Docker, or is it built into storagenode?

2024-06-22T00:16:13Z ERROR piecestore upload failed {“Process”: “storagenode”, “Piece ID”: “2IHKWX3I6MHJH3VVHSFB6XRS6KFM4HJF4FUVF5WM6YX5FR63XJFQ”, “Satellite ID”: “1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE”, “Action”: “PUT”, “Remote Address”: “109.61.92.75:33284”, “Size”: 249856, “error”: “pieceexpirationdb: database is locked”, “errorVerbose”: “pieceexpirationdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).SetExpiration:111\n\tstorj.io/storj/storagenode/pieces.(*Store).SetExpiration:584\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:486\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:544\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:167\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:109\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:157\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35”}

These are uploads of TTL data that were not registered as pieces with a TTL. That means those pieces will not be auto-deleted when they expire; instead they will be collected later by the garbage collector, moved to trash, and deleted 7 days after that, so up to two weeks in total instead of right when they expire.

So yes, this is still an issue. If you cannot add more RAM to that system, then probably the only solution is to move the databases to a less loaded disk or SSD if you have one. Even a fast, durable USB stick would do, if you don’t mind losing the statistics and history data should it die (it will not affect payouts, though).
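
On a Docker node the move roughly comes down to the following; the paths are examples and the forum how-to on moving databases has the exact steps:

    # Stop and remove the container first (data and identity stay in place)
    docker stop -t 300 storagenode && docker rm storagenode
    # Prepare a directory on the SSD and copy the existing database files over
    mkdir -p /ssd/storj-dbs
    cp /mnt/storj/storagenode/storage/*.db /ssd/storj-dbs/
    # Point storage2.database-dir in config.yaml at the new location, add a
    # matching --mount for /ssd/storj-dbs to the docker run command, and then
    # start the node again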