Node having issues after operating for a long time without problems

I’ve been running a node for a very long time. Yesterday I had a little “incident”: I had to resize the Docker volume (“/var/lib/docker”). Just for the record, I have a separate volume for node storage.

After the resizing was done something must have gone wrong, as I had to re-deploy every Docker container. All Docker volumes were still present. When it came to restoring the storj node, at first glance it looked fine, but then I got these weird errors. I don’t know why this happened, as I stopped Docker before doing the volume resize.

The storj-storage volume has 550 GB (and of course it was already running almost full):
550G total, 522G used, 27G available (96% used)

2022-11-15T08:42:20.377Z	WARN	piecestore:monitor	Disk space is less than requested. Allocated space is	{"Process": "storagenode", "bytes": 28450369536}
2022-11-15T08:42:20.377Z	ERROR	piecestore:monitor	Total disk space is less than required minimum	{"Process": "storagenode", "bytes": 500000000000}
2022-11-15T08:42:20.377Z	ERROR	services	unexpected shutdown of a runner
Error: piecestore monitor: disk space requirement not met
2022-11-15 08:42:20,510 INFO exited: storagenode (exit status 1; not expected)

and further on

Error: Error starting master database on storagenode: database: info opening file "config/storage/info.db" failed: context canceled
storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:324
storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:306
storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:281
storj.io/storj/storagenode/storagenodedb.OpenExisting:248
main.cmdRun:193
storj.io/private/process.cleanup.func1.4:377
storj.io/private/process.cleanup.func1:395
github.com/spf13/cobra.(*Command).execute:852
github.com/spf13/cobra.(*Command).ExecuteC:960
github.com/spf13/cobra.(*Command).Execute:897
storj.io/private/process.ExecWithCustomConfigAndLogger:92
main.main:478
runtime.main:255
2022-11-15 08:42:21,540 INFO stopped: storagenode (exit status 1)
2022-11-15 08:42:21,541 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

I didn’t change anything on the storj-storage volume I’m binding into the storj container! It still has the same size and I’m very confused about why this happens. I already searched the forum and the Storj docs for this, and tried backing up all the db files and restarting the node to recover them as new DBs, but the error happens again.

Hello @gkn,
Welcome to the forum!

Please try to stop and remove the storagenode container and run it again with all your parameters.
After that, make sure the databases are accessible to the user who runs the storagenode container.
Maybe you ran your node with sudo before and have now added the option --user $(id -u):$(id -g)?
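
For example, you could check it like this (the /mnt/storj/storagenode path below is only a placeholder, replace it with your own storage location):

# who owns the storage directory and the databases?
ls -ld /mnt/storj/storagenode/storage
ls -l /mnt/storj/storagenode/storage/*.db

# which UID/GID does --user $(id -u):$(id -g) pass in?
id -u; id -g

# if the owner does not match, ownership could be fixed with something like:
# sudo chown -R "$(id -u):$(id -g)" /mnt/storj/storagenode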

Hello @Alexey,
I already tried that so many times now without success :-/

It feels like the only option left is to totally prune the storage, but of course it really feels bad to purge all stored files and risk getting punished for that by the Storj network.

Did you have the option --user $(id -u):$(id -g) before?

I tried from scratch as documented here:

Before that I moved config.yaml to config.yaml~bak.

Obviously I replaced the <identity-dir> and <storage-dir> placeholders with the correct paths.

docker run --rm -e SETUP="true" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="<storage-dir>",destination=/app/config \
    --name storagenode storjlabs/storagenode:latest

Still the same faulty result:

2022-11-15T12:46:05.450Z	WARN	piecestore:monitor	Disk space is less than requested. Allocated space is	{"Process": "storagenode", "bytes": 28450234368}
2022-11-15T12:46:05.450Z	ERROR	piecestore:monitor	Total disk space is less than required minimum	{"Process": "storagenode", "bytes": 500000000000}
2022-11-15T12:46:05.450Z	ERROR	services	unexpected shutdown of a runner	{"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: disk space requirement not met", "errorVerbose": "piecestore monitor: disk space requirement not met\n\tstorj.io/storj/storagenode/monitor.(*Service).Run:125\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2022-11-15T12:46:05.450Z	ERROR	nodestats:cache	Get pricing-model/join date failed	{"Process": "storagenode", "error": "context canceled"}

I’m absolutely confused by this:
piecestore monitor: disk space requirement not met

I’m sure I read somewhere that even for a complete node shutdown one should gradually reduce the storage to help the network relocate the data. But obviously this indicates that’s not possible.

But nevertheless… I don’t want to quit, but if the node doesn’t work anymore and Storj may reduce my reputation, I’m thinking about such a hard quit :-/

Please never run setup against a working node - you may destroy it if you provide a wrong path for the identity or data. This command should be run only once for the entire life of the node.

What I am asking is not related to config.yaml. Did you have the option --user $(id -u):$(id -g) in your normal full docker run command before?

If you do not remember, then please show the result of this command (replace the path /mnt/storj/storagenode/ with yours):

ls -l /mnt/storj/storagenode/storage

This is related either to corrupted databases or to wrong permissions.

Total disk space is less than required minimum

I found the culprit for this… for some reason 1 TB was configured in config.yaml (maybe related to my attempt to fix things by recreating this config after moving the old original one to config.yaml~bak)… this explains a lot.

But the node still fails:

2022-11-15T13:01:23.462Z	ERROR	piecestore:cache	error getting current used space: 	{"Process": "storagenode", "error": "context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled", "errorVerbose": "group:\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled"}
Error: piecestore monitor: disk space requirement not met
2022-11-15 13:01:23,541 INFO exited: storagenode (exit status 1; not expected)

Something is totally screwed up…

And by the way… I know these SQLite DBs are very sensitive to crashes and so on. Why the hell is there no backup of such files inside the Storj network, e.g. keyed by the node identifier, so that something like auto-recovery on such failures would be possible? :smiley:

Funny… a storage/backup solution that doesn’t back up its own sensitive databases…

Did you have the option --user $(id -u):$(id -g)
Regarding this… I’m using docker-compose, and on the virtual machine Docker is running as root… the same goes for the storage permissions, actually.

So I don’t think this can be the issue… I checked the file permissions in the storage as well, just to be sure.

I’ve backed up my original storage/*.db files and made another copy into recovery/.

With this script I’ve tried to check for corrupt DBs, but all of them are fine:

#!/bin/bash
# Check every SQLite database in the current directory for corruption.
# Run this from the directory that contains the *.db files.
_pwd=$(pwd)

for sqlite in "${_pwd}"/*.db; do
  sqlite=$(basename "${sqlite}")
  echo "Checking ${sqlite}"

  # Mount the current directory into the container and run an integrity check;
  # -w /data makes sure the relative path resolves inside the mounted directory.
  docker run --rm -it \
    -v "${_pwd}":/data/ \
    -w /data \
    sstc/sqlite3 \
    sqlite3 "./${sqlite}" "PRAGMA integrity_check"
done

By the way, I’m using a local (bind) mount in docker-compose; is this a problem?

  storj-node-storage:
    external: false
    name: storj-node-storage
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: /path/to/my/storage

Could this cause any problems? My initial deployment was without docker-compose, but as I had some recurring issues with Docker I decided to switch to more reliable deployments using docker-compose instead of manual deployments, where containers can disappear without the actual configuration being recorded anywhere (it is now reflected in my docker-compose.yaml, so I can redeploy anytime with the current configuration).

It doesn’t matter. The variables from the docker run command take precedence over the parameters in the config.yaml file.
Again, this has nothing to do with your config.yaml file. It is either a broken database or wrong permissions, no other options.
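
As a small illustration (placeholder paths as in the setup command above; WALLET, EMAIL, ADDRESS and the port mapping are omitted here for brevity): if config.yaml contains storage.allocated-disk-space: 1.00 TB, a STORAGE value passed to the run command still wins and the node allocates 500 GB.

docker run -d --restart unless-stopped \
    -e STORAGE="500GB" \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="<storage-dir>",destination=/app/config \
    --name storagenode storjlabs/storagenode:latest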

You may register for free and do so :slight_smile:

Then please remove the --user $(id -u):$(id -g) option.

The correct bind option for storagenode in the docker-compose.yaml file:

    volumes:
      - type: bind
        source: /mnt/w/storagenode5/identity/
        target: /app/identity
      - type: bind
        source: /mnt/w/storagenode5/
        target: /app/config

ERROR piecestore:monitor Total disk space is less than required minimum {"Process": "storagenode", "bytes": 500000000000}

Additional info for you:

df -h /mnt/storj-storage/
Filesystem      Size  Used Avail Use% Mounted on
-               570G  522G   47G  92% /mnt/storj-storage

I also tried to lv-resize the volume by an additional 10 GB to make sure it has some free space (which it had before, but not 500 GB of free space). It’s clear that an already running node cannot have that much free space, since that is exactly what the maximum storage is configured to (= 500000000000 bytes = 500 GB).
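
For reference, the resize was roughly this (the volume group and logical volume names here are placeholders, and I’m assuming ext4 just for the example):

sudo lvextend -L +10G /dev/vg-storj/storj-storage
sudo resize2fs /dev/vg-storj/storj-storage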

So what options do I have? Prune the storage and start from scratch? Will the Storj network recover my node, or will I get a bad reputation for that?

You will recover if you stop playing with volumes for a while :slight_smile:

Please add an argument to your docker-compose.yaml for the storagenode:

    command:
      - "--storage2.monitor.minimum-disk-space=20GB"

When the filewalker re-calculates the used space, it should update the database. After that you may remove this parameter.
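
In context it would look roughly like this (only a sketch; “storagenode” as the service name is a placeholder, keep your own, and the rest of the service definition stays unchanged):

    services:
      storagenode:
        image: storjlabs/storagenode:latest
        command:
          - "--storage2.monitor.minimum-disk-space=20GB"
        # volumes, ports and environment stay as they are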

However, if you provided a wrong path to the identity when you ran the setup command, that could explain why your node cannot recognize the used space - it would belong to another identity.

The correct bind option for storagenode in the docker-compose.yaml file:

    volumes:
      - type: bind
        source: /mnt/w/storagenode5/identity/
        target: /app/identity
      - type: bind
        source: /mnt/w/storagenode5/
        target: /app/config

This is my full setup:

version: '3.5'

volumes:
  storj-node-identity:
    external: true
    name: storj-node-identity
  storj-node-storage:
    external: false
    name: storj-node-storage
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: /mnt/storj-storage

services:
  node:
    image: storjlabs/storagenode:latest
    restart: unless-stopped
    volumes:
      - storj-node-identity:/app/identity
      - storj-node-storage:/app/config
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp

and the environment variables are coming from a Docker .env file.

So I think that should be fine, even though it looks a little different from the setup you showed. To be sure, I also tried your configuration, setting up the bind directly under services.node.volumes, but it results in the same issue as before:

Something is screwed up with the piecestore monitor…

2022-11-15T13:45:53.761Z	ERROR	collector	error during collecting pieces: 	{"Process": "storagenode", "error": "pieceexpirationdb: context canceled", "errorVerbose": "pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:39\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:532\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:88\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

Error: piecestore monitor: disk space requirement not met

It is still the cause.

Please update your binding to the correct one; right now you are using the dangerous -v equivalent (Docker will create an empty volume if the disk is not mounted for any reason).

I’m trying… currently it looks promising:

2022-11-15T13:48:35.430Z	WARN	piecestore:monitor	Disk space is less than requested. Allocated space is	{"Process": "storagenode", "bytes": 49924898816}
2022-11-15T13:48:35.430Z	INFO	trust	Scheduling next refresh	{"Process": "storagenode", "after": "5h57m21.523840512s"}
2022-11-15T13:48:35.431Z	INFO	bandwidth	Performing bandwidth usage rollups	{"Process": "storagenode"}
2022-11-15T13:48:44.716Z	INFO	orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S	sending	{"Process": "storagenode", "count": 5}
....

Let’s see what happens. What exactly does this command argument do? Will it somehow skip something in monitoring?

It lowers the required minimum free space on the disk.
As a result the node will start (even with foreign data), and the filewalker starts to calculate the used space and should update the database.

OK, great… looks like it’s starting now… and even the dashboard is showing up. How long should I wait before removing the additional argument? Will the node do some garbage collection or cleanup or something in the background?

It should. Wait for several hours (it depends on the used storage, the hardware and the filesystem). You will see when the load reduces.


Hmm… oh man… great. Yesterday it looked fine. Sorry, I was unable to report back due to the reply limitations here for new users.

But today: “Your node has been disqualified” :face_with_symbols_over_mouth:
Great how Storj treats operators who are giving their best - disqualified for nothing but problems caused by internal issues with the Storj node SQLite databases and something I would describe as “unclean garbage collection”.

As mentioned in the thread named “[MAIN NODE] Your node has been disqualified”, this is permanent. If that is true I will quit Storj entirely now with my node. I’m really pissed off by that :triumph: (not at you personally, but at Storj).

It’s totally okay to do such things if an operator is doing something wrong, but for an operator who has run a node for months without interruption, this is like getting a kick in the balls for nothing.

I’m keeping the node online for now, waiting for your reply on whether we can do something to get it un-disqualified. But if that’s the final state, I will shut it down immediately and not waste any more time on it.