Node restarts in a few hours

Hello!
I have a node on an RPi 3B+ with a 2TB USB HDD attached. It was running for about 8 months without any problems.
A few days ago I started to notice that the uptime is just a few hours. I looked into the log file and found this:

today at 13:30:09 2021-11-16T12:30:09.397Z ERROR piecestore:cache error getting current used space: {“error”: “lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6j/xqsn6rqsuiwyl7vacpzh6rked2g7uaibfk5pvljym3f6a6v4sa.sj1: input/output error”, “errorVerbose”: “lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6j/xqsn6rqsuiwyl7vacpzh6rked2g7uaibfk5pvljym3f6a6v4sa.sj1: input/output error\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

today at 13:30:09 2021-11-16T12:30:09.438Z ERROR services unexpected shutdown of a runner {“name”: “piecestore:cache”, “error”: “lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6j/xqsn6rqsuiwyl7vacpzh6rked2g7uaibfk5pvljym3f6a6v4sa.sj1: input/output error”, “errorVerbose”: “lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/6j/xqsn6rqsuiwyl7vacpzh6rked2g7uaibfk5pvljym3f6a6v4sa.sj1: input/output error\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

Following this, the node restarted and has been working normally since then.
I checked with the df -h command whether there is enough free space: I have 1.1TB free out of 1.7TB.

Can you please help with any suggestions?
Thank you!
Balázs

Hi @Balage76

There are two errors listed, and they are related. The first shows that piecestore:cache was unable to determine the current amount of used space, and the second shows that the same piecestore:cache process then shut down unexpectedly.

The errors point to the node (CPU or RAM) being overwhelmed or the storage being too slow to respond. It’s hard to know exactly without more errors but it would be worth keeping a closer eye on the node.

Do you have a separate power supply for the HDD? Or is it relying on the RPi USB supplied power rails?

If the drive is running on the RPi power rails, it’s possible that the drive suddenly pulled too much power resulting in a voltage drop on the power supply and thus a reboot of the RPi.

I have a genuine RPi power supply for the RPi and a dedicated PSU for the HDD. It is a 3.5" HDD.
On the other hand, I found another error message in the log, which might be connected to this issue:

today at 13:30:29 2021-11-16T12:30:29.611Z INFO failed to sufficiently increase receive buffer size (was: 176 kiB, wanted: 2048 kiB, got: 352 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.

This line was among the lines logged after the restart happened. I don't know if it is related, but I entered this command:

sudo sysctl -w net.core.rmem_max=2500000
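To make the setting survive a reboot, I understand it can also go into a sysctl configuration file; a minimal sketch (the file name is just a convention I picked):

echo "net.core.rmem_max=2500000" | sudo tee /etc/sysctl.d/99-quic-rmem.conf
sudo sysctl --system   # reload all sysctl configuration files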

Let’s see if it helps…
Thank you for your help and explanations!

I would say this is a hardware issue; it's always bad to see this error in any part of the logs. First I would try a different USB port and, if you can, change the USB cable. If that doesn't help, then you may have a failing drive.


The first step is to stop the storagenode and check the disk for errors with sudo fsck -f /dev/sda1 (you need to unmount it first with sudo umount /mnt/storj, or whatever your mount point is).
If it's an NTFS filesystem, then you need to check it with chkdsk /f on Windows rather than on the Pi, and think about migrating from NTFS to ext4.

Thank you Alexey, it turned out that I skipped creating the ext4 partition at the very beginning, so the drive is still NTFS.
I will check the drive first with chkdsk on a Windows machine.
Is there any easy way to convert NTFS to ext4? Or just backup-format-restore?
If there is an easy way, I might give it a try, provided chkdsk does not find any bad sectors.
If there are any bad sectors, I will try to migrate/copy to another disk to keep the node running…

Yes, that is the proper way. You could also do it in a slower, in-place way with LVM and GParted, but read my story first:

Ok, so the situation is the following:

I found a 4TB WD HDD, put it into an external case (with its own external PSU) and attached it to the RPi's second USB port.
Now both HDDs are attached to the RPi.
I partitioned and formatted the 4TB drive; it has only one ext4 partition.
Old drive /dev/sda1 → /mnt/storage
New drive /dev/sdb1 → /mnt/storage2
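For anyone following along, setting up the new drive boiled down to something like this (a sketch; the exact commands I used may have differed slightly):

sudo mkfs.ext4 /dev/sdb1
sudo mkdir -p /mnt/storage2
sudo mount /dev/sdb1 /mnt/storage2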

The old drive holds only Storj data, so can I simply copy everything from the old drive to the new one, for example with the rsync command? I tried to understand the rsync parameters, but I'm not sure about the exact command to include everything… :frowning:
Can you please help me with the exact command?

Yes, rsync is enough.

  1. While the node is still running, run:
rsync -avPh --inplace /mnt/storage /mnt/storage2
  2. Run it several times until the difference becomes negligible.
  3. Stop and remove the container.
  4. Run it one last time with the --delete option (it will delete from the destination any files that were deleted from the source during the copy):
rsync -avPh --inplace --delete /mnt/storage /mnt/storage2
  5. Run the node again with all your parameters, including the changed ones (the path to the data, and maybe to the identity folder if it's on the disk with the data; it's recommended to move it there from your SD card).
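Put together, a minimal sketch of the whole sequence, assuming the container is named storagenode and the mount points above:

rsync -avPh --inplace /mnt/storage /mnt/storage2   # repeat until a pass finishes quickly
docker stop -t 300 storagenode
docker rm storagenode
rsync -avPh --inplace --delete /mnt/storage /mnt/storage2
# then start the node with your usual docker run command,
# with the data (and identity) mounts pointing at the new disk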

Thank you, rsync is running now.

The poor old HDD is having a really hard time now: because of the restarts the filewalker is running almost all the time, plus normal node operation, plus the rsync I just started… I hope it will survive the next few days…

Yep, identity folder is on the HDD. :slight_smile:

You can reduce the allocated space to the minimum (500GB) in the docker run command. That will stop any ingress, which should reduce the load.
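Roughly like this, assuming the container is named storagenode; only the STORAGE value changes, everything else stays as in your usual run command:

docker stop -t 300 storagenode
docker rm storagenode
# re-run your usual docker run command, changing only the allocation flag:
#   -e STORAGE="500GB"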

Rsync has been running for 2 days already (first pass)… 615GB out of 831GB copied over… I reduced the allocated space to 700GB; after 6 hours there is no ingress any more. Yes, it is a bit quicker without ingress traffic, but still… it takes a while to properly copy everything…

Ok, so the situation is the following:
Rsync is done, both with the node running and with it offline.
I ran the db integrity checks and had to fix the bandwidth.db file. I did that, and the check now says it is OK.
On the other hand, I get a "database disk image is malformed" error for piece_expiration.db.
I tried to fix it the same way as the bandwidth.db, but failed:

The node is not starting now. :frowning:

If you still have the original database, you can just copy it back over; an outdated database is better than an empty one.
Perhaps during the last rsync you forgot to add the --delete option, so your destination still had journal files like *.db-wal and/or *.db-shm. They exist if a database was not closed properly or if you copied it on the fly (your case), and they were wrongly applied when you started the node in the new place.
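Leftover journal files on the destination can be spotted with something like this (the path is an assumption based on your mounts):

find /mnt/storage2 -name "*.db-wal" -o -name "*.db-shm"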

To figure out which database is corrupted, you need to run a check script from the https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed- article.
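Roughly, that check boils down to running PRAGMA integrity_check against each database file; a sketch, assuming sqlite3 is installed on the Pi and the databases live under the path below (adjust it to your layout):

DB_DIR=/mnt/storage2/storage/storage   # assumption: wherever your *.db files actually are
for db in "$DB_DIR"/*.db; do
  echo "$db"
  sqlite3 "$db" "PRAGMA integrity_check;"   # a healthy database prints "ok"
done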

In your screenshot I can see that loading the database back failed and was rolled back. This is possible only if you skipped the fix step:

cat /storage/dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >/storage/dump_all_notrans.sql
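For context, that step sits in the middle of the sequence from the article, roughly like this (shown for piece_expiration.db; the /storage path follows the article's convention and may differ on your system):

sqlite3 /storage/piece_expiration.db ".dump" > /storage/dump_all.sql
cat /storage/dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >/storage/dump_all_notrans.sql
mv /storage/piece_expiration.db /storage/piece_expiration.db.bak
sqlite3 /storage/piece_expiration.db ".read /storage/dump_all_notrans.sql"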

Yes, you are right, I forgot the last run with the --delete option.
Now the node is running again from the old HDD, so I’m back (almost) at the starting line.
I noticed that after I reduced the node size, the node was still restarting a few times. It seems the restarts happened when it tried to empty the trash. I also noticed that the trash has been stuck at 106.1 GB or just above for weeks now. Sometimes it goes up to 108-109 GB, then back down to 106.1 GB.
Is the second database error related to this?

Perhaps you should change the owner of the trash folder. However, if you provide the full error line, I can say more.
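If it does turn out to be a permissions problem, something along these lines could fix the ownership (the path, user and group here are assumptions; use the owner of the rest of your storage folder):

sudo chown -R pi:pi /mnt/storage/storage/trash   # path and pi:pi are placeholders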

Dear Alexey,

Thank you once again for your help! I managed to fix the node, however it was quite a rollercoaster ride… :slight_smile:

On Wednesday, the node started to crash more and more often, so I decided to stop it and make a few more attempts with rsync. Unfortunately, I guess due to the old HDD's errors, even rsync (with --delete) was not able to finish properly. So I took the old HDD and connected it to my Windows computer to run chkdsk. Even chkdsk /f was not able to complete; it simply froze while trying to fix various .sj1 files.
By midnight I simply gave up, because when I plugged the old drive back into the RPi, the system would not even start. It seemed I had lost my node… :frowning:
For two days I wasn't at home, but on Saturday evening I decided to give it one more try. I took another SD card, installed Ubuntu and connected the new HDD. I had to set up the mount point, as there wasn't one for the new drive, but by the end of the day the node was running!
The trash cleared properly, I have ingress and egress, and there are no warnings in the log. :slight_smile:
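For anyone hitting the same thing, the mount point fix boils down to an /etc/fstab entry along these lines (the UUID and mount point are placeholders, not my real ones):

sudo blkid   # find the UUID of the data partition
# /etc/fstab:
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/storage2  ext4  defaults,noatime  0  2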

The only issue now is the online score, but I guess it will improve over time:

Yes, it could be. You need to try it over and over again until it fixes all the errors.
Unfortunately, Linux is unable to fix NTFS errors, so this node could stop working at any time.

No, no. The node is now running on the new HDD. The old one is on the shelf…
