Node keeps restarting raspberry

4ich · July 23, 2021, 10:33pm

Hi there
I have a quad sata hat on my pi4b 8gb with 4x2tb Drives in e5.
Since a few days i get notifications that my node is offline for a few Minutes.

I got this error in the log.


'''2021-07-23T22:13:25.576Z        ERROR   piecestore      download failed {"Piece ID": "K7DCW53X3AA4PSB3BA2NAOLZEU3L4GL7YLFEL2QJIRS7CXJ3NCPA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_REPAIR", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:534\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:217\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:102\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:95\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2021-07-23T22:13:54.548Z        ERROR   pieces:trash    emptying trash failed                                                         {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-07-23T22:13:54.856Z        ERROR   piecestore:cache        error getting current used space:                                     {"error": "lstat config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/y2/fofv75k5nij2mhx7aq2kuyu66nwkv6l5kvncudl7n32pastujq.sj1: structure needs cleaning; lstat config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/fr/5rwz7spaqjqdhmj5wobmyi436eqzcnjtawgwfjifvq5pmpzjxq.sj1: structure needs cleaning; lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/wn/qg55nknqud4j6xf2npqmmtkgud5pk5jypmv3zhzei2kbom2tnq.sj1: structure needs cleaning; lstat config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/dk/fzys34za3bz2sxdgaut2elcyppz3yuxisro3br4fugf4gxd3iq.sj1: structure needs cleaning; lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/hq/a6ohctx4ymlesavuvud3s3s5n5ljd4d3yb6vlcgetzgsknrdzq.sj1: structure needs cleaning; context canceled; context canceled", "errorVerbose": "group:\n--- lstat config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/y2/fofv75k5nij2mhx7aq2kuyu66nwkv6l5kvncudl7n32pastujq.sj1: structure needs cleaning\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n--- lstat config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/fr/5rwz7spaqjqdhmj5wobmyi436eqzcnjtawgwfjifvq5pmpzjxq.sj1: structure needs cleaning\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n--- lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/wn/qg55nknqud4j6xf2npqmmtkgud5pk5jypmv3zhzei2kbom2tnq.sj1: structure needs cleaning\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n--- lstat config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/dk/fzys34za3bz2sxdgaut2elcyppz3yuxisro3br4fugf4gxd3iq.sj1: structure needs cleaning\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n--- lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/hq/a6ohctx4ymlesavuvud3s3s5n5ljd4d3yb6vlcgetzgsknrdzq.sj1: structure needs cleaning\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:788\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:284\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:497\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:662\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n--- context canceled\n--- context canceled"}
2021-07-23T22:13:54.958Z        ERROR   pieces:trash    emptying trash failed                                                         {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Alexey · July 23, 2021, 10:39pm

There are two problems in your question. The first one - missed pieces, this will not lead to be offline, the second one is not discovered yet which leads to restart of the container.

To determine the first one - please, search all records for this Piece ID in your logs, when is it uploaded, and was it deleted or just vanished?

The second one - please, search for ERROR records in your logs, I would like to see the last 5 such errors, especially FATAL ones (not about the moving to trash).

4ich · July 26, 2021, 7:10pm

Ok i investigated a bit more…
I have high iowaits from the storagenode processes

And in the logs i have
2021-07-26T19:07:20.176Z ERROR db Unable to read the disk, please verify the disk is not corrupt
Iam checking the raid right now but since All seems fine when i stop the node i dont think i will find any bad blocks…

Any suggstions?
Maybe move the db to sd card or even to a usb (2.0) attached ssd/hdd/stick

Alexey · July 26, 2021, 7:35pm

This is not good. What type of RAID do you use? What is the filesystem? How these disks are connected to the raspberry?

How do you run the docker run? With sudo or without?
If without - please, show result of the command

docker inspect storagenode -f '{{.Mounts}}'

4ich · July 26, 2021, 7:51pm

I run the container without sudo

This is the result of docker inspect storagenaspi -f ‘{{.Mounts}}’
[{bind /mnt/r5/storj/data /app/config true rprivate} {bind /mnt/r5/storj/identity /app/identity true rprivate}]

i have this:
https://wiki.radxa.com/Dual_Quad_SATA_HAT
and 4x2tb drives in Raid 5 since i use it as NAS too …
and as said its an rpi4b with 8 gb ram

Alexey · July 26, 2021, 8:08pm

Please, show result of the command:

ls -l /mnt/r5/storj/data/storage

and

df -HT

How did you build the RAID? I assume with OMV.

4ich · July 26, 2021, 9:24pm

thanks alexey … seems i found the issue. … i think i have a faulty drive or at least a pre failing one
I use plain raspbian with mdadm

ls: cannot access '/mnt/r5/storj/data/storage/info.db-shm': No such file or directory
total 63400
-rw-r--r-- 1 root root 21786624 Jul 26 20:20 bandwidth.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 bandwidth.db-shm
-rw-r--r-- 1 root root   684032 Jul 26 20:24 bandwidth.db-wal
drwx------ 9 root root     4096 Jan  7  2021 blobs
drwx------ 2 root root     4096 Jul 26 20:12 garbage
-rw-r--r-- 1 root root    65536 Jul 26 20:20 heldamount.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 heldamount.db-shm
-rw-r--r-- 1 root root  1623312 Jul 26 20:21 heldamount.db-wal
-rw-r--r-- 1 root root    16384 Jul 26 22:21 info.db
**-????????? ? ?    ?           ?            ? info.db-shm <---doesnt seem correct?**
-rw-r--r-- 1 root root        0 Jul 26 22:23 info.db-wal
-rw-r--r-- 1 root root    24576 Jul 26 20:20 notifications.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 notifications.db-shm
-rw-r--r-- 1 root root    32992 Jul 26 20:20 notifications.db-wal
-rw-r--r-- 1 root root 34328576 Jul 26 20:20 orders.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 orders.db-shm
-rw-r--r-- 1 root root    41232 Jul 26 20:20 orders.db-wal
-rw-r--r-- 1 root root  2101248 Jul 26 20:20 piece_expiration.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 piece_expiration.db-shm
-rw-r--r-- 1 root root    41232 Jul 26 20:20 piece_expiration.db-wal
-rw-r--r-- 1 root root    24576 Jul 26 20:20 pieceinfo.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 pieceinfo.db-shm
-rw-r--r-- 1 root root    32992 Jul 26 20:20 pieceinfo.db-wal
-rw-r--r-- 1 root root    24576 Jul 26 20:20 piece_spaced_used.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 piece_spaced_used.db-shm
-rw-r--r-- 1 root root    32992 Jul 26 20:20 piece_spaced_used.db-wal
-rw-r--r-- 1 root root    24576 Jul 26 20:20 pricing.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 pricing.db-shm
-rw-r--r-- 1 root root    82432 Jul 26 20:20 pricing.db-wal
-rw-r--r-- 1 root root    40960 Jul 26 20:20 reputation.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 reputation.db-shm
-rw-r--r-- 1 root root   189552 Jul 26 20:22 reputation.db-wal
-rw-r--r-- 1 root root    32768 Jul 26 20:20 satellites.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 satellites.db-shm
-rw-r--r-- 1 root root    41232 Jul 26 20:20 satellites.db-wal
-rw-r--r-- 1 root root    24576 Jul 26 20:20 secret.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 secret.db-shm
-rw-r--r-- 1 root root    32992 Jul 26 20:20 secret.db-wal
-rw-r--r-- 1 root root       32 Dec  6  2020 storage-dir-verification
-rw-r--r-- 1 root root   348160 Jul 26 20:20 storage_usage.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 storage_usage.db-shm
-rw-r--r-- 1 root root   193672 Jul 26 20:20 storage_usage.db-wal
drwx------ 2 root root  2473984 Jul 26 20:24 temp
drwx------ 8 root root     4096 Jan 12  2021 trash
-rw-r--r-- 1 root root    20480 Jul 26 20:20 used_serial.db
-rw-r--r-- 1 root root    32768 Jul 26 22:21 used_serial.db-shm
-rw-r--r-- 1 root root    41232 Jul 26 20:20 used_serial.db-wal
-rw------- 1 root root        0 Nov 12  2020 write-test084195137
-rw------- 1 root root        0 Nov 10  2020 write-test290185210
-rw------- 1 root root        0 Nov 10  2020 write-test303027288
-rw------- 1 root root        0 Nov 22  2020 write-test512637377
-rw------- 1 root root        0 Nov 11  2020 write-test608393685
-rw------- 1 root root        0 Nov 16  2020 write-test910705580

Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/root      ext4       32G  3.6G   27G  12% /
devtmpfs       devtmpfs  4.0G     0  4.0G   0% /dev
tmpfs          tmpfs     4.2G   29k  4.2G   1% /dev/shm
tmpfs          tmpfs     4.2G   11M  4.2G   1% /run
tmpfs          tmpfs     5.3M  4.1k  5.3M   1% /run/lock
tmpfs          tmpfs     4.2G     0  4.2G   0% /sys/fs/cgroup
/dev/mmcblk0p1 vfat      265M   56M  209M  22% /boot
tmpfs          tmpfs     826M     0  826M   0% /run/user/1000
overlay        overlay    32G  3.6G   27G  12% /var/lib/docker/overlay2/b064aed017c94b5bbe8b78bfe3eff0aa9091db630e6e971d20ff7a567cee9729/merged
overlay        overlay    32G  3.6G   27G  12% /var/lib/docker/overlay2/92efe3a8997b0794a6b2731125c532d3afead5da386cc0f6053a2a96025bf0a3/merged
/dev/md0       ext4      6.0T  5.2T  466G  92% /mnt/r5

BrightSilence · July 27, 2021, 6:38am

Be sure to run fsck on it. It might be just a file system issue and not necessarily a failing disk.

Alexey · July 27, 2021, 7:43am

I’m agree with @BrightSilence - you need to run fsck, maybe it would be enough.
I would also check the S.M.A.R.T. for every disk.

4ich · July 28, 2021, 4:02pm

Is the offline dq /penalty already a thing?

Alexey · July 28, 2021, 6:37pm

The DQ/penalty for lost data have not disabled.

For downtime your node could be suspended (if online score is fall below 60%). The DQ will be applied if your node would be offline for more than 30 days.

4ich · July 28, 2021, 6:50pm

okay then due to performnce issues when storj is running while rebuild…
i try to dd the faulty disk, skip the errors, add it to the raid and finally repair the failed sectors

4ich · July 29, 2021, 4:55am

my raid looks fine now but for some reason it impacted the “master database” …

2021-07-29T04:53:25.318Z        INFO    Operator email  {"Address": "censored@gmail.com"}
2021-07-29T04:53:25.318Z        INFO    Operator wallet {"Address": "someaddress"}
Error: Error starting master database on storagenode: database: unable to open database file: structure needs cleaning
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:323
        storj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:305
        storj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:281
        storj.io/storj/storagenode/storagenodedb.OpenExisting:248
        main.cmdRun:160
        storj.io/private/process.cleanup.func1.4:363
        storj.io/private/process.cleanup.func1:381
        github.com/spf13/cobra.(*Command).execute:852
        github.com/spf13/cobra.(*Command).ExecuteC:960
        github.com/spf13/cobra.(*Command).Execute:897
        storj.io/private/process.ExecWithCustomConfig:88
        storj.io/private/process.ExecCustomDebug:70
        main.main:385
        runtime.main:204

meantime ill try to determine wich one to fix … with

4ich · July 29, 2021, 4:59am

yeeay

/mnt/r5/storj/data/storage# sqlite3 /mnt/r5/storj/data/storage/orders.db “PRAGMA integrity_check;”
Error: unable to open database file

Alexey · July 29, 2021, 6:01am

Seems that file is not recovered.
You can re-create it, using this manual: How to fix database: file is not a database error – Storj

4ich · July 29, 2021, 3:02pm

it doesnt help
after loading the config this comes immediately:
Error: trust: malformed cache: invalid character ‘\x01’ looking for beginning of value

Alexey · July 29, 2021, 6:10pm

This is another corrupted file, see Search results for 'malformed cache: invalid character' - Storj Community Forum (official)

4ich · July 29, 2021, 6:55pm

thanks … seems back online…
but
i had to drop that table like in Malformed database, fixed database, now won't start after i removed the trust-cache.json

2020-08-17T15:12:47.601-0500	FATAL	Unrecoverable error	{"error": "Error creating tables for master database on storagenode: migrate: index idx_order_archived_at already exists\n\tstorj.io/storj/private/migrate.SQL.Run:274\n\tstorj.io/storj/private/migrate.(*Migration).Run.func1:179\n\tstorj.io/storj/private/dbutil/txutil.withTxOnce:75\n\tstorj.io/storj/private/dbutil/txutil.WithTx:36\n\tstorj.io/storj/private/migrate.(*Migration).Run:178\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:295\n\tmain.cmdRun:183\n\tstorj.io/private/process.cleanup.func1.4:353\n\tstorj.io/private/process.cleanup.func1:371\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57", "errorVerbose": "Error creating tables for master database on storagenode: migrate: index idx_order_archived_at already exists\n\tstorj.io/storj/private/migrate.SQL.Run:274\n\tstorj.io/storj/private/migrate.(*Migration).Run.func1:179\n\tstorj.io/storj/private/dbutil/txutil.withTxOnce:75\n\tstorj.io/storj/private/dbutil/txutil.WithTx:36\n\tstorj.io/storj/private/migrate.(*Migration).Run:178\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:295\n\tmain.cmdRun:183\n\tstorj.io/private/process.cleanup.func1.4:353\n\tstorj.io/private/process.cleanup.func1:371\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57\n\tmain.cmdRun:185\n\tstorj.io/private/process.cleanup.func1.4:353\n\tstorj.io/private/process.cleanup.func1:371\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:88\n\tstorj.io/private/process.Exec:65\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

then i got a ton of errors but it started downloading / uploading … i hope it didnt DQ because of downtime / missing or faulty pieces