Node crashed and restarted after some "error getting current used space", "lstat […] bad message" and "Unrecoverable error"

Hi there :slight_smile:

I noticed that one of my nodes was suspiciously active, and found out it had crashed at some point and restarted: the disk activity I’m seeing is simply the filewalker re-scanning the whole disk.
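
A quick way to see how busy the disk actually is while the filewalker runs is iostat from the sysstat package, watching the read throughput every few seconds. The device name below is only an example; point it at whichever disk holds the node’s data:

iostat -dx /dev/sdb 5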

Here are a few log lines from around the moment it crashed and abruptly restarted:

[...]
2021-03-30T17:15:21.127Z        INFO    piecestore      download started        {"Piece ID": "BSBUCI6M2C3AKHRSJ3JHM4PFVND55RNJA7XUR4QEEVUZPYVS62DA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2021-03-30T17:15:21.536Z        INFO    piecestore      downloaded      {"Piece ID": "BSBUCI6M2C3AKHRSJ3JHM4PFVND55RNJA7XUR4QEEVUZPYVS62DA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2021-03-30T17:15:22.561Z        ERROR   piecestore:cache        error getting current used space:       {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-03-30T17:15:22.561Z        ERROR   services        unexpected shutdown of a runner {"name": "piecestore:cache", "error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-03-30T17:15:22.592Z        INFO    piecestore      downloaded      {"Piece ID": "UUX5CVLXI3NSHKTJIRF6LH2W7QJHGB5GRXNIUMC5HZPZY2A6YA3A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-03-30T17:15:25.255Z        FATAL   Unrecoverable error     {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-03-30T17:15:28.099Z        INFO    Configuration loaded    {"Location": "/app/config/config.yaml"}
2021-03-30T17:15:28.105Z        INFO    Operator email  {"Address": "XXXX@XXXX.com"}
2021-03-30T17:15:28.105Z        INFO    Operator wallet {"Address": "0xXXXX"}
[...]
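
(For reference, these lines can be pulled out of a Docker-based node’s log with something like the following, assuming the usual container name storagenode:

docker logs storagenode 2>&1 | grep -B 3 -A 3 "Unrecoverable error"

The -B/-A options just keep a few lines of context around the FATAL entry.)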

The node is now back on track, receiving and sending data normally, and nothing looks wrong in the logs since the crash.

Scores look good on the dashboard too (although the crash was only 4 hours ago, so maybe they’re not up to date yet).

Is there anything I should check, besides maybe running a disk check just in case?

You should definitely do that before any major file loss.
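
If the drive supports SMART, a quick health check is a good complement to a surface scan; something like this, assuming the disk is /dev/sdb and smartmontools is installed:

sudo smartctl -H /dev/sdb     # overall health verdict
sudo smartctl -A /dev/sdb     # attribute table: watch reallocated/pending sector counts

Keep in mind this only covers the physical drive; a "bad message" from lstat points at filesystem-level corruption, which needs an offline fsck to find and fix.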


Crashed again today :confused:

2021-04-01T06:18:42.918Z        ERROR   piecestore:cache        error getting current used space:       {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-04-01T06:18:42.918Z        ERROR   services        unexpected shutdown of a runner {"name": "piecestore:cache", "error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-04-01T06:18:46.259Z        FATAL   Unrecoverable error     {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

(then the node restarts)

Note: badblocks -v /dev/sdb1 > badsectors-disk-1.txt is currently running against this node’s disk.
Nothing has shown up in the output file so far.

But the “lstat […]: bad message” error is concerning :frowning:
I wouldn’t expect the Storj software to crash when this happens, though!
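
(The "bad message" here is the kernel’s EBADMSG error; on ext4 it usually means lstat hit a corrupted directory entry or inode, and there is often a matching complaint in the kernel log:

dmesg | grep -i ext4

If that shows checksum or corruption errors, it’s the filesystem rather than Storj itself that is unhappy.)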

EDIT: I see it failed on the exact same file, so I assume it will keep failing every time the filewalker reaches it, until I delete or fix this file…
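
(To double-check it really is that one file, stat-ing the same path from the host should reproduce the error directly; the data directory below is a placeholder for whatever host path is mapped to the container’s config/ directory:

stat <node-data-dir>/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1

On a damaged ext4 entry this should fail with the same "Bad message" the node reports.)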

Update:
badblocks -v /dev/sdb1 > badsectors-disk-1.txt finished and reported nothing; the output file is empty. The command output was as follows:

pi@raspberrypi:~ $ sudo badblocks -v /dev/sdb1 > badsectors-disk-1.txt
Checking blocks 0 to 1953513471
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
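
Since fsck needs the filesystem offline, the node has to be stopped and the drive unmounted first; roughly the following, assuming the usual container name storagenode:

docker stop -t 300 storagenode
sudo umount /dev/sdb1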

Moving on to fsck -y /dev/sdb1 (after unmounting the drive as above):

root@raspberrypi:/home/pi/storj# fsck -y /dev/sdb1
fsck from util-linux 2.33.1
e2fsck 1.44.5 (15-Dec-2018)
2TB contains a file system with errors, check forced.

Pass 1: Checking inodes, blocks, and sizes
Inode 35658038 seems to contain garbage.  Clear<y>? yes
Inode 35658039 passes checks, but checksum does not match inode.  Fix? yes
Inode 35658040 passes checks, but checksum does not match inode.  Fix? yes
Inode 35658041 seems to contain garbage.  Clear<y>? yes

[...]

Pass 2: Checking directory structure
Entry 'cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1' in /storj_node_1/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d (35658018) has deleted/unused inode 35658038.  Clear? yes

Pass 3: Checking directory connectivity

Pass 4: Checking reference counts
Unattached inode 36575245
Connect to /lost+found? yes

Inode 36575245 ref count is 2, should be 1.  Fix? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(143290698--143291264) +(173800873--173800925)
Fix? yes

Free blocks count wrong for group #4372 (2492, counted=3059).
Fix? yes

Free blocks count wrong for group #5303 (1914, counted=1861).
Fix? yes

Free blocks count wrong (29733072, counted=29733586).
Fix? yes

Inode bitmap differences:  -35658038 +36575245
Fix? yes

Free inodes count wrong for group #4352 (7659, counted=7660).
Fix? yes

Free inodes count wrong for group #4464 (7834, counted=7833).
Fix? yes


2TB: ***** FILE SYSTEM WAS MODIFIED *****
2TB: 2097452/122101760 files (7.7% non-contiguous), 458644782/488378368 blocks

So fsck did find some problems, including the file that was causing the issue.
I also ended up with a lost+found directory containing one file. I guess I can just get rid of it, right? That data is lost for good and that’s it.
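
(It’s easy to peek at what fsck dropped in there before deleting it; the path below is a placeholder for the disk’s mount point:

ls -la <mount-point>/lost+found/
file <mount-point>/lost+found/*

e2fsck names recovered files after their inode number, so this one should show up as #36575245, matching the unattached inode from pass 4. If it’s a partial .sj1 piece there’s nothing useful to do with it anyway.)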

I figured I could remount the disk and restart the node to see whether the filewalker process can complete successfully this time (it’s in progress).
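
For the record, the remount/restart boils down to something like this (the mount point is a placeholder for whatever host path your docker run maps to the container’s config/ directory, and storagenode is just the usual container name):

sudo mount /dev/sdb1 <mount-point>
docker start storagenode
docker logs -f storagenode 2>&1 | grep -i -E "fatal|bad message"

The last command is just to watch for the lstat error coming back while the filewalker runs.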

But if any of you would like to add some insight on what else to check on top of all that, or on what could have caused these issues, please feel free :slight_smile:


After the fsck run that fixed a few issues on the disk (see above), my node has now been running fine for 100+ hours.

So it fixed the problem for now :wink:
Thx @nerdatwork.
