I noticed one of my nodes was suspiciously active, and found out that this was because it crashed at some point and restarted: the disk activity I'm seeing is simply the filewalker re-scanning the whole disk.
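(Side note: if your node runs in Docker like mine, a quick way to confirm a silent restart is to ask Docker directly. The container name storagenode below is just the default from the official setup instructions; adjust if yours differs.)

# Show how many times the container restarted and when it last started
docker inspect -f 'restarts={{.RestartCount}} started={{.State.StartedAt}}' storagenode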
Here are a few log lines from around the moment it crashed and abruptly restarted:
[...]
2021-03-30T17:15:21.127Z INFO piecestore download started {"Piece ID": "BSBUCI6M2C3AKHRSJ3JHM4PFVND55RNJA7XUR4QEEVUZPYVS62DA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2021-03-30T17:15:21.536Z INFO piecestore downloaded {"Piece ID": "BSBUCI6M2C3AKHRSJ3JHM4PFVND55RNJA7XUR4QEEVUZPYVS62DA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2021-03-30T17:15:22.561Z ERROR piecestore:cache error getting current used space: {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-03-30T17:15:22.561Z ERROR services unexpected shutdown of a runner {"name": "piecestore:cache", "error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-03-30T17:15:22.592Z INFO piecestore downloaded {"Piece ID": "UUX5CVLXI3NSHKTJIRF6LH2W7QJHGB5GRXNIUMC5HZPZY2A6YA3A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET"}
2021-03-30T17:15:25.255Z FATAL Unrecoverable error {"error": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message", "errorVerbose": "lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1: bad message\n\tstorj.io/storj/storage/filestore.walkNamespaceWithPrefix:787\n\tstorj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath:725\n\tstorj.io/storj/storage/filestore.(*Dir).WalkNamespace:685\n\tstorj.io/storj/storage/filestore.(*blobStore).WalkNamespace:280\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces:496\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:661\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2021-03-30T17:15:28.099Z INFO Configuration loaded {"Location": "/app/config/config.yaml"}
2021-03-30T17:15:28.105Z INFO Operator email {"Address": "XXXX@XXXX.com"}
2021-03-30T17:15:28.105Z INFO Operator wallet {"Address": "0xXXXX"}
[...]
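For what it's worth, the "bad message" in those lstat errors is errno EBADMSG, which (as I understand it) ext4 returns when it detects a metadata checksum mismatch on an inode, i.e. filesystem-level corruption rather than an application bug. It should be reproducible outside the node by stat-ing the file directly; the mount point below is just an example, adjust to your setup:

# Fails with "Bad message" while the inode is corrupted
sudo stat /mnt/2TB/storj_node_1/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d/cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1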
The node is now back on track, receiving and sending data normally, and nothing looks wrong in the logs since the crash.
Update: badblocks -v /dev/sdb1 > badsectors-disk-1.txt finished and reported nothing; the output file is empty. The command output was as follows:
pi@raspberrypi:~ $ sudo badblocks -v /dev/sdb1 > badsectors-disk-1.txt
Checking blocks 0 to 1953513471
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
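Since a clean read-only badblocks pass doesn't fully rule out drive trouble, it might also be worth glancing at the SMART attributes (this assumes smartmontools is installed; USB enclosures sometimes need an extra -d sat flag):

sudo smartctl -a /dev/sdb
# Attributes worth checking: Reallocated_Sector_Ct, Current_Pending_Sector,
# Offline_Uncorrectable, UDMA_CRC_Error_Count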
Moving on to fsck -y /dev/sdb1 (after unmounting the drive):
root@raspberrypi:/home/pi/storj# fsck -y /dev/sdb1
fsck from util-linux 2.33.1
e2fsck 1.44.5 (15-Dec-2018)
2TB contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 35658038 seems to contain garbage. Clear<y>? yes
Inode 35658039 passes checks, but checksum does not match inode. Fix? yes
Inode 35658040 passes checks, but checksum does not match inode. Fix? yes
Inode 35658041 seems to contain garbage. Clear<y>? yes
[...]
Pass 2: Checking directory structure
Entry 'cu3edg2izxgw6yho7yjwzmiwbj23qas5nzlhmudfdvtysiet2a.sj1' in /storj_node_1/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7d (35658018) has deleted/unused inode 35658038. Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 36575245
Connect to /lost+found? yes
Inode 36575245 ref count is 2, should be 1. Fix? yes
Pass 5: Checking group summary information
Block bitmap differences: -(143290698--143291264) +(173800873--173800925)
Fix? yes
Free blocks count wrong for group #4372 (2492, counted=3059).
Fix? yes
Free blocks count wrong for group #5303 (1914, counted=1861).
Fix? yes
Free blocks count wrong (29733072, counted=29733586).
Fix? yes
Inode bitmap differences: -35658038 +36575245
Fix? yes
Free inodes count wrong for group #4352 (7659, counted=7660).
Fix? yes
Free inodes count wrong for group #4464 (7834, counted=7833).
Fix? yes
2TB: ***** FILE SYSTEM WAS MODIFIED *****
2TB: 2097452/122101760 files (7.7% non-contiguous), 458644782/488378368 blocks
So, it did find some problems, including the very file that was causing the issue.
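A cheap sanity check at this point is a second forced pass, which should now come back clean:

# -f forces a full check even if the filesystem is marked clean
sudo e2fsck -f /dev/sdb1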
I also ended up with a lost+found directory containing one file. I guess I can just get rid of it, right? It's lost for good and that's it.
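(Before deleting it, it's easy to peek at what the recovered file actually is: e2fsck names lost+found entries after their inode number, so for the unattached inode above it should look something like this, with /mnt/2TB standing in for wherever the drive is mounted.)

ls -li /mnt/2TB/lost+found/
sudo file '/mnt/2TB/lost+found/#36575245'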
I assumed I could simply remount the disk and restart the node to see whether it can run the filewalker process successfully this time (it's in progress).
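For the record, the remount/restart sequence is roughly this on a Docker setup (device, mount point, and container name are again the assumptions from above):

sudo mount /dev/sdb1 /mnt/2TB
docker start storagenode
# Watch for filewalker and error lines as it rescans the disk
docker logs --follow storagenode 2>&1 | grep -iE 'error|fatal|walk'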
But if any of you would like to add some insight on what else to check, or on what could have caused these issues, please feel free!