Node Restarting Frequently - Several Errors

sharkya · April 18, 2024, 12:53am

Storj node v1.99.3 keeps crashing since April 9.

OS: Ubuntu 22 (SSD)
RAM: 32 GB
Storage: 6x 6TB HDD RAID10 - Ext4 using mdadm - SATA

SMART tests on all drives show no errors and mdadm detail shows RAID is healthy. I even tried replacing a HDD with a bad sector and resynced the RAID. This rig is only running Storj, so other software causing problems is unlikely.

The node logs show several errors:

ERROR piecestore download failed {“Process”: “storagenode”, “Piece ID”: “V623P7SWJNHFUFWFCEP3K2SHMVCKQTE4VSEC6EE673TDINHK26WQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 9728, “Remote Address”: “79.127.223.129:35470”, “error”: “untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled”, “errorVerbose”: “untrusted: unable to get signee: trust: rpc: tcp connector failed: rpc: dial tcp: lookup us1.storj.io: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:621\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:302\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35”}
ERROR piecestore download failed {“Process”: “storagenode”, “Piece ID”: “WP5FEJXT2ULUSOYC6ZTWXNGHO4LXZE4WRZ4MVMOAZO4SMPAIA2CA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 4352, “Remote Address”: “79.127.201.210:41678”, “error”: “trust: rpc: tcp connector failed: rpc: context canceled”, “errorVerbose”: “trust: rpc: tcp connector failed: rpc: context canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190”}
ERROR piecestore error sending hash and order limit {“Process”: “storagenode”, “Piece ID”: “ELSPLWLYTQW3WNHDP2UCJD2KBB4VXZSK5RRVZPYKMW7V26G235GA”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Action”: “GET_REPAIR”, “Offset”: 0, “Size”: 18688, “Remote Address”: “49.13.226.201:47162”, “error”: “write tcp 172.17.0.3:28967->49.13.226.201:47162: write: broken pipe”, “errorVerbose”: “write tcp 172.17.0.3:28967->49.13.226.201:47162: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:408\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5:720\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22”}
ERROR piecestore download failed {“Process”: “storagenode”, “Piece ID”: “ELSPLWLYTQW3WNHDP2UCJD2KBB4VXZSK5RRVZPYKMW7V26G235GA”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Action”: “GET_REPAIR”, “Offset”: 0, “Size”: 18688, “Remote Address”: “49.13.226.201:47162”, “error”: “write tcp 172.17.0.3:28967->49.13.226.201:47162: write: broken pipe”, “errorVerbose”: “write tcp 172.17.0.3:28967->49.13.226.201:47162: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:408\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5:720\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22”}
ERROR piecestore upload failed {“Process”: “storagenode”, “Piece ID”: “LNLE5DFR4QEWD4623AMCISYUKIJPLHGIM7XTFXRLWCJUCHF2RILA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Remote Address”: “79.127.205.227:37068”, “Size”: 1900544, “error”: “manager closed: unexpected EOF”, “errorVerbose”: “manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:226”}
ERROR services unexpected shutdown of a runner {“Process”: “storagenode”, “name”: “piecestore:monitor”, “error”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”, “errorVerbose”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:178\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:167\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
ERROR pieces failed to lazywalk space used by satellite {“Process”: “storagenode”, “error”: “lazyfilewalker: signal: killed”, “errorVerbose”: “lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:83\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:105\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:718\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”, “Satellite ID”: “1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE”}
ERROR piecestore upload failed {“Process”: “storagenode”, “Piece ID”: “J7EY3VSVH5Q37EQG4XMX6XKWD464H4ZGZRNTW3QB4WSZWR7IAO2A”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Remote Address”: “79.127.226.98:54252”, “Size”: 0, “error”: “order: grace period passed for order limit”, “errorVerbose”: “order: grace period passed for order limit\n\tstorj.io/storj/storagenode/orders.(*FileStore).BeginEnqueue:86\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder:885\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:410\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:294\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35”}

Alexey · April 18, 2024, 5:12am

Your RAID is too slow to respond on writability checks.
Please try to increase a writable check timeout on 30s

If the RAID could not pass a writable check even after 1m30s timeout, you may increase more and restart. Repeat until it wouldn’t crash with the writability check error.
If you would be forced to increase it more than 5m0s, your RAID definitely has issues and need investigation. To continue to increase a writability check timeout you will also need to increase the writability check interval (because it’s 5 minutes by default), i.e.

# how frequently to verify writability of storage directory (default 5m0s)
storage2.monitor.verify-dir-writable-interval: 5m30s

# how long to wait for a storage directory writability verification to complete (default 1m0s)
storage2.monitor.verify-dir-writable-timeout: 5m30s

sharkya · April 18, 2024, 2:16pm

I updated the writable-interval in the config. The writable-timeout didn’t exist in the config so I added it and restarted the node.

The RAID seems to perform perfectly fine when the Storj node isn’t running. While the node was offline, I ran a test several times and performance seems normal:

# hdparm -t /dev/md127
/dev/md127:
 Timing buffered disk reads: 790 MB in  3.00 seconds = 263.29 MB/sec

Is there another test you recommend to rule out the hardware as the issue?

Thanks

Alexey · April 21, 2024, 11:12am

Unfortunately it will not emulate the storagenode’s load, as you may confirm. You need to run a node to see what the real (not synthetic) response time of your RAID is.
I would assume that the storagenode software is a best tool to measure the real behavior of your RAID… but it’s an offtopic.

I would recommend to monitor your node’s logs after this change.

sharkya · April 22, 2024, 4:54pm

So far the hardware seems fine. I also noticed that when there is slowness ‘storagenode’ is consuming 99% of my 6-core CPU.

Is there anything else I should try? I vaguely remember seeing someone mentioning some storj node DB can be corrupted and deleting it and restarting the node will cause it to be rebuilt. Is that something I should try?

Edit: I also tried disabling AppArmor as mentioned in another thread

Knowledge · April 23, 2024, 1:27am

Is your node directly writing to the local drive or are you using any kind of network file system?

What kind of hard drives are they? CMR or SMR?

sharkya · April 23, 2024, 1:37am

The node is directly writing to 6x 6TB CMR SATA HDD configured into a 18TB RAID10 via mdadm.

sharkya · April 26, 2024, 1:15am

Apologies, I found a couple of the HDD in the array are SMR. Before I replace the drives, is it possible this is only a problem when the drives are almost full? This node has been running for over 2 years with these drives before this issue. If so, is it possible to reduce the node’s storage by simply reducing the docker run variable -e STORAGE="18TB"?

arrogantrabbit · April 26, 2024, 2:10am

When SMR drives get close to full the performance plummets, just like with SSDs.

Yes, but it will take some time. You will stop receiving ingress and occasional deletions will drive size down.

Toyoo · April 26, 2024, 9:41pm

This will help with node performance only if the drive supports the TRIM/DISCARD protocol and your operating system is configured to issue TRIM/DISCARD requests. Not all SMR HDDs support it, and not all operating systems are aware they need to do so.

If the drive does not support this protocol, it will likely stay slow even after the file system frees up disk space—because there’s no way to inform the drive the space is actually unused now.

sharkya · May 8, 2024, 4:54am

After replacing the SMR drive and reducing my config to 13TB (of 18TB), I am still having the same issue. This time I’m getting a couple of different errors:

ERROR	lazyfilewalker.used-space-filewalker	failed to start subprocess

ERROR	failure during run	{"Process": "storagenode", "error": "database is locked"}
Error: database is locked
INFO exited: storagenode (exit status 1; not expected)
INFO spawned: 'storagenode' with pid 525

Should I try disabling the lazy walker?

Should I try recreating the dbs or move the dbs to my SSD? (How to move DB’s to SSD on Docker)

Alexey · May 11, 2024, 3:42am

you need to move databases to SSD, if the disk cannot keep up or add more RAM, if possible.

sharkya · May 12, 2024, 1:31am

Moving the databases to the SSD solved this problem.