Node Suspended after iSCSI expansion

Storj has always handled iSCSI slice expansions gracefully, but not this time, after I raised the slice to 3 TB to increase earnings…

Instead, my node got suspended after the Storj process apparently crashed earlier this morning (May 9).

After restarting Storj, it now recognizes the additional TB of space, which it did not do at the time of expansion before I went to bed.

How long does it take to get out of suspension?
All audit checks and uptimes show 100% on all of the satellites.

Interesting log bits:

2020-05-09T01:06:12.971-0400 ERROR telemetry Failed sending report {"error": "lookup collectora.storj.io: no such host"}
2020-05-09T01:06:51.152-0400 ERROR telemetry Failed sending report {"error": "lookup collectora.storj.io: no such host"}
2020-05-09T01:07:22.969-0400 INFO piecestore download started {"Piece ID": "A66DBL3IDY4NVFFOQ63REO6NVF2RV4SMITJ4JWIDGHUWBVVSDFFA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR"}
2020-05-09T01:07:22.970-0400 ERROR piecestore download failed {"Piece ID": "A66DBL3IDY4NVFFOQ63REO6NVF2RV4SMITJ4JWIDGHUWBVVSDFFA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "error": "usedserialsdb error: disk I/O error: The device is not ready.", "errorVerbose": "usedserialsdb error: disk I/O error: The device is not ready.\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSerialsDB).Add:35\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload:523\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Download:471\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:995\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:66\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

2020-05-09T06:47:52.873-0400 ERROR piecestore:cache error persisting cache totals to the database: {"error": "piece space used error: disk I/O error: The device is not ready.", "errorVerbose": "piece space used error: disk I/O error: The device is not ready.\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).UpdatePieceTotals:174\n\tstorj.io/storj/storagenode/pieces.(*CacheService).PersistCacheTotals:100\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:80\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:56\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-05-09T06:47:56.135-0400 INFO piecestore download started {"Piece ID": "GNTPN7XZJE6KCQCOOJI2KPGETYXLINCH7I3YYQ5NUSC7CIRT6D4Q", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-05-09T06:47:56.136-0400 ERROR piecestore download failed {"Piece ID": "GNTPN7XZJE6KCQCOOJI2KPGETYXLINCH7I3YYQ5NUSC7CIRT6D4Q", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET", "error": "usedserialsdb error: disk I/O error: The device is not ready.", "errorVerbose": "usedserialsdb error: disk I/O error: The device is not ready.\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSerialsDB).Add:35\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload:523\n\tstorj.io/storj/storagenode/piecestore.(*drpcEndpoint).Download:471\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:995\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:66\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

I would suggest you check your drive with system tools. It doesn't look good.

There is also a local network issue, more likely with the DNS client.

Storj crashed hours after what is normally a graceful iSCSI expansion… it didn't start having issues until after 6 AM Eastern time.

DNS is working fine; I have done multiple nslookups on collectora and it resolves as a CNAME to collectorb. As long as Cloudflare can find the records, the collector should be reachable.
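For reference, here is a minimal Go sketch of that check using the standard resolver. The host name is the one from the telemetry error above; the CNAME target you get back may differ from mine.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Host taken from the "Failed sending report" error in the log above.
	host := "collectora.storj.io"

	// Same check as nslookup: does the name resolve, and to which CNAME?
	cname, err := net.LookupCNAME(host)
	if err != nil {
		fmt.Printf("CNAME lookup for %s failed: %v\n", host, err)
		return
	}
	fmt.Printf("%s resolves as a CNAME to %s\n", host, cname)

	addrs, err := net.LookupHost(host)
	if err != nil {
		fmt.Printf("address lookup for %s failed: %v\n", host, err)
		return
	}
	fmt.Printf("addresses: %v\n", addrs)
}
```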

Robustness suggestion:
Retry lookups of the collector names multiple times, and fall back to the backup Storj collector servers; a single failed nslookup should not be able to take a node down (see the sketch below).
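A minimal sketch of that retry-and-fallback idea, written against the standard library and assuming nothing about the real telemetry code; the backup collector host name used here is hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// lookupWithRetry retries a host lookup a few times with backoff, so a single
// transient DNS failure does not immediately count as a hard error.
func lookupWithRetry(host string, attempts int, wait time.Duration) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		addrs, err := net.LookupHost(host)
		if err == nil {
			return addrs, nil
		}
		lastErr = err
		time.Sleep(wait)
		wait *= 2 // simple exponential backoff
	}
	return nil, fmt.Errorf("lookup %s failed after %d attempts: %w", host, attempts, lastErr)
}

func main() {
	// Primary collector name is from the log; the backup name is made up for the example.
	hosts := []string{"collectora.storj.io", "collectorb.storj.io"}
	for _, h := range hosts {
		if addrs, err := lookupWithRetry(h, 3, time.Second); err == nil {
			fmt.Printf("%s -> %v\n", h, addrs)
			return
		}
	}
	fmt.Println("all collector lookups failed")
}
```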

Checking the log now, it shows upload/download cancelled over and over.
The node is receiving Tardigrade files, but the satellites will not lift the suspension even though everything is back online…

How long are the re-audits going to take?

How long does it take to get out of suspension when the audit failures were not self-inflicted?

Can you force the satellites to unlock my node?

The node should pass several audits before it comes out of suspension.
I would suggest you check your logs for failed GET_AUDIT requests.
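In case it helps, here is a quick sketch for pulling those lines out of the log; the log file name is an assumption, so point it at wherever your storagenode actually writes its log.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("storagenode.log") // hypothetical path - adjust for your setup
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // log lines can be very long
	for scanner.Scan() {
		line := scanner.Text()
		// Audit requests appear with Action GET_AUDIT; we only care about failures.
		if strings.Contains(line, "GET_AUDIT") && strings.Contains(line, "failed") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```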

After a reboot, the node is downloading and uploading Tardigrade files; hopefully the suspension will clear soon.

If you see downloads and uploads, your node has most likely come out of suspension already; otherwise you would not see traffic from the customers of the suspending satellite.

I'm not sure what process you used for expanding, but to be extra cautious you should make sure the node is shut down before making any changes to the LUN. Services may need to be restarted when changes are made, and the storagenode won't like that.


I had an iSCSI target crash under heavy I/O load generated by Storj. From what I could see in the logs, I got suspended really fast; the node sat with the drive disconnected for about 6 hours before I woke up. Getting out of suspension took less than a day, and the node is happy now.

Why do a live resize? A bit of downtime to resize the target and check that everything works is acceptable.