My node is offline and I need some urgent help to get it working again.
Last Thursday my node began thrashing. I have seen this behaviour before when there is significant free space and the system tries to service many simultaneous upload requests (sometimes more than 23), completing only very few of them successfully and finishing most with “upload canceled (race lost or node shutdown)”. This is a very inefficient mode of operation, as very few uploads are actually committed while disk IO is very high. FYI, some bad-looking statistics below (the numbers are counts of log lines):
372 INFO piecestore upload canceled
47356 INFO piecestore upload canceled (race lost or node shutdown)
66487 INFO piecestore upload started
17526 INFO piecestore uploaded
I am fully aware that this is likely due to the BTRFS file system I am using. Hopefully, some day I will find a great deal on an EXOS 20TB drive, format it with EXT4 and dedicate it to my node. Until then, I am stuck with BTRFS.
Question 1: Is there a parameter I can set to limit the number of simultaneous uploads?
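From searching older threads I got the impression that something along these lines, appended to my existing docker run command, might be what I am looking for, but the option name and a sensible value are just my guess, so please correct me:
docker run -d --name storagenode ... storjlabs/storagenode:latest \
  --storage2.max-concurrent-requests=10   # option name and value taken from other posts, not verified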
The problem really started when, for some reason, the node was restarted and the insane file walker kicked in. The file walker causes high read IO which, when it coincides with the aforementioned thrashing mode, puts major stress on the system. I have the feeling that write operations are queued in memory because the disk is serving read IO most of the time, causing memory to fill up until the system starts to swap. At this point my NAS (Synology DS720+) becomes unresponsive and I cannot even log in over SSH. After some time something panics, the container is killed and restarted, and another cycle of the same begins.
I let this go on for some time until the node finally did not come back online :-(. The last words were:
2023-10-22T06:33:54Z INFO piecestore download started {"process": "storagenode", "Piece ID": "OB2V4XME7HYBZ7LFLDBUXRPBCQY6GPSFC5TNFQAZ45XLWUMIU2DA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 6912, "Remote Address": "172.17.0.1:55164"}
2023-10-22T06:33:58Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "P4TMAT7ZDKMDIJCZAN3FYDLZHWVOYWU2UJVOK4XXUHYHZVCBCARA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "172.17.0.1:55144", "error": "order created too long ago: OrderCreation 2023-10-22 05:00:12.052186436 +0000 UTC < SystemClock 2023-10-22 06:33:51.18744009 +0000 UTC m=+12311.012677805", "errorVerbose": "order created too long ago: OrderCreation 2023-10-22 05:00:12.052186436 +0000 UTC < SystemClock 2023-10-22 06:33:51.18744009 +0000 UTC m=+12311.012677805\n\tstorj.io/common/rpc/rpcstatus.Errorf:89\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:45\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:616\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-10-22T06:33:56Z ERROR contact:service ping satellite failed {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "attempts": 2, "error": "ping satellite: failed to ping storage node, your node indicated error code: 4, rpc: tcp connector failed: rpc: context deadline exceeded", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 4, rpc: tcp connector failed: rpc: context deadline exceeded\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:209\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-10-22T06:34:45Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "OB2V4XME7HYBZ7LFLDBUXRPBCQY6GPSFC5TNFQAZ45XLWUMIU2DA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "172.17.0.1:55164", "error": "order created too long ago: OrderCreation 2023-10-22 05:00:12.978152262 +0000 UTC < SystemClock 2023-10-22 06:34:21.775732359 +0000 UTC m=+12341.600970074", "errorVerbose": "order created too long ago: OrderCreation 2023-10-22 05:00:12.978152262 +0000 UTC < SystemClock 2023-10-22 06:34:21.775732359 +0000 UTC m=+12341.600970074\n\tstorj.io/common/rpc/rpcstatus.Errorf:89\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:45\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:616\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-10-22T06:35:47Z ERROR piecestore upload failed {"process": "storagenode", "Piece ID": "WBHA4C4OO24DIRVJHDZ2X3RQOK3MP2NBJI47NYGXDLAAJHAUAGCQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "order: grace period passed for order limit", "errorVerbose": "order: grace period passed for order limit\n\tstorj.io/storj/storagenode/orders.(*FileStore).BeginEnqueue:86\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder:872\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:407\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 0, "Remote Address": "172.17.0.1:57802"}
2023-10-22T07:33:38Z INFO Configuration loaded {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-10-22T07:33:38Z INFO Anonymized tracing enabled {"process": "storagenode"}
2023-10-22T07:33:38Z INFO Operator email {"process": "storagenode", "Address": "marc.vanwijk@gmail.com"}
2023-10-22T07:33:38Z INFO Operator wallet {"process": "storagenode", "Address": "0x82ECd6c0EF426Bb65094D4A1b65B2Be275E2F3a7"}
2023-10-22T07:33:40Z INFO Got a signal from the OS: "terminated" {"process": "storagenode"}
2023-10-22T07:33:40Z INFO server kernel support for tcp fast open unknown {"process": "storagenode"}
2023-10-22T07:33:40Z INFO Telemetry enabled {"process": "storagenode", "instance ID": "1v1o2XhkVMq7tzMCLoRVKBHVVW8LxniHfP26Jz6mgy39QjdS41"}
2023-10-22T07:33:40Z INFO Event collection enabled {"process": "storagenode", "instance ID": "1v1o2XhkVMq7tzMCLoRVKBHVVW8LxniHfP26Jz6mgy39QjdS41"}
2023-10-22T07:33:41Z FATAL Unrecoverable error {"process": "storagenode", "error": "Error creating tables for master database on storagenode: migrate: database: info opening file \"config/storage/info.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:347\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Migration.func1:657\n\tstorj.io/storj/private/migrate.(*Migration).Run:164\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:370\n\tmain.cmdRun:95\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:250", "errorVerbose": "Error creating tables for master database on storagenode: migrate: database: info opening file \"config/storage/info.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:347\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Migration.func1:657\n\tstorj.io/storj/private/migrate.(*Migration).Run:164\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:370\n\tmain.cmdRun:95\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:250\n\tmain.cmdRun:97\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:250"}
Question 2: From another post I understood I could start the node without the databases, let the node create them, stop the node, copy piece_spaced_used.db back, and then restart the node. Is this accurate? If not, what would be the way to repair this?
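Before moving anything around I was planning to check, with the node stopped, which databases are actually damaged, roughly like this (the path is just an example of where my storage folder lives; sqlite3 would come from the Synology command line or a helper container):
# prints "ok" for healthy databases
for db in /volume1/storj/storage/*.db; do
  echo "$db: $(sqlite3 "$db" 'PRAGMA integrity_check;')"
done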
This will not be sufficient though, as it will land me back in the thrashing file walker scenario. So I was thinking of turning off the file walker, letting the node fill up the free space, stopping the node, starting it again with the file walker enabled, letting the file walker complete (it takes more than a full day), then stopping the node and restarting it without the file walker. I would repeat this every month or so. This brings me to:
Question 3: How can I completely disable the file walker?
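From what I could find, it might be one of these flags appended to the docker run command, but I would appreciate confirmation that these are the right names and the current way to do it:
  --storage2.piece-scan-on-startup=false   # my understanding: skips the piece scan at startup entirely
  --pieces.enable-lazy-filewalker=true     # alternative I saw mentioned: runs the scan at lower IO priority instead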
Your help is much appreciated.
Marc