Fatal Error on my Node

Strange thing: in the config file the writability timeout is 5m, while the error happens after just 1m. How is that possible?

filestore.write-buffer-size: 128.0 KiB was the culprit; it is set to 4 MiB on the node that runs without dropouts. I've gone through everything again and that is the only parameter that was different; only the order of the parameters differs slightly. Now both nodes are running great with 4 MiB.
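For anyone comparing their own setup, the line in question in config.yaml looks roughly like this on the node that runs without dropouts (just a sketch showing the value that worked here):

    filestore.write-buffer-size: 4.0 MiB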

Now the only question is why the setting was different. I hadn't changed or entered any settings myself before the error occurred. So why was one node installed with 4 MiB and the other with 128 KiB?


I guess the default is 128 KiB.

Since some pieces are 2-3 MB, maybe this caused the slowness.

But why is one node set to 128 KiB and the other to 4 MiB, although I hadn't changed anything on either of them? (Both are version 1.75.2.0.) That would mean something was changed by your update or version, wouldn't it? And one node did not get this change.

Was the # still in front or not?

No, it was active on both nodes, so without the #.


Only the position of the setting was different: once directly at the beginning, and once in the middle of the config file.

That is more likely an interval, not a timeout. Please check carefully. These timeout parameters were added only recently, so they are probably missing from your config.yaml.
However, if you set the writability check timeout parameter to 5 minutes but still receive writability timeout errors after 1 minute, it means you either didn't save the file and/or didn't restart the node after the change.
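To illustrate the difference, here is a sketch of the two settings as they would appear in config.yaml. The key names storage2.monitor.verify-dir-writable-interval and storage2.monitor.verify-dir-writable-timeout are assumed from recent storagenode versions, so verify them against your own file. The interval controls how often the check runs; the timeout controls how long a single check may take before the node raises the FATAL error:

    # assumed key names - check them against your own config.yaml
    # how often the writability check runs
    storage2.monitor.verify-dir-writable-interval: 5m0s
    # how long one check may take before the node crashes (default is 1m0s)
    storage2.monitor.verify-dir-writable-timeout: 5m0s

After editing you also have to save the file and restart the node service, otherwise the old values stay in effect.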

Perhaps the default value has changed recently.
Added to my summary post, thank you!

What's the device, OS, and filesystem, how is the disk connected, and is it an SMR drive?

Well, I am pretty sure I did restart. Anyway, I rolled back to 1.74.2 and the problems are gone, but I wonder what I can do for the next version. I don't think my node is picking up my config file correctly.

That is equivalent to setting the timeout to a generously large value like a month or two: the problem is not gone, it's just hidden. In that older version the checkers did not have this timeout at all, so they will simply hang forever instead of crashing when there is a problem with the writability or readability check.

My node suddenly goes offline every other day.
The only trace in the log is this. Can anyone advise on a solution? :frowning:

2023-04-18T21:36:22.082+0200 FATAL Unrecoverable error {“error”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”, “errorVerbose”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

See the previous replies in this topic. The solution is there. You have to increase the timeout setting in 30s increments until you get no more restarts.
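As a sketch of what that looks like (assuming the same timeout key as discussed above), you would raise the value from the 1m0s default step by step:

    # assumed key name - check it against your own config.yaml
    storage2.monitor.verify-dir-writable-timeout: 1m30s

Restart the node after each change; if the FATAL error still appears, go to 2m0s, and so on, while also looking into why the disk is that slow in the first place.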

Hoping for help here, as our node (v1.77.0-rc) has started behaving very strangely and shuts down after a while. I managed to work around it by enabling automatic restart on the Storj V3 Node service.
Obviously not good, as we lose suspension & audit score.
We previously had v1.76.2 with the same error.
Attached are the last lines of the log.
Thanks for any tips/help getting around this.
Could downgrading be an idea?

2023-04-19T18:17:41.627+0200 INFO piecedeleter delete piece sent to trash {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "CZ37MKFF3IN2XONXYUPTXX6OWE75YZB7MTBV7XNI2V256CLQONQ3"}
2023-04-19T18:17:41.651+0200 INFO piecestore upload started {"Piece ID": "ETY2GG6XT3TUWOAISXK5HPFQK7TGX5X3IUOGP5DBJNZIO3XTXBAQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 7570757907247, "Remote Address": "5.161.149.40:53872"}
2023-04-19T18:17:41.749+0200 INFO piecestore uploaded {"Piece ID": "ETY2GG6XT3TUWOAISXK5HPFQK7TGX5X3IUOGP5DBJNZIO3XTXBAQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "5.161.149.40:53872"}
2023-04-19T18:17:42.844+0200 INFO piecedeleter delete piece sent to trash {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "5FG3AFF43HUGKCLS4DJD5ACSVJXDUSWLLPUKHMY3HNNTV5FRZELQ"}
2023-04-19T18:17:43.683+0200 INFO piecedeleter delete piece sent to trash {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ETY2GG6XT3TUWOAISXK5HPFQK7TGX5X3IUOGP5DBJNZIO3XTXBAQ"}
2023-04-19T18:17:44.050+0200 INFO piecestore upload started {"Piece ID": "5732ZUWXHQU2UNRFSSPTDJJZIE3FS524U33GUXV2RONC33QGBIBA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "5.161.184.111:37590"}
2023-04-19T18:17:44.078+0200 INFO piecestore download started {"Piece ID": "52KZP5A2RBS5PLVNT7A5VSAPOKB5L4QHW3V3HIIRA3IJQQ36SGRA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 8960, "Remote Address": "5.161.111.220:61416"}
2023-04-19T18:17:44.203+0200 INFO piecestore uploaded {"Piece ID": "5732ZUWXHQU2UNRFSSPTDJJZIE3FS524U33GUXV2RONC33QGBIBA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "5.161.184.111:37590"}
2023-04-19T18:17:44.337+0200 INFO piecestore downloaded {"Piece ID": "52KZP5A2RBS5PLVNT7A5VSAPOKB5L4QHW3V3HIIRA3IJQQ36SGRA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 8960, "Remote Address": "5.161.111.220:61416"}
2023-04-19T18:17:44.403+0200 INFO piecestore download started {"Piece ID": "4NVKR57GLMNIN5ENWOZRYL6TTKYTFAAWMSTTSJN7ZURFQJRXOY7A", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 221696, "Size": 2560, "Remote Address": "167.235.66.196:26890"}
2023-04-19T18:17:44.602+0200 INFO piecestore downloaded {"Piece ID": "4NVKR57GLMNIN5ENWOZRYL6TTKYTFAAWMSTTSJN7ZURFQJRXOY7A", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "Offset": 221696, "Size": 2560, "Remote Address": "167.235.66.196:26890"}
2023-04-19T18:17:44.758+0200 INFO piecestore download started {"Piece ID": "LHXKHEZJVWJAQTZASXYCXOCJXUQFE62AV6OPPN43MFNTE7KZKIQA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 6400, "Remote Address": "5.161.217.169:33024"}
2023-04-19T18:17:44.904+0200 INFO piecestore downloaded {"Piece ID": "LHXKHEZJVWJAQTZASXYCXOCJXUQFE62AV6OPPN43MFNTE7KZKIQA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 6400, "Remote Address": "5.161.217.169:33024"}
2023-04-19T18:17:45.085+0200 INFO piecestore upload started {"Piece ID": "RAULCGNFQJ45S7EF4QLIZ3KIOVD47TWVGMKJMI26FBS35UDU55TQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "5.161.184.111:37590"}
2023-04-19T18:17:45.145+0200 INFO piecestore uploaded {"Piece ID": "RAULCGNFQJ45S7EF4QLIZ3KIOVD47TWVGMKJMI26FBS35UDU55TQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "5.161.184.111:37590"}
2023-04-19T18:17:45.309+0200 INFO piecestore uploaded {"Piece ID": "6Y6VC2ABX7H4AERMR5CQCZLJALWQYSG2VNJF3MLQ2OUI7JCC5YWA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "38.88.241.42:13676"}
2023-04-19T18:17:45.313+0200 INFO piecestore upload started {"Piece ID": "NESEHTMN2UP65AONWAS2SKTQNG7CYD24SAFQPEZYLFT4TOOALX

Hi @Sharkey
This log is not helpful. Please post the entries from just before the crash; they should show an ERROR or FATAL level.

At a guess it sounds like your issue may be this - Fatal Error on my Node


Hi @Regular! Thanks a lot for the quick response.
Here is the log text from just before it went offline and started up again.
I have also checked the storage space and there is about 9 TB available.
I will run a chkdsk on the drive.

2023-04-19T16:10:57.843+0200 ERROR piecestore upload failed {"Piece ID": "4BKRV4W4R2PW2FAF5JOEUHJUWMWRYXXWN7TKH2FP3MPAATOAOEBA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "context canceled", "errorVerbose": "context canceled\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func5:498\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:504\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 6400, "Remote Address": "5.161.146.178:17330"}
2023-04-19T16:10:58.094+0200 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:163\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:155\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

Yes, this is the same error. Check the options in the thread I linked - Fatal Error on my Node. Your node's storage is running slowly and timed out after 1 minute while a write check was being performed. The disk should be fine in terms of data integrity; it's just slow to respond.


Well, this wrecked my node. I had so many offline periods because of this, and after restart number 10 or so this one comes up:
(screenshot of the Windows service error dialog, in Norwegian)
For those not fluent in Norwegian:
Cannot start the Storj V3 service on the local computer.
Error 1067: The process terminated unexpectedly.
Even though my English is so-so, maybe you have advice?

It also happened before I changed the timeout. Though I don't understand why my node should suddenly start doing this with 3 TB of data on board and no hiccups in a long while :frowning:

What about my config file not being picked up by the node? I set it to 5 minutes, but it kept crashing after a 1-minute delay, not 5.

Perhaps you broke config.yaml. What are the last 20 lines in your logs?