Fatal Error on my Node

If it helps I can provide a teamviewer session on my node so the developer can try to find the problem.

Then it is likely not a cable problem, but a problem with the VM software.
Or the disk is too slow to respond.

The only feature that was added is a timeout for when the node cannot access the disk (it was absent before, and you could fail audits because of the audit timeout - 5 minutes for each of 3 tries).
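If the default is too tight for a particular setup, the timeout should be tunable in config.yaml with options along these lines (a sketch only; the key names and defaults are assumptions, please check the generated config of your version):

# how long the node waits for the storage directory checks before shutting down
# (assumed default 1m0s; raising it is a workaround, not a fix for a slow disk)
storage2.monitor.verify-dir-readable-timeout: 1m30s
storage2.monitor.verify-dir-writable-timeout: 1m30s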

It's getting worse…
…since I run no VM (it's directly on Windows).
I have decent egress and ingress all day, but downtimes. STILL NO AUDIT ERRORS.

But still your disk is not responsive. Is it an SMR drive?

It's a WD Elements 12TB CMR and works fine, also with Storj running; the drive is usable.
Nice two-digit gigabyte ingress/egress all day, despite the downtime caused by the node service stopping.
Tested yesterday and today.

This is weird, my Windows node doesn’t have this issue.
Is this drive used by something else? Or do you maybe use SMB to connect this drive?

After the service stopped last time, I changed the node's maximum capacity and set it to exactly the amount that was already used. In other words, if the maximum was 5TB and I currently have 3.5TB used, I set the maximum capacity to 3.5TB in the config.yaml file. At the moment it has been 25+ hours without the service stopping, but obviously that is not a solution, since the hard drive is not full…
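For reference, the line I changed looks roughly like this (the key name is the one in my config.yaml; treat it as an example, your file may differ by version):

# total disk space the node is allowed to use; I set it to the space already used
storage.allocated-disk-space: 3.50 TB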

None of that. It's excluded from AV and firewall scanning, indexing of the contained data is disabled. No other use, no network drive.

Then I have no idea why your drive needs so much time (1m) to return a requested file or to be able to write something.

Can too many canceled downloads cause the node software to get stuck at assigning bandwidth or something? Is the number of parallel downloads somehow limited, and does canceling incorrectly occupy some resources that run out, and therefore time out? It apparently takes just one minute, or 3, and after that it runs fine again. Is your node full already? Mine is not.

I have 13 nodes as VMs. All are working. The disk is 5 months old and I have 3 nodes on this disk.

This points to an error with incoming data streams (if it stays up now).

Fair enough, but my audit score was previously 100% straight across the board. Never an issue failing audits.

My node has now been running 9.5h without the service restarting.

I just changed log.**** in the config from info to error… so that the log data doesn't take up so much space, ~50 MB per day… now around 11 MB (archived yesterday).
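For anyone who wants to do the same, the line in config.yaml is roughly the following (assuming the key is log.level; restart the storagenode service afterwards so it takes effect):

# log only errors instead of every info line to keep the log small
log.level: error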

My audit score is slowly going back up.

Online status is also going up a bit on 5 of 6 satellites; 1 had dropped 0.05%.

The dashboard still doesn't add up… (screenshots from 28.03 and 29.03)

The difference from the other reports is the audit score; mine is not at 100%, while the others report 100% audits.

Also consider that used + trash = used space in Windows.

Maybe the router resynced? Maybe the drive had a short power cut or errors; run chkdsk as mentioned above.
Or a faulty cable or something. Also check the Windows event log, etc.
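For example, from an elevated prompt, something like this (adjust the drive letter; /f fixes file system errors, /r also scans for bad sectors, and the PowerShell line pulls recent disk events from the System log - the provider name may differ on your system):

chkdsk D: /f /r
Get-WinEvent -FilterHashtable @{ LogName='System'; ProviderName='disk' } -MaxEvents 20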

I wanted to add that most errors in my log (~95%) are related to satellite ID 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs.
It looks like it is hammering my node at full power.

Hello,
I have exactly the same issues described here. It is a Windows 11 node with a dedicated HDD for Storj. Score and uptime were near 100%.
Chkdsk shows no errors and everything seems to be okay. But my node service crashes after a few hours (between 5-10 after a restart). It seems to have started with the update to version v1.75.2. My stats are getting worse because of the restarts and downtimes in the last 3 days.

Is it possible to downgrade to check if it is the new version?

2023-03-29T06:28:52.514+0200 ERROR services unexpected shutdown of a runner {"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-03-29T06:29:07.514+0200 WARN services service takes long to shutdown {"name": "piecestore:cache"}
2023-03-29T06:29:07.514+0200 WARN services service takes long to shutdown {"name": "gracefulexit:chore"}

All my Synology Docker nodes have been on 1.75.2 for more than 24h and all are working fine. I don't get those errors. I didn't enable TCP FAST OPEN. I saw 2 online scores drop below 96% (different nodes, different sats) but they are recovering. It was a temporary hiccup, maybe caused by the update.
I think it is a problem related to Windows nodes or TCP FAST OPEN. I only get these 3 types of errors in my logs:

2023-03-29T20:28:39.421873561Z	stdout	2023-03-29T20:28:39.421Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "G5L33ZSOFRYSIBOTEJQC5RFNUTNHMAT4PCLW3YRY5V5K2UPBF76A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:197\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223", "Size": 524288, "Remote Address": "172.17.0.1:33498"}
2023-03-29T20:26:39.618812359Z	stdout	2023-03-29T20:26:39.618Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "5H3C3BTMJUJT3HABYZWFV2G2TFFTRHK6KIZJLNNEAGCRODIQAIVA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1569792, "Size": 311296, "Remote Address": "172.17.0.1:33898", "error": "write tcp 172.17.0.3:28967->172.17.0.1:33898: write: broken pipe", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:33898: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6.2:729\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2023-03-29T20:26:17.632935680Z	stdout	2023-03-29T20:26:17.630Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "FTT53ZRFGGIU4ARBB7L37JYGNOCREFXB36CSV5P6GWW2B6BWL5DQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1239296, "Size": 311296, "Remote Address": "172.17.0.1:33788", "error": "write tcp 172.17.0.3:28967->172.17.0.1:33788: write: broken pipe", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:33788: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:367\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:458\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6.2:729\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

I also have the log level set to error. Maybe the info level combined with the new Noise connections is causing high I/O on your drives. Set your log level to error and wait.

Then reduce the usage of this disk to one node per disk, not three.

Unfortunately this issue may happen at a random time, for example when your disk develops a bad sector and gets stuck on it while trying to read. Without a timeout your node would start to fail audits (because it cannot read even a single piece).
So this timeout in the directory verification is a good thing.

The data is coming directly from/to the customers of that satellite, not from/to the satellite itself.

So there is no need to revert, but you do need to fix the underlying issue: why does your disk become so saturated, or even disconnected, that it is unable to write a few bytes within the 1-minute timeout?

I am downgrading tonight to 1.74.1 on my Windows node.
I expect my node to stop crashing because I don't think the previous version has this readability/writability check. However, I am testing this to see if my audit scores suffer. If they don't suffer, then I will be questioning whether there is really anything wrong with my drive.

The drive is 5 TB in size and mostly full of Storj data, so a chkdsk /B is going to take 60 hours if the ETA is accurate. Given these repeated node FATAL errors, my online score is already suffering a bit, so I am going to let it recover for a while.

Were the parameters added to config.yaml to allow this timeout value to be changed from 1m0s? I looked but did not find anything with a description closely matching this timeout.
