Fatal Error on my Node

If it helps I can provide a teamviewer session on my node so the developer can try to find the problem.

Then it is likely not a cable problem, but a problem with the VM software.
Or the disk is too slow to respond.

The only feature that was added is a timeout for when the node cannot access the disk (it was absent before, and you could fail audits because of the audit timeout - 5 minutes for each of 3 tries).
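If the default is too tight for a particular setup, the timeout should be tunable in config.yaml with options along these lines (a sketch only; the key names and defaults are assumptions, please check the generated config of your version):

# how long the node waits for the storage directory checks before shutting down
# (assumed default 1m0s; raising it is a workaround, not a fix for a slow disk)
storage2.monitor.verify-dir-readable-timeout: 1m30s
storage2.monitor.verify-dir-writable-timeout: 1m30s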

It's getting worse…
…since I run no VM (it's directly on Windows).
I have decent egress and ingress all day, but downtimes. STILL NO AUDIT ERRORS.

But still your disk is not responsive. Is it an SMR drive?

It's a WD Elements 12TB CMR and works fine, also with Storj running; the drive is usable.
Nice two-digit gigabyte ingress/egress all day, despite the downtime caused by the node service stopping.
Tested yesterday and today.

This is weird, my Windows node doesn’t have this issue.
Is this drive used by something else? Or do you maybe use SMB to connect this drive?

After the service stopped last time, I changed the node's maximum capacity and set it to exactly the amount that was already used. In other words, if the maximum was 5TB and I currently have 3.5TB used, I set the maximum capacity to 3.5TB in the config.yaml file. At the moment it has been 25+ hours without the service stopping, but obviously that is not a solution, since the hard drive is not full…
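For reference, the line I changed looks roughly like this (the key name is the one in my config.yaml; treat it as an example, your file may differ by version):

# total disk space the node is allowed to use; I set it to the space already used
storage.allocated-disk-space: 3.50 TB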

None of that. It's excluded from AV and firewall scanning, indexing of the contained data is disabled. No other use, no network drive.

Then I have no idea why your drive needs so much time (1m) to return a requested file or to be able to write something.

Can too many canceled downloads cause the node software to get stuck at assigning bandwidth or something? Is the number of parallel downloads somehow limited, and does canceling incorrectly occupy some resources that run out, and therefore time out? It apparently takes just one minute, or 3, and after that it runs fine again. Is your node full already? Mine is not.

I have 13 nodes as VMs. All are working. The disk is 5 months old and I have 3 nodes on this disk.

This points to an error with incoming data streams (if it stays up now).

Fair enough, but my audit score was previously 100% straight across the board. Never an issue failing audits.

My node has now been running 9.5h without the service restarting.

I just changed log.**** in the config from info to error… so that the log data doesn't take up so much space, ~50 MB per day… now around 11 MB (archived yesterday).
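For anyone who wants to do the same, the line in config.yaml is roughly the following (assuming the key is log.level; restart the storagenode service afterwards so it takes effect):

# log only errors instead of every info line to keep the log small
log.level: error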

My audit score is slowly going back up.

Online status is also going up a bit on 5 of 6 satellites; 1 had dropped 0.05%.

The dashboard still doesn't add up… (screenshots from 28.03 and 29.03)

The difference from the other reports is the audit score; mine is not at 100%, while the others report 100% audits.

Also consider that used + trash = used space in Windows.

Maybe the router resynced? Maybe the drive had a short power cut or errors; run chkdsk as mentioned above.
Or a faulty cable or something. Also check the Windows event log, etc.
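For example, from an elevated prompt, something like this (adjust the drive letter; /f fixes file system errors, /r also scans for bad sectors, and the PowerShell line pulls recent disk events from the System log - the provider name may differ on your system):

chkdsk D: /f /r
Get-WinEvent -FilterHashtable @{ LogName='System'; ProviderName='disk' } -MaxEvents 20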

I wanted to add that most errors in my log (~95%) are related to satellite ID 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs.
It looks like it is hammering my node at full power.

Hello,
I have exactly the same issues described here. It is a Windows 11 node with a dedicated HDD for Storj. Score and uptime were near 100%.
Chkdsk shows no errors and everything seems to be okay. But my node service crashes after a few hours (between 5-10 after a restart). It seems to have started with the update to version v1.75.2. My stats are getting worse because of the restarts and downtimes in the last 3 days.

Is it possible to downgrade to check if it is the new version?

2023-03-29T06:28:52.514+0200 ERROR services unexpected shutdown of a runner {"name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:150\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:146\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-03-29T06:29:07.514+0200 WARN services service takes long to shutdown {"name": "piecestore:cache"}
2023-03-29T06:29:07.514+0200 WARN services service takes long to shutdown {"name": "gracefulexit:chore"}

All my Synology Docker nodes have been on 1.75.2 for more than 24h and all are working fine. I don't get those errors. I didn't enable TCP FAST OPEN. I saw 2 online scores drop below 96% (different nodes, different sats) but they are recovering. It was a temporary hiccup, maybe caused by the update.
I think it is a problem related to Windows nodes or TCP FAST OPEN. I only get these 3 types of errors in my logs:

2023-03-29T20:28:39.421873561Z	stdout	2023-03-29T20:28:39.421Z ERROR piecestore upload failed {"Process": "storagenode", "Piece ID": "G5L33ZSOFRYSIBOTEJQC5RFNUTNHMAT4PCLW3YRY5V5K2UPBF76A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:197\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223", "Size": 524288, "Remote Address": "172.17.0.1:33498"}
2023-03-29T20:26:39.618812359Z	stdout	2023-03-29T20:26:39.618Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "5H3C3BTMJUJT3HABYZWFV2G2TFFTRHK6KIZJLNNEAGCRODIQAIVA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1569792, "Size": 311296, "Remote Address": "172.17.0.1:33898", "error": "write tcp 172.17.0.3:28967->172.17.0.1:33898: write: broken pipe", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:33898: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6.2:729\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2023-03-29T20:26:17.632935680Z	stdout	2023-03-29T20:26:17.630Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "FTT53ZRFGGIU4ARBB7L37JYGNOCREFXB36CSV5P6GWW2B6BWL5DQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1239296, "Size": 311296, "Remote Address": "172.17.0.1:33788", "error": "write tcp 172.17.0.3:28967->172.17.0.1:33788: write: broken pipe", "errorVerbose": "write tcp 172.17.0.3:28967->172.17.0.1:33788: write: broken pipe\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:367\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:458\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func6.2:729\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

I also have the log level set to error. Maybe the info level combined with the new Noise connections is causing high I/O on your drives. Set your log level to error and wait.

Then reduce the usage of this disk to one node per disk, not three.

Unfortunately this issue may happen at a random time, for example when your disk develops a bad sector and gets stuck on it while trying to read. Without a timeout your node would start to fail audits (because it cannot read even a single piece).
So this timeout in the directory verification is a good thing.

The data is coming directly from/to the customers of that satellite, not from/to the satellite itself.

So there is no need to revert, but you do need to fix the underlying issue: why does your disk become so saturated, or even disconnected, that it is unable to write a few bytes within the 1-minute timeout?

I am downgrading tonight to 1.74.1 on my Windows node.
I expect my node to stop crashing because I don't think the previous version has this readability/writability check. However, I am testing this to see if my audit scores suffer. If they don't suffer, then I will be questioning whether there is really anything wrong with my drive.

The drive is 5 TB in size and mostly full of Storj data, so a chkdsk /B is going to take 60 hours if the ETA is accurate. Given these repeated node FATAL errors, my online score is already suffering a bit, so I am going to let it recover for a while.

Were the parameters added to config.yaml to allow this timeout value to be changed from 1m0s? I looked but did not find anything with a description closely matching this timeout.
