Noob node runner needs some help pls

Yes, but I wanted to know what the error was.

Search your log for FATAL errors.
(To reduce log volume, use logrotate or simply set the log level to error in config.yaml.)
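For example, from PowerShell (assuming the default Windows GUI install path; adjust if yours differs):

# search the node log for FATAL entries (sls is the built-in alias for Select-String)
sls FATAL "C:\Program Files\Storj\Storage Node\storagenode.log"

And in config.yaml, to keep only errors and above:

# log only errors and above
log.level: error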

Suspension can also come from a router reboot or internet reconnects.
Maybe you hit the 1-minute timeout error. Can you tell us about the node's hardware?

Since he calls himself a noob, I doubt he's comfortable with PowerShell.

To copy and paste a command from the linked KB article?

I guess they're able and capable, because they managed to at least generate an identity and sign it, which is CLI-only.

Fair point, let's wait for the answer(s). I'm curious too.

I forgot that entirely, and I started a node 3 weeks ago :sweat_smile:

Hello @Alexey
Here is the PowerShell output, following the instructions from the page you provided: PS C:\Users\NED> sls "GET_AUDIT|GET_REPAIR" "C:\Program Files\Storj\Storage Node\storagenode.log" | sls failed

C:\Program Files\Storj\Storage Node\storagenode.log:17669917:2023-06-15T17:34:48.515-0400 ERROR piecestore download failed {"Piece ID": "6Y5CZIHQ7RTOV2KN42UPTSZ5M6JGET57CY3XBIADHJDTVKPG6KTQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_REPAIR", "Offset": 0, "Size": 344064, "Remote Address": "128.140.12.124:56382", "error": "write tcp 10.0.1.33:28967->128.140.12.124:56382: use of closed network connection", "errorVerbose": "write tcp 10.0.1.33:28967->128.140.12.124:56382: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).sendData.func1:807\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
I hope this makes sense to you.

Just to make it clear, @daki82: I'm a noob at Storj nodes, but I know my way around computers. Searching the log file for FATAL errors didn't give me any results, but I do have a lot of upload and download errors; I don't know if that matters.

I started the node as an experiment on a mini PC with a Celeron 1017U dual-core CPU, 6 GB of RAM, and Windows 10 running on an SSD, using a network-attached 10 TB iSCSI volume from a Synology server as the Storj drive, with 2 Gb LAN and a 500 Mb/s internet connection. (This is an experiment to see whether it's worth it for me to run another spare server, with 250 TB of storage and 2 Xeon CPUs, that is sitting around offline for now.) Both the mini PC and the Synology server are powered 24/7, and I figured I could use the extra 10 TB for Storj and see what comes of it. Both have minimal downtime and the internet is solid; they do get some system updates that cause a bit of downtime, but not for too long.

As for auto-start on boot, I solved that a while ago by following a guide from your forum posts.
Thank you all for the replies and for trying to get this figured out.


And this is the reason the node did not start after the reboot, and why your suspension score dropped:

Network-attached storage is not as reliable as local storage.
You need to configure your storagenode service to depend on the network and to start with a delay, so the OS can fully bring up the network before the node starts:
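A minimal sketch from an elevated PowerShell prompt, assuming the service is named storagenode and the volume comes up via the Microsoft iSCSI Initiator service (MSiSCSI); adjust the dependency to your setup:

# start the node in the delayed-auto phase, after the network is usually up
sc.exe config storagenode start= delayed-auto
# make the service wait for the iSCSI Initiator; note: depend= replaces the whole
# dependency list, so separate multiple entries with "/"
sc.exe config storagenode depend= MSiSCSI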

By the way, if your model of Synology supports Docker, I would recommend running the storagenode directly on the Synology instead. You only need to activate Docker and SSH, then follow the guide for CLI: CLI Install - Storj Docs
You should skip the section on how to install Docker, since that is done differently on Synology.
You may also migrate the current node: Migrating from Windows GUI installation to Docker CLI - Storj Docs
When you start an existing node, you must skip the setup step too, because it should be performed only once for the entire node's life, and you already did it when you installed the Windows GUI node.


Hello Alexey
I have my Storj service set to delayed start.
As mentioned, this was only a test node to figure things out, and I'm looking forward to converting it to the Docker version running on my Synology server.
So this morning I found my node offline again (why the hell does this happen only on weekends?). The PC was online and apparently had not rebooted on its own over the weekend, and no updates had been installed; the Synology server was likewise online, with no errors in its log.
So I looked for clues:
1. I checked the Windows Event Viewer for any errors related to Storj and found one: "The Storj V3 Storage Node service terminated unexpectedly. It has done this 1 time(s)." (A PowerShell way to pull these events is sketched after the log excerpts below.)
2. I looked in the Storj log file and saw a bunch of upload and download error messages: 2023-08-18T18:37:18-04:00 ERROR piecestore upload failed {"Piece ID": "PJLBWZN6MPTUVU53DHIAMBDECVBKD7KZWD5SHU3JNDVT5JJFLXTA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "context canceled", "errorVerbose": "context canceled\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:500\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:506\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 36864, "Remote Address": "5.161.149.40:8678"}
2023-08-18T18:37:39-04:00 ERROR piecestore download failed {"Piece ID": "IFG4MVIOVTDUF4OYCK3FDVQRXA5ODCGBHSCYIGKQCZOFDZI7TYWQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 311296, "Remote Address": "51.77.227.245:50426", "error": "manager closed: read tcp 10.0.1.33:28967->51.77.227.245:50426: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.", "errorVerbose": "manager closed: read tcp 10.0.1.33:28967->51.77.227.245:50426: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:231"}
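(Side note: a sketch for pulling those Service Control Manager events from PowerShell instead of clicking through Event Viewer; it assumes the event message text contains "Storj", as in the event quoted above:)

# list recent service events from the System log that mention Storj
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Service Control Manager' } -MaxEvents 200 |
    Where-Object Message -Match 'Storj' |
    Select-Object TimeCreated, Message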

Does any of this give you any clues about what caused the Storj node to stop? Should I look for clues anywhere else?
The Windows error came up this morning, but Storj started throwing those upload and download errors late in the evening of 08/18/23 and kept doing so until the service crashed this morning.

I forgot to post this FATAL error as well, which came up in the Storj log file this morning before the node service crashed: 2023-08-21T00:52:50-04:00 FATAL Unrecoverable error {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:169\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:161\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

This means your drive has a bottleneck.
Read here.

Installing the newest network card drivers from the manufacturer's homepage is always recommended.


Because of:

As I said, the disk subsystem is too slow. That is kind of expected, since you use network-attached storage. In this case you would need to increase the writability timeout (storage2.monitor.verify-dir-writable-timeout):

I would expect that it can hit readability timeouts too, so you would increase two parameters for that case: storage2.monitor.verify-dir-readable-timeout and storage2.monitor.verify-dir-readable-interval.

If any of the timeouts reaches 5 minutes after tuning, I would recommend reconsidering your setup.

Hi
so I made these 4 changes, as I understood it:

# how frequently to verify the location and readability of the storage directory
 storage2.monitor.verify-dir-readable-interval: 1m30s

# how long to wait for a storage directory readability verification to complete
storage2.monitor.verify-dir-readable-timeout: 1m30s

# how frequently to verify writability of storage directory
 storage2.monitor.verify-dir-writable-interval: 1m30s

# how long to wait for a storage directory writability verification to complete
storage2.monitor.verify-dir-writable-timeout: 1m30s

Correct?
I'll play with those settings and keep it running somehow until I switch to Docker on the Synology.

Hi
Yes, I got all the latest drivers from the manufacturer's website, not generic ones.

Not exactly. You should only increase the timeouts that caused the node to crash, not all of them.
Changed improperly, they can do harm. For example, suppose your disk becomes corrupted, but the readability check does not detect it in time and does not shut down the node to protect it from disqualification.
How is this related to disqualification? In that example, while the disk is dying, the node cannot provide a piece for an audit within the 5-minute timeout; after it fails to do so 2 more times, the audit is considered failed. Several failed audits like that and the node will be disqualified.
If the readability timeout is shorter, this internal monitoring will stop the node before it even starts failing audits.

If you had a crash because the readability timeout was exceeded during the check, and you performed the recommended actions (checked and fixed the disk, performed a defragmentation) but the readability errors still occur, you may slowly increase the readable timeout and the readable interval, because they are both 1m0s by default.
There should be no spaces before the option, otherwise the node may not start due to invalid YAML format.

it should be:

# how frequently to verify the location and readability of the storage directory
storage2.monitor.verify-dir-readable-interval: 1m30s
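and the matching timeout line follows the same pattern (1m30s is just the example value from above):

# how long to wait for a storage directory readability verification to complete
storage2.monitor.verify-dir-readable-timeout: 1m30s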

If your node struggles to write to the disk, and you performed the recommended actions (checked and fixed the disk, performed a defragmentation) but the writability errors still occur, you may slowly increase the writable timeout, but not the writable check interval, because their defaults are different (the writable timeout is 1m0s by default, but the writable check interval is 5m0s by default).

So you should not change the writable check interval unless you have increased the writable timeout beyond 5m0s (which is already a red alert for your disk subsystem, and you need to check the disk surface and S.M.A.R.T.).
So, you should not add/uncomment this parameter:

And again, if you added this parameter, it should not have spaces before it. To comment it out, you may add a # character in front of it:

# how frequently to verify writability of storage directory
# storage2.monitor.verify-dir-writable-interval: 5m0s

Save the config and restart the node.
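For example, from an elevated PowerShell prompt (assuming the Windows service is named storagenode):

# restart the node so the config changes take effect
Restart-Service storagenode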

OK, got it. I just verified from the log file that the FATAL errors were caused by verify-dir-readable-timeout and verify-dir-writable-timeout, so I extended those to 1m30s, returned the other 2 to their default values, and commented them out as they were before the change. I'll monitor for a few days to see what happens and will keep you posted. Thanks again for your time, help, and clarifications.

OK, so I was looking at the log file again, and I'm seeing a lot of these upload and download errors from the satellite us1.storj.io:7777:

2023-08-23T14:09:33-04:00 ERROR piecestore upload failed {"Piece ID": "3P6L4FYOQYAPJXQTXJ2S2NR5YKCRTZTPWV44I7W3INRFPGATTM7A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "context canceled", "errorVerbose": "context canceled\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:500\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:506\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 10240, "Remote Address": "5.161.149.40:11876"}
2023-08-23T14:09:37-04:00 ERROR piecestore download failed {"Piece ID": "SAX7CIVEKWJ25YUIFPB5THTE6UB5AR6VXHJJTBEIG6TAJ2JZXUMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 540672, "Remote Address": "5.161.207.152:64112", "error": "write tcp 10.0.1.33:28967->5.161.207.152:64112: wsasend: An existing connection was forcibly closed by the remote host.", "errorVerbose": "write tcp 10.0.1.33:28967->5.161.207.152:64112: wsasend: An existing connection was forcibly closed by the remote host.\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:401\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:462\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:349\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).sendData.func1:816\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}

Is this error caused by something on my side?

I will try to explain one more time.
The readability check interval has a default value of 1m0s, and the readability timeout is 1m0s by default too.
So if you change the readability timeout, you also need to change its check interval; otherwise the check will run more often than the timeout allows, the checks will overlap, and that will likely crash the node even more often.

The writability check interval is 5m0s by default, while the writability timeout is 1m0s by default (they are different).
So if you change the writability timeout, you would change its check interval only if the writability timeout becomes greater than 5m0s. See the example below.
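For example (hypothetical values, only to illustrate the rule): if you had to raise the writability timeout beyond the 5m0s default interval, both lines would move together:

# how long to wait for a storage directory writability verification to complete
storage2.monitor.verify-dir-writable-timeout: 6m0s
# how frequently to verify writability of storage directory
storage2.monitor.verify-dir-writable-interval: 6m30s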

I hope you understand it better now.
So, if you have both timeout errors, you need to increase:
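storage2.monitor.verify-dir-readable-timeout
storage2.monitor.verify-dir-readable-interval (to match, as explained above)
storage2.monitor.verify-dir-writable-timeout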

These errors

mean that the remote host closed the connection because of long-tail cancellation: your node lost the race for those pieces. If you see a lot of such errors, though, it could also be your router.

OK, got it, made the changes.
So what sort of problem should I be looking for in my router?

Does rebooting the router reduce the number of errors? If not, then these are only long-tail cancellation errors.