Storj node v1.61.1 has been stopping repeatedly after a few minutes for the last 2 hours

coinbirds · August 14, 2022, 1:41pm

2022-08-14T15:17:15.900+0200 INFO piecestore downloaded {"Piece ID": "JOUUC3YOHZEVJFJM74EZRAUJQ527NQGHOH3BQYNJRQIL5XKVZS6Q", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2022-08-14T15:17:15.905+0200 INFO piecestore downloaded {"Piece ID": "6SHOHOTMYYJUO73WJLMQM43OQH62ZBESOOKMVSP2Q4N5OAYMXJQQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2022-08-14T15:17:15.906+0200 INFO piecestore upload canceled {"Piece ID": "MXGSRGMW24Q4NEHH3JZAGKJGUZKKHNDKKLZDQ2VSXK4TZZW5JECQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 65536}
2022-08-14T15:17:15.906+0200 INFO piecestore downloaded {"Piece ID": "YUKZD45FQNQ52XGEHFDZLPVNXTCVX5FEY4UBMA4S4QZF5DKVELAQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2022-08-14T15:17:15.906+0200 ERROR servers unexpected shutdown of a runner {"name": "server", "error": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.", "errorVerbose": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func5:227\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-14T15:17:15.910+0200 ERROR pieces:trash emptying trash failed {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-14T15:17:15.911+0200 ERROR pieces:trash emptying trash failed {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-14T15:17:15.925+0200 ERROR pieces:trash emptying trash failed {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2022-08-14T15:17:15.940+0200 ERROR piecestore:cache error getting current used space: {"error": "context canceled; context canceled; context canceled; context canceled; context canceled; context canceled", "errorVerbose": "group:\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled\n--- context canceled"}
2022-08-14T15:17:15.985+0200 FATAL Unrecoverable error {"error": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.", "errorVerbose": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func5:227\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Alexey · August 14, 2022, 3:07pm

A possible workaround is to disable the UDP rule in your firewall. Your node will complain that QUIC is misconfigured, but at least it will stay online.

Created a bug report: [storagenode] storagenode stop working if receives a QUIC request · Issue #5080 · storj/storj · GitHub

coinbirds · August 14, 2022, 6:06pm

Thank you for your quick reply.
I am not a PC expert and thought I would not be able to do it.
I was afraid that the satellites would be blocked.
I created new inbound and outbound rules in Windows Firewall that block UDP port 28967.

No problems for more than 2 hours

Alexey · August 15, 2022, 6:15am

There is no need to create any new rules. You should already have rules for TCP and UDP port 28967 in your inbound rules.
Please remove the two new blocking rules you have just created, especially the outbound one (there should not be any active rules there, otherwise your node will likely stop responding to customers and satellites).
Find the rule for UDP 28967 in the inbound rules and deactivate it. This should be enough to stop accepting UDP traffic until the bug is fixed. Your dashboard will show QUIC as misconfigured after the next restart of the storagenode service.

thepaul · August 25, 2022, 12:50pm

So we’re receiving a UDP packet larger than we expect the largest packet to be.

Presumably, the buffer allocated for reception of QUIC packets is as large as the Path MTU (as discovered by quic-go). Some possible reasons for receiving a larger packet:

A. PMTU discovery is not working right.
B. the Path MTU is not the same in each direction.
C. the remote side is testing whether the path MTU has become larger, and it has.
D. the large packet was spoofed on the local network where there is a higher MTU.

If (A), this would be a bug in QUIC code, in our dependency quic-go library. If we can reproduce, it may help to get a tcpdump trace.

But in any case, the right thing to do when we see that error is to drop the packet and carry on listening for more.
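
A minimal sketch of that "drop the packet and carry on" behavior, written in Python rather than the storagenode's Go, and not the actual fix. The helper names `is_oversize_error` and `receive_datagrams` are hypothetical. It assumes the oversize condition surfaces as an `OSError` carrying either `EMSGSIZE` or the Winsock code 10040 (`WSAEMSGSIZE`), which is the code behind the "datagram socket was larger than the internal message buffer" text in the log above.

```python
import errno
import socket

WSAEMSGSIZE = 10040  # Winsock "message too long" error code


def is_oversize_error(exc):
    """Return True if exc says a datagram did not fit in the receive buffer."""
    return exc.errno == errno.EMSGSIZE or getattr(exc, "winerror", None) == WSAEMSGSIZE


def receive_datagrams(sock, bufsize, limit):
    """Collect up to `limit` datagrams, skipping any that overflow the buffer.

    Stops early on a socket timeout. Any other socket error is re-raised.
    Returns (received_payloads, dropped_count).
    """
    received, dropped = [], 0
    while len(received) < limit:
        try:
            data, _addr = sock.recvfrom(bufsize)
        except socket.timeout:
            break
        except OSError as exc:
            if is_oversize_error(exc):
                dropped += 1  # drop the bad packet and keep listening
                continue
            raise
        received.append(data)
    return received, dropped
```

The point of the design is that one bad datagram costs a counter increment instead of ending the receive loop.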

Toyoo · August 25, 2022, 12:54pm

This sounds like a vector for denial of service.

thepaul · August 25, 2022, 1:22pm

Yes, indeed. Crashing on reception of a random packet is probably not the best feature.

thepaul · September 12, 2022, 11:05pm

So, after trying several different ways I haven’t been able to reproduce this behavior. It’s possible that it was fixed upstream in a newer release of the QUIC library that we use.

I’ve tried on Windows and Linux, using two different methods:

1. Sending a specially crafted overlarge packet to the listening QUIC port, on the loopback interface where the MTU is much larger than 1500
2. Changing the hardcoded packet buffer size in a storagenode build to something much smaller than the MTU, and allowing regular traffic to the QUIC port
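
For anyone who wants to see how their own OS reacts to the same condition, here is a rough loopback probe in Python. It is not the test used above, and it does not talk to a storagenode. The function name `probe_oversize` and the sizes 4000 and 1200 are arbitrary assumptions. It binds a throwaway UDP receiver on 127.0.0.1, sends it one datagram larger than the receive buffer, and reports what the receive call did.

```python
import socket


def probe_oversize(payload_size=4000, bufsize=1200, timeout=2.0):
    """Send one loopback datagram to a receiver whose buffer may be too small.

    Returns "error" if the receive call raised, "truncated" if the data was
    silently cut to fit, "ok" if the whole payload arrived, or "timeout" if
    nothing arrived within `timeout` seconds.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(timeout)
        sender.sendto(b"\x00" * payload_size, receiver.getsockname())
        try:
            data, _addr = receiver.recvfrom(bufsize)
        except socket.timeout:
            return "timeout"
        except OSError:
            return "error"
        return "truncated" if len(data) < payload_size else "ok"


if __name__ == "__main__":
    print(probe_oversize())
```

On Linux the excess bytes are discarded without an error, so the probe should report "truncated". On Windows the receive call is expected to fail with the same Winsock message that appears in the log at the top of this thread.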

Neither of these approaches caused the node to crash. If you have any more information on the problem that might help us reproduce, or if you’re still having the problem, please let me know.

coinbirds · September 25, 2022, 5:46pm

The bug reappeared in v1.63.1 (Windows GUI).
Last night the crash woke me up twice.
Fixed for now by deactivating the UDP 28967 rule in the inbound rules.

maheryu · February 11, 2023, 5:44pm

I had this problem today. I deactivated UDP and it worked. Is this permanent, or should I activate UDP again?

coinbirds · February 11, 2023, 8:40pm

Unfortunately, I don’t know why this problem keeps coming back with different version updates.
I have had it stop working again 3x for 2 hours.
I also turned off the UDP inbound rules.

Alexey · February 12, 2023, 2:53am

I’m not aware of this issue. It’s better to enable UDP so you get more traffic from the customers who use UDP.
I do not have any such problems with my three nodes (1 Windows GUI, 2 docker).
If this is happening with your node every time, I suggest opening an issue on our GitHub: Issues · storj/storj · GitHub

maheryu · February 12, 2023, 8:55am

As soon as I activate UDP, everything stops with the same message.

maheryu · February 12, 2023, 9:12am

I opened it because I see I am not the only one.

TCC · February 12, 2023, 9:25am

The same problem happens on my node. The workaround of deactivating UDP works.

Alexey · February 12, 2023, 11:12am

This is very weird. I have no problems with UDP or node stability. Is it a Windows GUI node?
How many nodes do you have on this PC?

maheryu · February 12, 2023, 11:27am

I use the Windows GUI, one node.
I have 3 more at other locations and they are OK for now.

maheryu · February 12, 2023, 11:29am

How did you fix it last time?

Vadim · February 12, 2023, 11:34am

It looks like I have something similar: one of my nodes has shut down for the third time without any error in the logs.
My log level is WARN, and the only thing I see there is that some downloads fail, which is absolutely normal as far as I know.