StorJ Node v1.61.1 STOP several time after few minute from last 2 hours

2022-08-14T15:17:15.900+0200 INFO piecestore downloaded {“Piece ID”: “JOUUC3YOHZEVJFJM74EZRAUJQ527NQGHOH3BQYNJRQIL5XKVZS6Q”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”}
2022-08-14T15:17:15.905+0200 INFO piecestore downloaded {“Piece ID”: “6SHOHOTMYYJUO73WJLMQM43OQH62ZBESOOKMVSP2Q4N5OAYMXJQQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”}
2022-08-14T15:17:15.906+0200 INFO piecestore upload canceled {“Piece ID”: “MXGSRGMW24Q4NEHH3JZAGKJGUZKKHNDKKLZDQ2VSXK4TZZW5JECQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Size”: 65536}
2022-08-14T15:17:15.906+0200 INFO piecestore downloaded {“Piece ID”: “YUKZD45FQNQ52XGEHFDZLPVNXTCVX5FEY4UBMA4S4QZF5DKVELAQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”}
2022-08-14T15:17:15.906+0200 ERROR servers unexpected shutdown of a runner {“name”: “server”, “error”: “read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.”, “errorVerbose”: “read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func5:227\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2022-08-14T15:17:15.910+0200 ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2022-08-14T15:17:15.911+0200 ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2022-08-14T15:17:15.925+0200 ERROR pieces:trash emptying trash failed {“error”: “pieces error: filestore error: context canceled”, “errorVerbose”: “pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:154\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:310\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:367\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1:51\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2022-08-14T15:17:15.940+0200 ERROR piecestore:cache error getting current used space: {“error”: “context canceled; context canceled; context canceled; context canceled; context canceled; context canceled”, “errorVerbose”: “group:\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled”}
2022-08-14T15:17:15.985+0200 FATAL Unrecoverable error {“error”: “read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.”, “errorVerbose”: “read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func5:227\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

The possible workaround could be to disable UDP rule in your firewall, your node will complain about QUIC misconfigured, but at least it will be online.

Created a bug report: [storagenode] storagenode stop working if receives a QUIC request · Issue #5080 · storj/storj · GitHub

1 Like

Thank you for your quick reply.
I am not a pc expert and thought I would not be able to do it.
I was afraid that the satellites would be blocked.
I started new in and outbound rules with block UDP port 28967 in windows firewall

No problems for more than 2 hours

There is no need to create any new rules. You should already have rules for TCP and UDP 28967 ports in inbound rules.
Please remove your two new blocking rules, which you have created now, especially outbound ones (there should not be any active rules, otherwise your node likely will stop provide answers to the customers and satellites).
Find the rule for 28967 UDP in inbound rules and deactivate it, this should be enough to stop accept UDP traffic until bug will be fixed. Your dashboard will show QUIC misconfigured after the nearest restart of the storagenode service.

1 Like

So we’re receiving a UDP packet larger than we expect the largest packet to be.

Presumably, the buffer allocated for reception of QUIC packets is as large as the Path MTU (as discovered by quic-go). Some possible reasons for receiving a larger packet:

A. PMTU discovery is not working right.
B. the Path MTU is not the same in each direction.
C. the remote side is testing whether the path MTU has become larger, and it has.
D. the large packet was spoofed on the local network where there is a higher MTU.

If (A), this would be a bug in QUIC code, in our dependency quic-go library. If we can reproduce, it may help to get a tcpdump trace.

But in any case, the right thing to do when we see that error is to drop the packet and carry on listening for more.

This sounds like a vector for denial of service.

Yes, indeed. Crashing on reception of a random packet is probably not the best feature.

So, after trying several different ways I haven’t been able to reproduce this behavior. It’s possible that it was fixed upstream in a newer release of the QUIC library that we use.

I’ve tried on Windows and Linux, using two different methods:

  1. Sending a specially crafted overlarge packet to the listening QUIC port, on the loopback interface where the MTU is much larger than 1500
  2. Changing the hardcoded packet buffer size in a storagenode build to something much smaller than the MTU, and allowing regular traffic to the QUIC port

Neither of these approaches caused the node to crash. If you have any more information on the problem that might help us reproduce, or if you’re still having the problem, please let me know.

3 Likes

The bug reappeared in v1.63.1 win GUI.
Last night I was woken up 2x by the crash.
Error fixed for now:
28967 UDP in inbound rules deactivate