Storj Node v1.61.1 keeps stopping every few minutes (for the last 2 hours)

I have tried everything I can think of, with no result. When I enable UDP, the node stops within seconds with this message:
eFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 2319360, "Remote Address": "5.161.91.88:30282"}
2023-02-12T19:43:08.296+0100 ERROR servers unexpected shutdown of a runner {"name": "server", "error": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.", "errorVerbose": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func4.1:367\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-02-12T19:43:08.296+0100 ERROR pieces:trash emptying trash failed {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:156\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:316\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:381\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:83\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2023-02-12T19:43:08.304+0100 ERROR piecestore:cache error getting current used space: {"error": "context canceled; context canceled; context canceled; context canceled; context canceled; context canceled", "errorVerbose": "group:\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled"}
2023-02-12T19:43:09.422+0100 FATAL Unrecoverable error {"error": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.", "errorVerbose": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func4.1:367\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}


2023-02-11T20:00:13.272+0100 ERROR servers unexpected shutdown of a runner {"name": "server", "error": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.", "errorVerbose": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func4.1:367\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-02-11T20:00:13.275+0100 ERROR pieces:trash emptying trash failed {"error": "pieces error: filestore error: context canceled", "errorVerbose": "pieces error: filestore error: context canceled\n\tstorj.io/storj/storage/filestore.(*blobStore).EmptyTrash:156\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).EmptyTrash:316\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:381\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:83\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2023-02-11T20:00:13.286+0100 ERROR piecestore:cache error getting current used space: {"error": "context canceled; context canceled; context canceled; context canceled; context canceled; context canceled", "errorVerbose": "group:\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled\n— context canceled"}
2023-02-11T20:00:13.508+0100 FATAL Unrecoverable error {"error": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.", "errorVerbose": "read udp [::]:28967: wsarecvfrom: A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself.\n\tstorj.io/drpc/drpcserver.(*Server).Serve:107\n\tstorj.io/storj/private/server.(*Server).Run.func4.1:367\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}


If anybody can capture a packet trace while this happens, that would really be invaluable in helping us solve it. Capturing all UDP packets to/from the storagenode port (28967 by default) might be enough, but ideally I’d like to have ICMP packets included as well, in case something weird is happening with ‘Destination unreachable’ or ‘Time exceeded’ response messages.

Please guide me on how to do so.

This would involve using a packet capture tool such as Wireshark. I don’t think I can put together very helpful instructions here, given all the different possible platforms, but it has pretty good documentation.

The problem has been detected only on Windows GUI nodes; other platforms seem unaffected.

Download Wireshark here: Wireshark · Download
Installation instructions for Wireshark on Windows: 2.3. Installing Wireshark under Windows

I think the default installation options should work perfectly.

Once it’s running, in the “Capture” area of the welcome screen, where it says “Using this filter:”, enter the filter “port 28967” (or whatever port your node is running on). Hitting Enter should start a new packet capture session. It will record all incoming and outgoing packets to and from the specified port, on all of your network interfaces.
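If you also want the ICMP responses mentioned above in the same capture, a slightly broader filter should work. This is just a sketch using standard capture-filter syntax, assuming the default port 28967:

    udp port 28967 or icmp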

While the capture is running, turn on UDP or enable QUIC or do whatever it is that causes the node to crash for you.

After the node has crashed, you can stop the packet capture in Wireshark (the red square should do it) and save the capture to a file, which you can then send along with the node logs.
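If you would rather capture from the command line, the tshark tool that ships with Wireshark can do the same job. Roughly like this (a sketch; the interface name is a placeholder, replace it with one of the names listed by “tshark -D”):

    tshark -i "Ethernet" -f "udp port 28967 or icmp" -w node-capture.pcapng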


I’ve got a packet capture from someone! It’s a long one, and I don’t have a log to go with it, so it’s not 100% clear when the failure happened, but I think I have isolated the problem packet. It’s a fragmented piece of a BitTorrent DHT request(*) where the remaining followup fragments got lost or never existed. I can replicate the packet; now I just need a Windows node to test and make sure I’ve got the right thing.

My theory is that the network stacks on Linux, Darwin, and Windows treat this situation differently, and that it only produces an observable problem on Windows.
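For anyone curious, the difference is easy to see with a minimal Go sketch (an illustration of the general socket behavior, not the storagenode’s actual code): read a UDP datagram into a buffer smaller than the datagram. On Windows the read should fail with the same wsarecvfrom error shown in the logs above, while on Linux/Darwin the excess bytes are typically discarded silently and no error is returned.

// udp_bufsize.go — minimal sketch, not storagenode code.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Listen on an ephemeral UDP port on loopback.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Send a 2048-byte datagram to ourselves.
	sender, err := net.DialUDP("udp", nil, conn.LocalAddr().(*net.UDPAddr))
	if err != nil {
		panic(err)
	}
	defer sender.Close()
	if _, err := sender.Write(make([]byte, 2048)); err != nil {
		panic(err)
	}

	// Read it into a buffer that is too small for the datagram.
	// Windows: err is the wsarecvfrom "message too large" error.
	// Linux/Darwin: typically n == 1024, err == nil, rest is dropped.
	buf := make([]byte, 1024)
	n, _, err := conn.ReadFromUDP(buf)
	fmt.Printf("read %d bytes, err = %v\n", n, err)
}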

Further work will need to wait until I can wrangle my old Windows laptop into running a node, so I can see if this packet kills it. Or! If someone out there has a Windows node and doesn’t mind me possibly killing it, let me know.

(*) Why are there hundreds of nodes out on the Internet thinking this node is a participant in a BitTorrent DHT? I have no idea whatsoever. Very curious to see if there are similar BT-DHT queries going to everyone else’s nodes as well.


Update: I have a reasonably good test setup put together, running a QUIC server in the same way that the storagenode does, on Windows. But the “bad packet” doesn’t disturb it at all. I still can’t reproduce the problem. I’d like to try sending the packet to a node where we know the problem exists, in case it’s something to do with Windows version or Windows configuration. Any volunteers?
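For whoever volunteers: once the suspect datagram is exported from the capture as raw bytes, replaying it is simple. Here is a rough Go sketch; the file name and target address are placeholders, and note that a plain UDP socket only replays the payload itself, so reproducing the exact on-the-wire fragmentation (such as a missing trailing fragment) would still need a raw-packet tool.

// replay_packet.go — rough sketch with placeholder names.
package main

import (
	"net"
	"os"
)

func main() {
	// Raw bytes of the suspect datagram, exported from the capture
	// (placeholder file name).
	payload, err := os.ReadFile("suspect-datagram.bin")
	if err != nil {
		panic(err)
	}

	// Placeholder address of the volunteer node's UDP/QUIC port.
	conn, err := net.Dial("udp", "volunteer-node.example.com:28967")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Fire the payload at the node once.
	if _, err := conn.Write(payload); err != nil {
		panic(err)
	}
}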