Connection reset by peer errors

The success rate script only uses the storage node's log output, and the issue is about how that output is interpreted, not about the success rate script itself.


I’m still seeing this… is there an update coming soon that fixes it?

I don’t think it’s correct to assume these numbers are wrong. On average 10 out of 39 downloads are canceled. That’s 26%. Maybe your node is a little slower than average and we’re finally seeing the actual numbers pop up in the logs.
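
If you want to check the ratio on your own node, counting finished vs. canceled downloads straight from the log is enough; a rough sketch, assuming the usual docker setup with a container named storagenode and the standard "downloaded" / "download canceled" piecestore messages:

# successful GET downloads vs. canceled ones, counted from the node log
docker logs storagenode 2>&1 | grep '"Action": "GET"' | grep -c "downloaded"
docker logs storagenode 2>&1 | grep '"Action": "GET"' | grep -c "download canceled"

The second count divided by the sum of the two is roughly the cancel rate the success rate script reports.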

For what it’s worth, I’m seeing a lot more canceled downloads as well.


@BrightSilence good to know, maybe it just wasn’t being reported in the logs before now…

Hi all. I’m a relatively new SNO. Everything has been running fine, but for the past day or so, I’m seeing an error like this every couple of seconds:

2023-03-21T22:00:00.935Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "Y53PAJPYHPBRPRDEBTFDYV4E6UOPTKDBPGTUS2JY55JJYSDO6UPQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1221120, "Size": 2560, "Remote Address": "184.104.224.98:40764", "error": "manager closed: read tcp 10.88.0.3:28967->184.104.224.98:40764: read: connection reset by peer", "errorVerbose": "manager closed: read tcp 10.88.0.3:28967->184.104.224.98:40764: read: connection reset by peer\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:183\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223"}

… or …

2023-03-21T21:59:40.359Z        ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "AYNAEOIVJHJCR4TCJ2EGZNJVW5J2WFNIPJF2VHTNDFDGIV3IUVXA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "manager closed: read tcp 10.88.0.3:28967->72.52.83.203:10610: read: connection reset by peer", "errorVerbose": "manager closed: read tcp 10.88.0.3:28967->72.52.83.203:10610: read: connection reset by peer\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:183\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223", "Size": 0, "Remote Address": "72.52.83.203:10610"}

I’m also seeing my “suspension” metric for just us2.storj.io:7777 at around 80% now. I’ve checked:

[dankasak@mercury photoprism]$ sudo docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed -c
2
[dankasak@mercury photoprism]$ 

(as I found in the forums). This doesn’t look “too” bad?

Possibly related: I’m doing an initial push of around 300GB into Storj via rclone, and this is taking multiple days. I’ve bandwidth-limited this somewhat, but upload traffic is still a little busy.
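
For context, the push is roughly this; the remote name, source path, and limit values are placeholders for my actual settings:

# throttled bulk upload so the storage node's own traffic isn't completely starved
rclone copy /data/backup storj:my-bucket --bwlimit 4M --transfers 4 --progress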

Does anyone know what’s happening?

Hello @dankasak ,
Welcome to the forum!

Maybe that’s why the log files are bloating (more than usual) and the node services are stopping unexpectedly, since the dates match the first occurrence of the 1m timeout error (roughly 23-24.03.23, guessed).

I wanted to point that out since my node’s online score is slowly going down, and the suspension score also shows the ~3 min of downtime until Windows detects the service is down and restarts it, a couple of times a day.

A change has been checked in (and merged) to reduce the number of these errors making it into the logs:
https://review.dev.storj.io/c/storj/drpc/+/9994


I have the latest storage node image but still see lots of connection reset by peer errors. My download success rate is 96%; is this normal?

I’m seeing almost a 30% decrease in total egress, from 13GB to 8GB of upload a day over the last 5 days.
One of my nodes used to egress 6GB a day; in the past 2 days it’s only getting 2.5GB.
It’s very weird and inconsistent.

It is the clients’ egress; it’s not predictable.


But couldn’t the lower egress be correlated with the connection reset by peer errors, through their influence on reputation and/or failed downloads?

Welcome to the forum!

You cannot do anything about it. Your node is not fast enough to compete in the race.

It depends on why your download failed. If it’s because of a disk issue, the disk needs to be checked. If it’s due to a connection reset, your node was too slow to respond to the request for the piece.
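
If you want to see which case it is, you can split the failed downloads by error text; a rough sketch (the container name matches the earlier posts, "connection reset by peer" is the message quoted above, and "input/output error" is just an example pattern for disk problems):

# failed downloads due to the peer closing the connection vs. possible disk errors
docker logs storagenode 2>&1 | grep "download failed" | grep -c "connection reset by peer"
docker logs storagenode 2>&1 | grep "download failed" | grep -c "input/output error"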

PS: Storj isn’t mining. All we can do is make sure our nodes are kept online 24/7.

Not fast enough? With a 10G dedicated uplink?

Uplink speed is not always the main factor; packet round-trip time to the client plus HDD write speed is the main one.
Is the HDD connected by SATA? There are a lot of factors. The HDD not only writes but also reads a lot of files, so it also spends time seeking those files. It can’t do both at the same time, so it does them in order.
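
If you want a rough picture of how the disk copes with mixed reads and writes, something like fio can show it; a minimal sketch, with the test directory and sizes as placeholders (point it at a directory on the node’s disk, not at the pieces themselves; it creates about 4GB of test files):

# mixed random read/write test with small, storage-node-like blocks
fio --name=storj-mix --directory=/mnt/storagenode/fio-test --rw=randrw --rwmixread=70 --bs=128k --size=1G --numjobs=4 --runtime=60 --time_based --group_reporting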

OK, thanks Vadim, that makes sense. I have the HDD on SATA and no SSD for cache, so maybe that’s the problem. Do you know any good scripts to test the performance for this?

Plenty of RAM (32GB). It’s a new node with only 3TB maximum storage, and I currently have about 50GB stored.

Do you use this setting in the config?
filestore.write-buffer-size: 4 MiB
It uses more RAM, a lot more, but it doesn’t buffer the piece to temp; it writes it to the HDD all at once, which saves some IO on the HDD.
As far as I’ve seen, the maximum piece size is 4MB.

inside the container?

If so, I have this:

root@mycontainer:/app/config# dir
config.yaml  orders  revocations.db  storage  trust-cache.json

cat config.yaml | grep "filestore"

filestore.write-buffer-size: 128.0 KiB

@Vadim

That is the default setting, but files are usually bigger, so it writes one part to the HDD, then the next, and so on, up to 128KB at a time. With 4MB it will write everything at once.
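
So the change is just one line in config.yaml (inside the container that is /app/config/config.yaml, as shown above); 4 MiB is the value suggested earlier:

filestore.write-buffer-size: 4 MiB

Then restart the node so the setting takes effect:

docker restart storagenode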
