Connection reset by peer errors

The success rate script only uses the storage node's log output, and the issue is about how that output is interpreted, not about the success rate script itself.


I’m still seeing this… is there an update coming soon that fixes it?

I don’t think it’s correct to assume these numbers are wrong. On average 10 out of 39 downloads are canceled. That’s 26%. Maybe your node is a little slower than average and we’re finally seeing the actual numbers pop up in the logs.
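
If you want to check the ratio on your own node, counting finished vs. canceled downloads straight from the log is enough; a rough sketch, assuming the usual docker setup with a container named storagenode and the standard "downloaded" / "download canceled" piecestore messages:

# successful GET downloads vs. canceled ones, counted from the node log
docker logs storagenode 2>&1 | grep '"Action": "GET"' | grep -c "downloaded"
docker logs storagenode 2>&1 | grep '"Action": "GET"' | grep -c "download canceled"

The second count divided by the sum of the two is roughly the cancel rate the success rate script reports.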

For what it’s worth, I’m seeing a lot more canceled downloads as well.


@BrightSilence good to know, maybe it just wasn’t being reported in the logs before now…

Hi all. I’m a relatively new SNO. Everything has been running fine, but for the past day or so, I’m seeing an error like this every couple of seconds:

2023-03-21T22:00:00.935Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "Y53PAJPYHPBRPRDEBTFDYV4E6UOPTKDBPGTUS2JY55JJYSDO6UPQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 1221120, "Size": 2560, "Remote Address": "184.104.224.98:40764", "error": "manager closed: read tcp 10.88.0.3:28967->184.104.224.98:40764: read: connection reset by peer", "errorVerbose": "manager closed: read tcp 10.88.0.3:28967->184.104.224.98:40764: read: connection reset by peer\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:183\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223"}

… or …

2023-03-21T21:59:40.359Z        ERROR   piecestore      upload failed   {"Process": "storagenode", "Piece ID": "AYNAEOIVJHJCR4TCJ2EGZNJVW5J2WFNIPJF2VHTNDFDGIV3IUVXA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "error": "manager closed: read tcp 10.88.0.3:28967->72.52.83.203:10610: read: connection reset by peer", "errorVerbose": "manager closed: read tcp 10.88.0.3:28967->72.52.83.203:10610: read: connection reset by peer\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:183\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:143\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:96\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:223", "Size": 0, "Remote Address": "72.52.83.203:10610"}

I’m also seeing my “suspension” metric for just us2.storj.io:7777 at around 80% now. I’ve checked:

[dankasak@mercury photoprism]$ sudo docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed -c
2
[dankasak@mercury photoprism]$ 

(as I found in the forums). This doesn’t look “too” bad?

Possibly related: I’m doing an initial push of around 300GB into Storj via rclone, and this is taking multiple days. I’ve bandwidth-limited this somewhat, but upload traffic is still a little busy.
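
For context, the push is roughly this; the remote name, source path, and limit values are placeholders for my actual settings:

# throttled bulk upload so the storage node's own traffic isn't completely starved
rclone copy /data/backup storj:my-bucket --bwlimit 4M --transfers 4 --progress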

Does anyone know what’s happening?

Hello @dankasak ,
Welcome to the forum!

Maybe that’s why the log files are bloating (more than usual) and the node services are stopping unexpectedly, since the dates match the first occurrence of the 1m timeout error (roughly 23-24.03.23, guessed).

I wanted to point that out since my node’s online score is slowly going down, and the suspension score also shows the ~3 min of downtime until Windows detects the service is down and restarts it, a couple of times a day.

A change has been checked in (and merged) to reduce the number of these errors making it into the logs:
https://review.dev.storj.io/c/storj/drpc/+/9994


I have the latest storage node image but still see lots of connection reset by peer errors. My download success rate is 96%; is this normal?

I’m seeing almost a 30% decrease in total egress, from 13GB to 8GB of upload a day over the last 5 days.
One of my nodes used to egress 6GB a day; in the past 2 days it’s only getting 2.5GB.
It’s very weird and inconsistent.

It is the clients’ egress; it’s not predictable.


But couldn’t the lower egress be correlated with the connection reset by peer errors, through their influence on reputation and/or failed downloads?

Welcome to the forum!

You cannot do anything about it. Your node is not fast enough to compete in the race.

It depends on why your download failed. If it’s because of a disk issue, the disk needs to be checked. If it’s due to a connection reset, your node was too slow to respond to the request for the piece.
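
If you want to see which case it is, you can split the failed downloads by error text; a rough sketch (the container name matches the earlier posts, "connection reset by peer" is the message quoted above, and "input/output error" is just an example pattern for disk problems):

# failed downloads due to the peer closing the connection vs. possible disk errors
docker logs storagenode 2>&1 | grep "download failed" | grep -c "connection reset by peer"
docker logs storagenode 2>&1 | grep "download failed" | grep -c "input/output error"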

PS: Storj isn’t mining. All we can do is make sure our nodes are kept online 24/7.

Not fast enough? With a 10G dedicated uplink?

Uplink speed is not always the main factor; packet round-trip time to the client plus HDD write speed is the main one.
Is the HDD connected by SATA? There are a lot of factors. The HDD not only writes but also reads a lot of files, so it also spends time seeking those files. It can’t do both at the same time, so it does them in order.
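
If you want a rough picture of how the disk copes with mixed reads and writes, something like fio can show it; a minimal sketch, with the test directory and sizes as placeholders (point it at a directory on the node’s disk, not at the pieces themselves; it creates about 4GB of test files):

# mixed random read/write test with small, storage-node-like blocks
fio --name=storj-mix --directory=/mnt/storagenode/fio-test --rw=randrw --rwmixread=70 --bs=128k --size=1G --numjobs=4 --runtime=60 --time_based --group_reporting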

OK, thanks Vadim, that makes sense. I have the HDD on SATA and no SSD for cache, so maybe that’s the problem. Do you know any good scripts to test the performance for this?

Plenty of RAM (32GB). It’s a new node with only 3TB maximum storage, and I currently have about 50GB stored.

Do you use this setting in the config?
filestore.write-buffer-size: 4 MiB
It uses more RAM, a lot more, but it doesn’t buffer the piece to temp; it writes it to the HDD all at once, which saves some IO on the HDD.
As far as I’ve seen, the maximum piece size is 4MB.

inside the container?

If so, I have this:

root@mycontainer:/app/config# dir
config.yaml  orders  revocations.db  storage  trust-cache.json

cat config.yaml | grep "filestore"

filestore.write-buffer-size: 128.0 KiB

@Vadim

That is the default setting, but files are usually bigger, so it writes one part to the HDD, then the next, and so on, up to 128KB at a time. With 4MB it will write everything at once.
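
So the change is just one line in config.yaml (inside the container that is /app/config/config.yaml, as shown above); 4 MiB is the value suggested earlier:

filestore.write-buffer-size: 4 MiB

Then restart the node so the setting takes effect:

docker restart storagenode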
