"GET_REPAIR”, “error”: “used serial already exists in store”

Yes, it seems to recover pretty quickly. I think the lowest I've seen was a 90% suspension score.

Here's another interesting one, where the TCP connection has an issue as well, @thepaul:

docker logs sn1 2>&1 | grep "K4OTLEQTMSVRK3NFXSKOSQZL3O7M6XDLISY342M7IAKR6W3M27TA"
2022-04-01T10:12:07.957Z	INFO	piecestore	download started	{"Piece ID": "K4OTLEQTMSVRK3NFXSKOSQZL3O7M6XDLISY342M7IAKR6W3M27TA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"}
2022-04-01T10:13:09.182Z	ERROR	piecestore	download failed	{"Piece ID": "K4OTLEQTMSVRK3NFXSKOSQZL3O7M6XDLISY342M7IAKR6W3M27TA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "write tcp 172.17.0.4:28967->5.161.92.121:40840: write: connection timed out", "errorVerbose": "write tcp 172.17.0.4:28967->5.161.92.121:40840: write: connection timed out\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:322\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:400\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:619\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-04-01T10:17:07.954Z	INFO	piecestore	download started	{"Piece ID": "K4OTLEQTMSVRK3NFXSKOSQZL3O7M6XDLISY342M7IAKR6W3M27TA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"}
2022-04-01T10:17:07.956Z	ERROR	piecestore	download failed	{"Piece ID": "K4OTLEQTMSVRK3NFXSKOSQZL3O7M6XDLISY342M7IAKR6W3M27TA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "used serial already exists in store", "errorVerbose": "used serial already exists in store\n\tstorj.io/storj/storagenode/piecestore/usedserials.insertSerial:263\n\tstorj.io/storj/storagenode/piecestore/usedserials.(*Table).Add:117\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:497\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

I guess this will now have an impact on the suspension score (not yet visible).
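To get a feel for how often this actually happens before it shows up in the scores, counting the occurrences per day is enough. A rough sketch, reusing the sn1 container name from the command above (adjust to your own node):

# Count "used serial already exists" errors per day
docker logs sn1 2>&1 | grep "used serial already exists in store" | cut -c1-10 | sort | uniq -c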

An update:

I’ve been told that the satellites are not configured to make any outgoing QUIC connections at this time. So either the QUIC deployment timing is just a coincidence, or the increased network traffic from inbound QUIC connections made TCP connection timeouts increase, which resulted in the “errors from Close()” that trigger this bug.

The fix I’ve made should still address this problem, whether or not QUIC is involved, unless the cause is something dramatically different. That fix will probably be deployed to the satellites in ~6 more days.


We don't have a satellite deployment next week. Do you mean 13 days? Or a point release? That would also be an option.

Oh. I guess 13 days. I saw a deployment scheduled on the calendar. I admit I don’t know how our releases are managed.

Thanks for the ongoing transparency, constant updates and support. @thepaul @littleskunk

Version 1.52.2 @thepaul

2022-04-09T10:04:18.750Z      ERROR   piecestore      download failed {"Piece ID": "YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "write tcp 172.17.0.4:28967->5.161.73.234:36424: write: connection timed out", "errorVerbose": "write tcp 172.17.0.4:28967->5.161.73.234:36424: write: connection timed out
\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:322
\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:400
\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:620
\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-04-09T10:06:32.542Z        ERROR   piecestore      download failed {"Piece ID": "YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "used serial already exists in store", "errorVerbose": "used serial already exists in store
\tstorj.io/storj/storagenode/piecestore/usedserials.insertSerial:263
\tstorj.io/storj/storagenode/piecestore/usedserials.(*Table).Add:117
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:498
\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228
\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33
\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58
\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122
\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66
\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112
\tstorj.io/drpc/drpcctx.(*Tracker).track:52"} 

@Bivvo Just a suggestion: it's not helpful to tag devs in every post here and on GitHub. Anyone who has replied in this thread, and everyone who has set notifications to "watching", will receive a notification of your replies.

You should also read this thread so you don't report common messages like "context canceled".
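For example, something like this filters that noise out before looking for real problems (again assuming the sn1 container name used earlier; adjust as needed):

# Show recent errors, excluding the usual "context canceled" noise
docker logs sn1 2>&1 | grep ERROR | grep -v "context canceled" | tail -n 20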


Thx, no. :rofl:

I thought it could be interesting for the community as well. But I've understood your remark and will not "spam devs" anymore. :v:t2:

Perhaps your router cannot keep up, or your node can't transfer a piece in a timely manner, and thus the connection was closed. It could also be the network stack on this device. If you have a lot of such errors, I suggest re-checking your network configuration; maybe you have a misconfiguration there.
Generally @nerdatwork is right: this is a typical "context canceled" error caused by slowness of your node (for whatever reason: slow disks, network packet drops, connections dropped by your router, etc.).
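For example, a quick look at the TCP retransmission counters on the host running the node can hint at packet loss; a Linux-only sketch (netstat may need the net-tools package, and exact counter names vary by kernel):

# A steadily growing retransmit count points at network trouble
netstat -s | grep -i retrans
ss -s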

Only the second error is relevant to the current topic.


Yes, maybe.

Both have the same piece ID. From what I've understood, the issue relevant to this post is a follow-up problem of the "context canceled" one; at the very least it paints the full picture. It is relevant to know the full history of a problem, no?

I've done a full grep for that piece ID, which supports my theory:

docker logs sn1 2>&1 | grep 'YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A'
2022-04-09T10:01:32.529Z	INFO	piecestore	download started	{"Piece ID": "YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2022-04-09T10:04:18.750Z	ERROR	piecestore	download failed	{"Piece ID": "YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "write tcp 172.17.0.4:28967->5.161.73.234:36424: write: connection timed out", "errorVerbose": "write tcp 172.17.0.4:28967->5.161.73.234:36424: write: connection timed out\n\tstorj.io/drpc/drpcstream.(*Stream).rawWriteLocked:322\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:400\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:317\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func5.1:620\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2022-04-09T10:06:32.540Z	INFO	piecestore	download started	{"Piece ID": "YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR"}
2022-04-09T10:06:32.542Z	ERROR	piecestore	download failed	{"Piece ID": "YN7SGRCKRHRH72DCYOO7UUTPC4HL3ZCDJRU6MPW2MKY6YITU554A", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_REPAIR", "error": "used serial already exists in store", "errorVerbose": "used serial already exists in store\n\tstorj.io/storj/storagenode/piecestore/usedserials.insertSerial:263\n\tstorj.io/storj/storagenode/piecestore/usedserials.(*Table).Add:117\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:498\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

But as said, I’ll post that on GitHub only and won’t bother you here.

Can't remember if it can go to an actual 100% for me… it's been a long while since I attempted that.
But for all practical intents and purposes, if your setup is performing well enough, the success rate should get to about 100%.

I haven't managed to find any of those serial errors myself, so I'm not sure what those are about.
Here are my success rates for today.

========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            574
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                3
Fail Rate:             0.029%
Canceled:              21
Cancel Rate:           0.206%
Successful:            10167
Success Rate:          99.764%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                1
Fail Rate:             0.005%
Canceled:              23
Cancel Rate:           0.111%
Successful:            20774
Success Rate:          99.885%
========== REPAIR DOWNLOAD ====
Failed:                2
Fail Rate:             0.013%
Canceled:              0
Cancel Rate:           0.000%
Successful:            15496
Success Rate:          99.987%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            6380
Success Rate:          100.000%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            196
Success Rate:          100.000%

Of course it won't always look like that… I had an 88% download success rate when there was an egress peak recently, which is kinda terrible, I know… but I also had other issues I was trying to optimize and troubleshoot my way out of, so maybe that played a factor.


Thx @SGC. The devs already confirmed that it is an issue with the QUIC libs, and there was a fix in version 1.52.2, which has not helped yet. It is about the reuse of a serial with TCP and QUIC, where one single serial should only be used once. Here is a full description.

The same applies, from my understanding, to this one, as referenced here. It is not proven at all that those error logs have any notable impact on the scores, as they do not occur that often.

Anyway, I can confirm your stats; mine are similar. I can also confirm the drop in download successes, which went down to 89.xx% over the last few days (included in the log selection below).

========== AUDIT ============== 
Critically failed:     0 
Critical Fail Rate:    0.000%
Recoverable failed:    0 
Recoverable Fail Rate: 0.000%
Successful:            549 
Success Rate:          100.000%
========== DOWNLOAD =========== 
Failed:                214 
Fail Rate:             0.391%
Canceled:              721 
Cancel Rate:           1.317%
Successful:            53827 
Success Rate:          98.293%
========== UPLOAD ============= 
Rejected:              0 
Acceptance Rate:       100.000%
---------- accepted ----------- 
Failed:                14 
Fail Rate:             0.009%
Canceled:              276 
Cancel Rate:           0.170%
Successful:            162466 
Success Rate:          99.822%
========== REPAIR DOWNLOAD ==== 
Failed:                3 
Fail Rate:             0.008%
Canceled:              1 
Cancel Rate:           0.003%
Successful:            35636 
Success Rate:          99.989%
========== REPAIR UPLOAD ====== 
Failed:                0 
Fail Rate:             0.000%
Canceled:              9 
Cancel Rate:           0.016%
Successful:            57803 
Success Rate:          99.984%
========== DELETE ============= 
Failed:                0 
Fail Rate:             0.000%
Successful:            156497 
Success Rate:          100.000%

Haha, recently another one occurred, which is funny: the download of the piece was successful twice (!!) before a third download attempt ran into "used serial already exists in store". I'll post the logs on GitHub (only).

There is no fix expected in the new storagenode version; the fix for this problem is in the satellite code, which has a different deployment schedule.

To clarify, it was only supposition that QUIC was involved. I've since learned that the QUIC libs, at least, are unlikely to be at fault, because the satellites were not using outbound QUIC connections at all. It may still be related to QUIC if, for example, the extra inbound QUIC connections are causing congestion, and the congestion makes this situation more likely to happen.


I'm curious: if this is a widespread issue, why doesn't more than one of my nodes have the same problem?


My storage is kinda special: fast SSD-accelerated ZFS with HDDs in raidz1…
and I don't see the serial thing. I dunno if that is useful information, but I suppose it suggests that maybe it happens client-side?


I still get the error after I disabled UDP port forwarding, which makes this node's QUIC misconfigured.
I checked after receiving the node suspension email.


2022-04-11T21:48:09.385+0800	INFO	piecestore	download started	{"Piece ID": "GMEHY3YGXLGUWM7IBXMUIZTNQMX7KDYB3K7JCEIXDXEFTABJDVWA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR"}
2022-04-11T21:53:08.898+0800	INFO	piecestore	download canceled	{"Piece ID": "GMEHY3YGXLGUWM7IBXMUIZTNQMX7KDYB3K7JCEIXDXEFTABJDVWA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR"}
2022-04-11T21:53:09.398+0800	INFO	piecestore	download started	{"Piece ID": "GMEHY3YGXLGUWM7IBXMUIZTNQMX7KDYB3K7JCEIXDXEFTABJDVWA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR"}
2022-04-11T21:53:09.398+0800	ERROR	piecestore	download failed	{"Piece ID": "GMEHY3YGXLGUWM7IBXMUIZTNQMX7KDYB3K7JCEIXDXEFTABJDVWA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_REPAIR", "error": "used serial already exists in store", "errorVerbose": "used serial already exists in store\n\tstorj.io/storj/storagenode/piecestore/usedserials.insertSerial:263\n\tstorj.io/storj/storagenode/piecestore/usedserials.(*Table).Add:117\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:498\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Smaller node, I guess?

Not that small, a 5.6TB full node.
I got the suspension email for europe-north-1, but by the time I managed to check the node dashboard, the suspension score had already gone back up to 98%.

Strange, one single occurrence and suspended? Hmm. Anything else from the logs?
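For instance, pulling every failed GET_REPAIR download for the europe-north-1 satellite out of the log would show whether this was really a one-off. A sketch, assuming the node runs in a container named storagenode (adjust the name), with the satellite ID taken from the excerpt above:

# All failed repair downloads for europe-north-1
docker logs storagenode 2>&1 | grep "download failed" | grep GET_REPAIR | grep 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB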