"GET_REPAIR”, “error”: “used serial already exists in store”

Tracked down iostat in the sysstat package and got it installed. Yeah, there were some values in red there.
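For anyone following along, on Raspberry Pi OS (or any other Debian-based system) that boils down to the following; the flags are standard sysstat ones:

# Install the sysstat package, which provides iostat
sudo apt install sysstat
# One-shot report with extended per-device statistics
iostat -x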

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.78    0.00    5.55   62.93    0.00   22.74

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
mmcblk0          0.80    7.20     12.00     30.40     0.00     0.40   0.00   5.26    1.50    6.72   0.05    15.00     4.22   3.50   2.80
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd             51.60    4.20   2398.40    140.00   478.80    14.60  90.27  77.66   33.08   35.19   1.91    46.48    33.33  17.89  99.80
sdb             45.80    0.00   2664.80      0.00   624.20     0.00  93.16   0.00   35.74    0.00   1.64    58.18     0.00  21.79  99.80
sdc             45.60    2.20   2624.00     33.60   611.80     5.00  93.06  69.44   35.62   36.18   1.73    57.54    15.27  20.96 100.20

(The values I can't show in red here are: sdd %rrqm, %wrqm, and %util; sdb %rrqm and %util; sdc %rrqm and %util, with %wrqm in purple.)

Swap is only 100M, which would seem to be the RPi OS default based on this. With such minimal space already allocated there, and to try to avoid wearing out my microSD, I'm wondering if I can get away with disabling it entirely.

Yes you can - I did it too.
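For reference, Raspberry Pi OS manages its swap file through dphys-swapfile, so on a default install something along these lines should turn it off for good (a sketch, run at your own discretion):

# Deactivate and delete the current swap file
sudo dphys-swapfile swapoff
sudo dphys-swapfile uninstall
# Keep the service from recreating it on the next boot
sudo systemctl disable --now dphys-swapfile
# Confirm that no swap is in use anymore
free -h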

1 Like

%util is at 100%: those drives are at full I/O load.
%iowait is 63%, which points to the same thing - the CPU is spending 63% of its time waiting for a hard drive to read data.
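If you want to keep watching it while the node is busy, iostat can refresh the extended device statistics on an interval, for example:

# Extended per-device stats every 5 seconds; the first report is the
# average since boot, later ones cover each 5-second interval
iostat -dx 5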

2 Likes

Yeah, I’m wondering if the filewalk is trying to run across all three nodes, since I just came back up from a power outage earlier today. I’m fairly sure that when I checked the system around the time I had the GET_REPAIR errors, the load average was much different, so I think the current stats are not normal activity. Thanks for the insight on the I/O! The learning part is one of the things I’ve really enjoyed about taking part in Storj.

To clarify, I don’t think the error means that the node is running too slow, or that there are necessarily any performance problems. It’s something we need to fix on our end, which just happens to manifest more frequently when a node is running more slowly (perhaps very slightly more slowly).

4 Likes
pi@pi:~ $ cat /mnt/WD1003/logs/sn1.log | grep "WQH66MCMFMYNDV3FSAP4B2ZO5XD43IXMLRA57PCSJAM7PNRQL2JA"
2022-07-10T10:03:57.159Z	INFO	piecestore	download started	{"Process": "storagenode", "Piece ID": "WQH66MCMFMYNDV3FSAP4B2ZO5XD43IXMLRA57PCSJAM7PNRQL2JA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"}
2022-07-10T10:04:59.713Z	INFO	piecestore	downloaded	{"Process": "storagenode", "Piece ID": "WQH66MCMFMYNDV3FSAP4B2ZO5XD43IXMLRA57PCSJAM7PNRQL2JA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"}
2022-07-10T10:08:57.284Z	INFO	piecestore	download started	{"Process": "storagenode", "Piece ID": "WQH66MCMFMYNDV3FSAP4B2ZO5XD43IXMLRA57PCSJAM7PNRQL2JA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR"}
2022-07-10T10:08:57.287Z	ERROR	piecestore	download failed	{"Process": "storagenode", "Piece ID": "WQH66MCMFMYNDV3FSAP4B2ZO5XD43IXMLRA57PCSJAM7PNRQL2JA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "used serial already exists in store", "errorVerbose": "used serial already exists in store\n\tstorj.io/storj/storagenode/piecestore/usedserials.insertSerial:263\n\tstorj.io/storj/storagenode/piecestore/usedserials.(*Table).Add:117\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:76\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:498\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
pi@pi:~ $ cat /mnt/WD1003/logs/sn1.log | awk '$0~/used serial already exists in store/ {sub(/T.*/,"");a[$0]++} END {for(d in a){print d,a[d]}}' | sort
2022-07-10 1

Happening very rarely at the moment (once every two weeks or so), but again it's a case where the download was successful before.

Everything I can find indicates that the problem area is still the same as where I tried to fix it before. I now believe the prior fix just didn’t go far enough. This time we’ll make sure that the first connection actually failed during the dial procedure before we attempt reconnection. Otherwise, if the dialing succeeded, retrying won’t do any good and will instead lead to this error if the first connection got far enough to submit the order.

I have a change pending, awaiting review, that I hope will address the issue all the way.

4 Likes

Hi, just today I started getting this error and got suspended. What action should I take to resolve this?

Nothing. Keep an eye on your score, but it should resolve itself soon.
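If you'd rather watch it from the command line than the web dashboard, the node's local dashboard API (port 14002 by default) exposes the per-satellite data; the exact JSON field names vary between versions, so treat this as a sketch:

# Dump the per-satellite stats (the audit/suspension scores are in here);
# jq is only used for pretty-printing
curl -s http://localhost:14002/api/sno/satellites | jq .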

There has been a satellite deployment with the second fix described above! Please post here or on the GitHub issue if you continue to see this error.

7 Likes

I got this error on two of my nodes. It seems to happen once per day, around nighttime (maybe that is relevant). The oldest entry is from the 29th of November, so it probably started with the v1.67.3 update.
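To see whether the nighttime pattern really holds, the errors can be bucketed per hour with a one-liner along the lines of the awk command earlier in the thread (the log path is just a placeholder for your node's log):

# Count occurrences of the error per hour of the day
grep "used serial already exists in store" /path/to/storagenode.log | awk '{split($1,t,"T"); h[substr(t[2],1,2)]++} END {for(x in h) print x, h[x]}' | sort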

NOTE 1: The error might have been present for longer, since I only recently set up persistent logs.

NOTE 2: The nodes have been having hardware issues with disks disconnecting and kernel hangups (fixed about a month ago). Not sure if this could be related to my issues. I also have errors of this kind: "error": "pieces error: filestore error: file does not exist", which I very much think are related to my previous hardware problems.

Nevermind. The errors stopped once I fixed another (software) issue causing slow responses. Now the node seems to operate as expected. No further issues.

2 Likes