"GET_REPAIR”, “error”: “used serial already exists in store”

For the month of June:

Node 1 (3.55 TB, full)

2022-06-12 1
2022-06-17 1
2022-06-24 1

Node 2 (1.74 TB, full)

2022-06-21 1

I could go back further, although I am missing some logs from May so it would be an incomplete record.


I ran this on my log files starting 2022-04-07 (I have older log files; if you need it, I can run it on those as well)

2022-04-08 10
2022-04-09 19
2022-04-10 24
2022-04-11 14
2022-06-14 1

Any suggestions on how to quickly find out? I guess I could query the DBs, but I don’t want to lock the databases while the node is running.
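(If it did come to that, one way to avoid locking the live databases would be to copy the .db file and query the copy instead; the paths here are just examples, not my actual layout:)

cp /mnt/storagenode/storage/bandwidth.db /tmp/bandwidth.db
sqlite3 /tmp/bandwidth.db "SELECT name FROM sqlite_master WHERE type='table';"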

I’m trying to think back though. I did some large data migrations around that time. It’s possible I created a fairly big IO bottleneck which may have impacted things. I feel like this odd outlier is more likely to be because of a special scenario on my end than high traffic alone.

You could tweak my command above so that it only looks for ERROR instead of "used serial already exists in store". That would give a count of errors per day, which would probably go up if you had higher traffic or an unrelated local problem.
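For example, something like this (the log path is just a placeholder):

cat /path/to/node.log | awk '$0~/ERROR/ {sub(/T.*/,"");a[$0]++} END {for(d in a){print d,a[d]}}' | sort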

That’s running now, though it might take a while since it’s a large log. But I would also see more errors if there was indeed an IO bottleneck, simply because of errors caused by timeouts, so it won’t really be conclusive.

Done!

2022-04-24 32
2022-04-25 29
2022-04-26 33
2022-04-27 114
2022-04-28 43
2022-04-29 37
2022-04-30 19
2022-05-01 185
2022-05-02 209
2022-05-03 102
2022-05-04 103
2022-05-05 121
2022-05-06 26
2022-05-07 11
2022-05-08 20
2022-05-09 36
2022-05-10 49
2022-05-11 140
2022-05-12 31
2022-05-13 21
2022-05-14 17
2022-05-15 32
2022-05-16 21

This does seem to be during the time I was doing data migrations. I snipped the relevant rows; the rest looks fairly normal, around 20-50 a day.
There is also a more recent peak, though:

2022-06-22 51
2022-06-23 34
2022-06-24 63
2022-06-25 210
2022-06-26 89
2022-06-27 87
2022-06-28 45

I don’t think I was doing anything special that day…

Now that I think back on it, I’m almost certain the 5th was during the migrations I was doing. I think I even saw the error and figured it wasn’t too big a deal, since others were seeing it too and it was probably temporary due to high load anyway. I was kind of busy at the time, so I didn’t really give it another thought.

I can provide the full log if that’s useful.

2022-06-19 2
2022-06-20 1
2022-06-21 4

Unfortunately logging has increased so much in recent weeks that my 100m x3 log rotation only lasts for 43h… older logs are gone, and I need to rethink how to store the logs more efficiently elsewhere.

Try running that same command with the capital T changed to a colon (:). That should give a count per hour, which is still useful.
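That is, the sub() pattern becomes /:.*/ instead of /T.*/, so the timestamp is cut after the hour rather than after the date. For example (again, the log path is just a placeholder):

cat /path/to/node.log | awk '$0~/used serial already exists in store/ {sub(/:.*/,"");a[$0]++} END {for(d in a){print d,a[d]}}' | sort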

On my two bigger nodes, for June:

2.3TB node running on a weak 2.5" SMR drive:

2022-05-31 3
2022-06-01 12
2022-06-02 8
2022-06-03 8
2022-06-04 8
2022-06-05 13
2022-06-06 11
2022-06-07 8
2022-06-08 10
2022-06-09 6
2022-06-10 3
2022-06-11 2
2022-06-12 3
2022-06-13 3
2022-06-14 3
2022-06-15 6
2022-06-16 7
2022-06-17 2
2022-06-18 8
2022-06-19 7
2022-06-20 2
2022-06-21 2
2022-06-22 3
2022-06-23 3
2022-06-24 3
2022-06-25 4
2022-06-26 2
2022-06-27 3
2022-06-28 1

2.5TB node running on a kinda more performant 3.5" drive, but still SMR:

2022-06-01 1
2022-06-03 1
2022-06-07 1
2022-06-15 1
2022-06-22 1
2022-06-27 1

2022-04-09 6
2022-04-10 7
2022-04-11 7
2022-04-21 2
2022-04-22 1
2022-05-21 1
2022-06-11 1
2022-06-21 2

Grepped a year’s worth of logs of nodes storing ~13TB. Mostly older data though; I was out of free space and had barely any ingress for many months. Non-SMR 3.5" drives. It’s a bit curious how these events clump around the 10th and 21st of these months… I do not recall anything specific that could have impacted the nodes on those days.

From my 102 nodes combined:

root@server030:/disk102/storj/logs/storagenode# cat server*.log ./old/server*.log | awk '$0~/used serial already exists in store/ {sub(/T.*/,"");a[$0]++} END {for(d in a){print d,a[d]}}' | sort
2022-03-28 21
2022-03-29 2
2022-04-05 1
2022-04-06 2
2022-04-07 2
2022-04-08 2
2022-04-09 4
2022-04-10 11
2022-04-11 11
2022-04-12 7
2022-04-15 2
2022-04-20 53
2022-04-23 1
2022-04-24 5
2022-04-27 1
2022-04-28 4
2022-04-29 6
2022-04-30 2
2022-05-01 2
2022-05-02 1
2022-05-06 1
2022-05-10 9
2022-05-11 4
2022-05-12 1
2022-05-13 3
2022-05-14 5
2022-05-16 1
2022-05-19 1
2022-05-23 6
2022-05-27 5
2022-05-31 2
2022-06-02 2
2022-06-03 3
2022-06-05 1
2022-06-06 2
2022-06-10 16
2022-06-14 2
2022-06-15 3
2022-06-17 1
2022-06-19 1
2022-06-20 8
2022-06-21 8
2022-06-25 2
2022-06-27 4
2022-06-28 1

th3van.dk


Thank you!

With that many nodes, it might be helpful to know if the counts tend to clump on a particular set of nodes, or if each one has the error at about the same rate. Is that something you can tell?

Are all of the nodes identical in hardware or provisioning?

grep -c "used serial already exists in store" 2022-06-*-sn0003.log
2022-06-01-sn0003.log:0
2022-06-02-sn0003.log:0
2022-06-03-sn0003.log:0
2022-06-04-sn0003.log:0
2022-06-05-sn0003.log:0
2022-06-06-sn0003.log:0
2022-06-07-sn0003.log:0
2022-06-08-sn0003.log:0
2022-06-09-sn0003.log:0
2022-06-10-sn0003.log:0
2022-06-11-sn0003.log:0
2022-06-12-sn0003.log:0
2022-06-13-sn0003.log:0
2022-06-14-sn0003.log:0
2022-06-15-sn0003.log:0
2022-06-16-sn0003.log:0
2022-06-17-sn0003.log:0
2022-06-18-sn0003.log:0
2022-06-19-sn0003.log:0
2022-06-20-sn0003.log:0
2022-06-21-sn0003.log:0
2022-06-22-sn0003.log:0
2022-06-23-sn0003.log:0
2022-06-24-sn0003.log:0
2022-06-25-sn0003.log:0
2022-06-26-sn0003.log:0
2022-06-27-sn0003.log:0
2022-06-28-sn0003.log:0
2022-06-29-sn0003.log:0

Of course, with a result like that I wondered whether my command actually worked… so just for reference:

grep -c "Piece ID" 2022-06-*-sn0003.log
2022-06-01-sn0003.log:216487
2022-06-02-sn0003.log:229853
2022-06-03-sn0003.log:235781
2022-06-04-sn0003.log:245527
2022-06-05-sn0003.log:218385
2022-06-06-sn0003.log:229059
2022-06-07-sn0003.log:225715
2022-06-08-sn0003.log:207170
2022-06-09-sn0003.log:227148
2022-06-10-sn0003.log:282787
2022-06-11-sn0003.log:349699
2022-06-12-sn0003.log:259303
2022-06-13-sn0003.log:108324
2022-06-14-sn0003.log:179542
2022-06-15-sn0003.log:156895
2022-06-16-sn0003.log:127013
2022-06-17-sn0003.log:118273
2022-06-18-sn0003.log:110154
2022-06-19-sn0003.log:216502
2022-06-20-sn0003.log:245144
2022-06-21-sn0003.log:259959
2022-06-22-sn0003.log:251251
2022-06-23-sn0003.log:229541
2022-06-24-sn0003.log:238729
2022-06-25-sn0003.log:232899
2022-06-26-sn0003.log:197043
2022-06-27-sn0003.log:194594
2022-06-28-sn0003.log:257216
2022-06-29-sn0003.log:14087

The command seems to work just fine…
I also tried a cat | awk with the same result:

cat 2022-06-28-sn0003.log | awk '$0~/used serial already exists in store/ {sub(/T.*/,"");a[$0]++} END {for(d in a){print d,a[d]}}' | sort

but that just gives no output, which I assume is equivalent to 0 or null (the awk array stays empty, so the END loop prints nothing).

My storage setup is rather special and elaborate, so maybe that has something to do with it…
I am also running on Debian in a container…

I doubt that matters much… I’m using the vfs Docker storage driver, which isn’t really recommended, but it’s the only thing I have had working well with my current setup.

I might try to see if I can replicate the others’ results by moving a storagenode to a dedicated disk.

I checked all the logs for the entire year and it’s all the same.

./successrate.sh 2022-06-*-sn0003.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            10413
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                3393
Fail Rate:             0.327%
Canceled:              7831
Cancel Rate:           0.754%
Successful:            1027769
Success Rate:          98.920%
========== UPLOAD =============
Rejected:              4857
Acceptance Rate:       99.567%
---------- accepted -----------
Failed:                346
Fail Rate:             0.031%
Canceled:              8594
Cancel Rate:           0.769%
Successful:            1108853
Success Rate:          99.200%
========== REPAIR DOWNLOAD ====
Failed:                5
Fail Rate:             0.002%
Canceled:              1
Cancel Rate:           0.000%
Successful:            300151
Success Rate:          99.998%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              14
Cancel Rate:           0.008%
Successful:            178980
Success Rate:          99.992%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            15421
Success Rate:          100.000%

Not that my success rates are perfect… excuse the rejected uploads.
Sometimes, if the system is overloaded, it can get quite bad because the network will keep spamming uploads, so I capped it at 100 max concurrent requests.
I suppose I should raise that to get the last 0.5%; I just haven’t gotten around to it…
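For reference, that cap sits in the node’s config.yaml; as far as I remember the option is called storage2.max-concurrent-requests (verify against your own config), so something like this should show it:

grep -i "max-concurrent" /path/to/config.yaml
# expected output, roughly: storage2.max-concurrent-requests: 100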


Thanks for digging into this! My nodes have been clean for a while, but here is what I got running your command…

Node 1:

2022-05-17 80
2022-05-18 95
2022-05-26 121
2022-05-27 115
2022-05-28 94

Node 2:
nothing

Node 3:
nothing

All three of these are running on the same RPi, each with its own drive. I’m guessing the data may have rolled off the logs. All nodes are currently on 1.56.4.


Are any of your nodes on an SMR drive?

If the question was for me (or everybody), that’s a negative for my rig.

2x WDC WD20EARS-00MVWB0, listed as “PMR” which is supposedly equivalent to CMR and certainly not SMR.

1x WDC WD80EFZZ-68BTXN0, purchased after I learned of CMR/SMR and specifically purchased because it is CMR.

EDIT: and I had these errors on all 3 of my nodes (drives) when I initially reported.

Yeah, it was a question for you. I just wanted to see if we had the same thing in common: I know we both use external enclosures, so I don’t know whether that is a related factor or SMR is. I just wanted to make sure, in case it comes down to an IO issue or not.

I think we have enough info at this point to be pretty sure that the error is more likely to happen when the node is operating more slowly (whether that means limited CPU capacity, high CPU usage, high I/O load, etc.). So there may be a race condition causing this effect. That’s where I’ll be looking first.


I will admit that is possible, as I’m pushing three nodes on a single RPi3. But any time I’ve looked in htop, the “instant” CPU is well within limits. The load average, though, seems to tell a different story, and I can’t figure out why. I have only been powered up for a bit over two hours though (a power outage here outlasted my UPS), so maybe there is something I’m missing (filewalker?) that is skewing the numbers.
(Screenshot from 2022-06-29 14-47-32: htop output)

Check whether your I/O is overloaded (iostat -x 5).
Memory usage looks weird to me - swap is completely used.
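Something like this, watching the drive the node lives on:

# extended device stats, refreshed every 5 seconds;
# a drive sitting near 100 in %util with high await/r_await/w_await is the bottleneck
iostat -x 5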
