ERROR piecestore download failed something trust rpc something more

I’m getting this error when I reboot my node; it pops up a lot for a little while and then doesn’t seem to show up any more… anyone got any ideas what this is about? I don’t have any checksum errors recorded anywhere, so there isn’t any data corruption (and it’s redundant if there was)… I’m also using ECC memory, and I’m pretty sure I haven’t lost a file…
I also haven’t gotten a failed audit during the last week, or in the last 4+ months…

Anyway, I dunno what it means… it says something about trust with the satellite or something…

The error goes like this… it seems to always be the same satellite, I think…

2020-11-30T15:45:50.712Z ERROR piecestore download failed {"Piece ID": "BR36ADYJLLI6KSYKBTK3F7FNQXL4T63FB75Z4AWNYK6JBXVDDRHQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "error": "trust: rpc: context canceled", "errorVerbose": "trust: rpc: context canceled\n\tstorj.io/common/rpc.TCPConnector.DialContext:93\n\tstorj.io/common/rpc.Dialer.dialEncryptedConn:175\n\tstorj.io/common/rpc.Dialer.DialNodeURL.func1:96\n\tstorj.io/common/rpc/rpcpool.(*Pool).Get:87\n\tstorj.io/common/rpc.Dialer.dialPool:141\n\tstorj.io/common/rpc.Dialer.DialNodeURL:95\n\tstorj.io/storj/storagenode/trust.Dialer.func1:51\n\tstorj.io/storj/storagenode/trust.IdentityResolverFunc.ResolveIdentity:43\n\tstorj.io/storj/storagenode/trust.(*Pool).GetSignee:143\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:134\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:462\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:29\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
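
Reading the errorVerbose stack trace from the bottom up: the Download handler was verifying the order limit’s signature, which required resolving the satellite’s identity, which meant dialing out, and that dial died with “context canceled”. Here’s a toy Go sketch of that shape (my own simplification for illustration; the function name is made up and this is not Storj’s actual code):

```go
// Toy model of the failing call chain: a download needs the satellite's
// identity to verify the order limit, which means dialing out with the
// request's own context. If the client hangs up first, the dial fails
// with "context canceled".
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// resolveSatelliteIdentity is a hypothetical stand-in for the
// trust.(*Pool).GetSignee step in the trace above.
func resolveSatelliteIdentity(ctx context.Context, addr string) error {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return fmt.Errorf("trust: rpc: %w", err)
	}
	return conn.Close()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	// Simulate the uplink abandoning the download almost immediately.
	go func() { time.Sleep(10 * time.Millisecond); cancel() }()

	// 192.0.2.1 (TEST-NET-1) never answers, so the dial blocks until canceled.
	err := resolveSatelliteIdentity(ctx, "192.0.2.1:7777")
	fmt.Println(err, errors.Is(err, context.Canceled)) // ... context canceled true
}
```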

Oh, I can see my smaller node does the same… so I guess it’s one of those usual errors :smiley: we should just learn to ignore… maybe?

Pulled from the errors sticky:

It’s possible that it just takes a bit for readahead and the ARC to become performant.

The only one in the list that is close is this one, and it’s not the same… kinda close, but not the same error message.

ERROR server rpc error: code = PermissionDenied desc = info requested from untrusted peer

@kalloritis I’ve got plenty of IOPS to spare, and the same happens, for the same satellite, on my small sub-100 GB storagenode that sits on a pool of 2x 2-way mirrors; it’s basically done booting before it starts lol

This is a context canceled error, aka long tail cancellation.

It’s just weird… I get like 8 of them when I boot the storagenode, but otherwise I don’t get them…

So it’s an outside issue, gotcha. Partly because it’s a long way to Asia from here, and maybe my networking algorithm just needs to figure out how best to send stuff down there.

It just had that trust thing in it, and it was logged as an error, so I wanted to make sure it wasn’t a real problem.

It’s not always easy to make heads or tails of this stuff, even though I really try.

Thanks for the help.

The Storj network’s algorithm when someone uploads (downloads) is: select any 110 (35) nodes, start the uploads (downloads), and when the first 80 (29) have finished, cancel all the others.
Simple and effective.
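
For illustration, here’s a toy Go sketch of that long-tail pattern, using the 110/80 upload numbers from above (my own sketch, not the real uplink code):

```go
// Long-tail cancellation: start N transfers, take the first K that
// finish, cancel the rest. Every straggler sees "context canceled"
// through no fault of its data.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

func transfer(ctx context.Context, id int, done chan<- int) {
	select {
	case <-time.After(time.Duration(rand.Intn(200)) * time.Millisecond):
		done <- id // this node finished in time
	case <-ctx.Done():
		// too slow: this transfer ends with "context canceled"
	}
}

func main() {
	const n, k = 110, 80 // upload: pick 110 nodes, keep the fastest 80
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	done := make(chan int, n)
	for i := 0; i < n; i++ {
		go transfer(ctx, i, done)
	}
	for i := 0; i < k; i++ {
		<-done // wait for the first k successes
	}
	cancel() // cut the long tail: the remaining 30 transfers are canceled
	fmt.Println("kept the fastest", k, "of", n, "pieces")
}
```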

I’m running BBR as my network congestion algorithm (I think it was called), which, to my understanding, will cache good routes to improve connections and gradually get smarter the longer it runs… at the cost of RAM usage for the different routing tables or whatever it uses… (it was supposedly the 2nd or 3rd best one currently in use), but it was the only one I could figure out how to install/configure on my server.
It’s developed by Google.
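
As far as I know, BBR doesn’t cache routes; it’s a TCP congestion-control algorithm that continuously models a path’s bottleneck bandwidth and round-trip time. System-wide it’s enabled with the sysctl net.ipv4.tcp_congestion_control=bbr, and it can also be selected per socket. A minimal Go sketch of the per-socket variant, assuming Linux with the tcp_bbr module loaded:

```go
// Select BBR for a single TCP socket on Linux via the TCP_CONGESTION
// socket option (system-wide it's the sysctl
// net.ipv4.tcp_congestion_control=bbr).
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	fd, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// Ask the kernel to use BBR for this socket; fails (e.g. ENOENT)
	// if the tcp_bbr module is not available.
	if err := unix.SetsockoptString(fd, unix.IPPROTO_TCP, unix.TCP_CONGESTION, "bbr"); err != nil {
		panic(err)
	}

	algo, err := unix.GetsockoptString(fd, unix.IPPROTO_TCP, unix.TCP_CONGESTION)
	if err != nil {
		panic(err)
	}
	fmt.Println("congestion control:", algo) // "bbr"
}
```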

It seems that this only happens for the satellite in Asia, and only a few times when I start the storagenode, which is what I find weird about it… otherwise I run 99.5 to 99.9% success rates, aside from the hour after my server does a hard reboot while all my caches warm up, and even then I usually get 95-98% success rates or better across the board…

So I just don’t understand why it happens only for that particular satellite… could the Great Firewall come into play?

DENIED DENIED (AI verifies the connection) ACCEPTED ACCEPTED ACCEPTED :smiley:
I dunno, it just seems weird, and I can’t imagine it’s a problem at my end, since neither the other satellites nor other nodes on those satellites show the issue… but all my nodes will give this error for that particular satellite a few times shortly after starting the storagenode.

It just seems weird, but of course it doesn’t matter since it’s not anything important.

So, the “context canceled” error is clear now?

The second problem, with your uptime, could be related to any of your modifications and maintenance.
Each satellite comes to audit your node independently. The online score falls when a satellite comes to audit your node but the node does not answer the request; at that moment your node is considered offline.
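
As a deliberately simplified sketch of that bookkeeping (the type and method names here are hypothetical, and the real satellites use windowed audit histories rather than a single running ratio):

```go
// onlineScore is a much-simplified model: the fraction of audit
// attempts the node answered. It only illustrates the direction of
// the math, not the satellites' actual windowed algorithm.
package main

import "fmt"

type onlineScore struct{ answered, total int }

func (s *onlineScore) record(answered bool) {
	s.total++
	if answered {
		s.answered++
	}
}

func (s onlineScore) value() float64 {
	if s.total == 0 {
		return 1.0
	}
	return float64(s.answered) / float64(s.total)
}

func main() {
	var s onlineScore
	for i := 0; i < 95; i++ {
		s.record(true) // node answered the audit
	}
	for i := 0; i < 5; i++ {
		s.record(false) // node was offline when a satellite audited it
	}
	fmt.Printf("online score: %.1f%%\n", s.value()*100) // 95.0%
}
```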

It only does a few of them, to the Asia satellite, within the first 30 seconds of starting up…
I do get an error from time to time, like a couple a day, fewer than ten…

My uptime is pretty good, though I did change my ISP the other day and they reset the router, so my port forwarding got lost, which ended up giving me a full night of downtime. Other than that, I was doing a bit of testing and trying to fix a problem, which was about 20 minutes of downtime…

My uptime has been near exemplary for months, so yeah, my uptime score isn’t perfect right now… and it will most likely suffer a couple more brief downtimes before I’m done restructuring my entire local infrastructure… but no uptime problem aside from what I basically caused myself and didn’t notice because I’m not tracking my stuff well enough :smiley:

So no real issue I need to look at with my offline time; I know exactly what happened…
Could downtime cause context errors during a storagenode startup?

Because that might make sense… I think I’ve only started to notice it after I took a bit of downtime a few days ago… that would actually make a whole lot of sense…

I could go back and check my logs; I’ve got pretty extensive and very nicely sorted logging :smiley:

Maybe some old orders or something it tries to send when starting the node, perhaps?

No. The “context canceled” error happens when the user or their uplink cancels the transfer for any reason. Usually it’s because your node is too slow for that particular user.

The downtime affects your online score. That’s all.

Okay, forget about the online score and downtime… I’m not sure why you brought it up in the first place; I figured you might have checked on my node, so I wanted to explain it.

The context canceled answer, though, doesn’t really ring true… there are some things that make no sense…

I get a sequence of them immediately after starting up my node, and then it just doesn’t happen beyond that baseline 0.01%, which I get on all the other satellites too, even when I start up the node.

However, Asia gives me like 8 quick ones in a row, and my other nodes also get it… just not as many, but they are tiny nodes.

If it was users, then it would be random and would also show on other satellites… of course, latency to Asia may be different… hell, it could have been all the shopping traffic affecting the connection to Asia, which might require my BBR algorithm to adapt to new routing.

0 context errors for 20 hours or whatever.

./successrate.sh sn1-2020-12-01.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            655
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                0
Fail Rate:             0.000%
Canceled:              18
Cancel Rate:           0.100%
Successful:            18004
Success Rate:          99.900%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              10
Cancel Rate:           0.675%
Successful:            1471
Success Rate:          99.325%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              0
Cancel Rate:           0.000%
Successful:            30159
Success Rate:          100.000%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              1
Cancel Rate:           0.025%
Successful:            3995
Success Rate:          99.975%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            3638
Success Rate:          100.000%

This is from when I rebooted the node yesterday.
The 10 download errors are all from the Asia satellite, within the first 30 seconds to 1 minute of starting it, all coming in rapid succession of one another. I know my upload success rate is pretty low here; it seems ingress was quite low, and though the server is quick, some people have better setups or are better located in regard to latency… can’t win all the time hehe

./successrate.sh sn1-2020-11-30.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            736
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                10
Fail Rate:             0.079%
Canceled:              5
Cancel Rate:           0.039%
Successful:            12721
Success Rate:          99.882%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                0
Fail Rate:             0.000%
Canceled:              13
Cancel Rate:           1.327%
Successful:            967
Success Rate:          98.674%
========== REPAIR DOWNLOAD ====
Failed:                0
Fail Rate:             0.000%
Canceled:              1
Cancel Rate:           0.008%
Successful:            12642
Success Rate:          99.992%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              2
Cancel Rate:           0.098%
Successful:            2035
Success Rate:          99.902%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            2679
Success Rate:          100.000%

Well, I’m just not used to seeing errors in my logs any more, and I couldn’t find them in the reference stuff on the forum… and it looked odd, all coming from the same satellite.

Okay, so I dug into my logs… it seems it only happened once and doesn’t happen now…
The cause was that my internet connection was broken when my router got reset by my ISP.

I shut down the node before figuring out that my port-forwarding router configuration was lost… then fixed it.

Then at node startup it threw those 10 context errors for the Asia satellite; I just thought I had seen it before…

I might have seen it in the past if I didn’t disconnect correctly or something, maybe on my new nodes… I have been doing some tests on those.

It does kinda make me think it has to be unsent orders that were attempted transmitted… dunno what else would make sense; it would also make sense that Asia would be the most prominent because of higher latency.

Sorry for the confusion… I think this makes sense now…
Thanks for your patience; dunno if you agree with me… not sure I need you to either lol
I like having answers; they don’t have to be perfect answers.

It’s better to be roughly right than precisely wrong… :smiley:
Thanks again.

How about this:
full November success rates, and more than 25% of my context errors happened within 30 seconds to 1 minute of reconnecting after my internet was disrupted…
The others were because I had a drive that was acting up, which is now fixed.
That’s why the download success rate was getting below 99.9%.

./successrate.sh sn1-2020-11-*.log
========== AUDIT ==============
Critically failed:     0
Critical Fail Rate:    0.000%
Recoverable failed:    0
Recoverable Fail Rate: 0.000%
Successful:            34303
Success Rate:          100.000%
========== DOWNLOAD ===========
Failed:                38
Fail Rate:             0.013%
Canceled:              5555
Cancel Rate:           1.881%
Successful:            289669
Success Rate:          98.106%
========== UPLOAD =============
Rejected:              0
Acceptance Rate:       100.000%
---------- accepted -----------
Failed:                3
Fail Rate:             0.001%
Canceled:              1439
Cancel Rate:           0.322%
Successful:            445020
Success Rate:          99.677%
========== REPAIR DOWNLOAD ====
Failed:                17
Fail Rate:             0.005%
Canceled:              1
Cancel Rate:           0.000%
Successful:            353009
Success Rate:          99.995%
========== REPAIR UPLOAD ======
Failed:                0
Fail Rate:             0.000%
Canceled:              165
Cancel Rate:           0.346%
Successful:            47572
Success Rate:          99.654%
========== DELETE =============
Failed:                0
Fail Rate:             0.000%
Successful:            562975
Success Rate:          100.000%

When you overthink something, you really overthink it and overcomplicate it to the point where it doesn’t make sense anymore.

I think my conclusion makes pretty good sense, though there is an argument to be made about why I should care in the first place… it was an oddity, and initially I was worried something might be wrong.

The “context canceled” error happens when the customer cancels the upload/download.
There are several reasons why it could happen:

  • long-tail cancellation;
  • network issues;
  • manual cancel.

When your node is starting, it’s too slow to answer, and thus it gets “context canceled”. That’s all. I think you do not need to invent other reasons, except for fun.
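
A minimal Go sketch of exactly that situation, a handler that is too slow while the client gives up (my own illustration, not node code):

```go
// The server's view of a client-side cancellation: the node is slow to
// answer (e.g. cold caches right after startup), the uplink hangs up
// first, and the handler's context reports "context canceled".
package main

import (
	"context"
	"fmt"
	"time"
)

// handleDownload pretends to serve a piece but needs 500ms.
func handleDownload(ctx context.Context) error {
	select {
	case <-time.After(500 * time.Millisecond): // pretend the disk is slow
		return nil
	case <-ctx.Done():
		return ctx.Err() // the uplink hung up first
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	// The customer's uplink gives up after 100ms.
	go func() { time.Sleep(100 * time.Millisecond); cancel() }()

	fmt.Println(handleDownload(ctx)) // prints: context canceled
}
```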
