Storage node to Satellite : I've lost this block, sorry

The score in the earnings calculator shows how close you are to disqualification, and you are very close: at 100% you’ll be disqualified. I suggest stopping your node while you debug, since you’re right on the edge of being disqualified.

If you have recently removed and recreated the container, your logs from before that are gone. I recommend redirecting the logs to a file to prevent losing vital information for debugging this issue. In the meantime, stop your node. Check all paths, check your identity, make sure your databases are OK and check the file system.
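
If it helps, something like this would save what the current container still has and sanity-check the databases (just a sketch; I’m assuming the container for this node is called storagenode1, its data lives under /data/data1 and sqlite3 is installed, so adjust names and paths to your setup):

# dump whatever log history the current container still holds before it gets removed
docker logs storagenode1 > /data/data1/node.log 2>&1

# quick integrity check of every storagenode database
for db in /data/data1/storage/*.db; do echo -n "$db: "; sqlite3 "$db" "PRAGMA integrity_check;"; done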

Also, have you recently set up other nodes?

As usual, these are all good sensible points. Thank you.
Yes, I have created more nodes here. There are 10 now on this Pi 4.

Make sure you haven’t used the same identity on any of the new nodes or the same port. Either of those could lead to the wrong node receiving data. If you find a mistake like that, write down what went wrong and try to determine where data from this node might have ended up.
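
A quick way to rule both of those out (a sketch; I’m assuming the identities sit in per-node folders like /data/dataN/identity, which may not match your layout):

# identical checksums here would mean two nodes are sharing an identity
md5sum /data/data*/identity/identity.cert | sort

# duplicate host ports here would mean two containers are mapped to the same port
docker ps --format '{{.Names}}\t{{.Ports}}'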

I can’t really think of anything else to check; better redirect those logs so that you actually have something to use for debugging in the future. We’re kind of diagnosing blind right now.

10 nodes… how much RAM does that thing have… i mean, if memory serves… xD no pun intended… the RPi 4 comes with 1GB, 2GB, 4GB or 8GB… if you are running 10 nodes on a 1GB model it might be asking for trouble, even if it isn’t what caused the current issue…

i would argue that even on the 4GB version i wouldn’t feel too safe… yesterday, because of issues, my node reached 2.1GB RAM utilization… without that spare memory the node would have run into some sort of trouble.
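
something like this shows how tight it actually is (one snapshot is enough):

free -h                   # total / used / available RAM on the Pi
docker stats --no-stream  # memory use per container, single snapshot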

Success Rate: 88.230% … this is usually a hint of high load on the drive, drives or system
i think the lowest i’ve ever seen mine is 94-96%

it should be close to the 99% range in most cases… maybe a few % off
might not seem like much, but i would say that could easily be a pointer towards what is wrong
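
a rough way to check it yourself is just counting log lines, no script needed (a sketch, assuming the logs have been redirected to node.log):

# regular downloads that finished vs failed
grep '"Action": "GET"' node.log | grep -c downloaded
grep '"Action": "GET"' node.log | grep -c "download failed"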

Well, if you go back to the thread title, I think the problem is due to past bad sectors (on a drive that is not currently in use).
The other 9 nodes don’t have the problem. (5 of these were created recently while testing identity creation.)

All I keep going back to is that all the audits are good and no failures are logged.

Maybe there were no bad sectors after all and these audit scores are just corrupt at the satellite end

ah right sorry.

all HDDs have SMART data unless it’s turned off… that should record any issues and write errors. not sure how good the logging is, but at least for the most recent operating time there will be a log of issues…
it also keeps a count of stuff like write errors spanning the lifetime of the disk… not aware if one can reset that…

a very good way to see if the disk has issues or has had issues…

smartctl is pretty good, ofc it’s not always clearly visible whether there are issues… then you should ofc benchmark it; SMART also has some integrated self-tests one can use.
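
for example (assuming the disk shows up as /dev/sda; behind a USB bridge on a Pi you may need to add something like -d sat):

smartctl -a /dev/sda           # all SMART attributes plus the device's error log
smartctl -t long /dev/sda      # start the extended self-test
smartctl -l selftest /dev/sda  # read the self-test results once it is done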

lots of stuff can cause bad data… it’s why server gear generally has redundancy on most things.
something as simple as a bad cable can sometimes give errors on a disk, even if it seems to run fine…

kevink recently had a file get corrupted, even tho he is running a raidz1 and there was no reason why it should have happened… we ended up concluding that it was most likely degradation of the data in RAM before it was written to disk.

it just goes to show how little it takes for an error to slip through, 300 bad files seems high tho…

since you say scrub… are you using zfs?
and how much ram does the RPI have?
and did replacing the hdd fix the issue?

i doubt it’s satellite related… i mean, i don’t think i’ve ever seen a bad file or a failed audit unless it was a local hardware / configuration / user fault…

The SMART error “unsafe shutdown” (or something like that) was recorded in SMART.

Yes, ECC RAM is normal for Sun servers, which is where Solaris and ZFS came from.

300 was an estimate; there were a few pages of errors when I copied the files to another disk.

No, I don’t use ZFS. I imagine a scrub would consist of trying to read all the blobs and checking the checksums that were sent with the data during ingress.
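
A poor man’s read scrub is still possible without ZFS: read every blob back and see which files throw I/O errors. It won’t verify checksums, but it will surface unreadable data. A sketch, assuming the blobs for this node are under /data/data1/storage/blobs:

# read every blob once; any file that cannot be read is reported in blob-read-errors.txt
find /data/data1/storage/blobs -type f -exec cat {} + > /dev/null 2> blob-read-errors.txt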

This RPi 4 has 4GB, but others are available with 1GB, 2GB, 4GB or 8GB.

The current disk does not have unsafe shutdown errors (I don’t think that error is even applicable to the new disk), but the storagenode continues to get bad audit scores.

I don’t see how it can be the satellite at fault, but perhaps, after the audit data is received, the satellite checks the data, finds that it is bad, and does not inform the storagenode.

The cryptography guarantees that the hash matches the piece, and the order is signed by the uplink and the storagenode; otherwise the order will be rejected after submission and will not be counted as valid.
The only possible explanation is that the storagenode corrupted this piece and thus the hash is wrong.
Please provide the error from the log for the failed audits.

There are no audit errors in the storagenode log. There are many successful audits in the log.

I agree that the storagenode has corrupted some data.

Please try to enable the debug level for logs, and make sure that you have redirected them to a file so that you are able to check them at any time.
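
For example, these two lines in config.yaml (the log path is whatever suits you), followed by a restart of the node:

log.level: debug
log.output: "/app/config/node.log"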

OK, debug is enabled in config.yaml. What should I look out for?

root@raspberrypi1:/data/data1# grep DEBUG node.log | awk '{ print $2 " " $3 " " $4 }' | sort -u
DEBUG Anonymized tracing
DEBUG blobscache deleted
DEBUG contact:chore Starting
DEBUG contact:endpoint pinged
DEBUG Initialized batcher
DEBUG orders cleaning
DEBUG orders cleanup
DEBUG orders sending
DEBUG pieces deleted
DEBUG pieces:trash starting
DEBUG servers started
DEBUG services started
DEBUG trust Fetched
DEBUG trust Satellite
DEBUG version Allowed
DEBUG Version info
DEBUG version Running

Something related to audits.
Is the Audit score recovering?

It recovered from 80% DQ to 10% DQ but is now back to 40% DQ (using earnings.py figures).
The audit score is 81.8% on the dashboard. Still no audit errors logged.

Please count started audit downloads and finished audits.
If you find audits that are started but never finished, those will also count as audit failures after 3 retries.
That would mean your node is failing audits because of continuous timeouts. It could be the disk (more likely) or your internet connection (rare, because this is happening exactly on audits).
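
Something like this gives the numbers (assuming the log is in node.log):

grep GET_AUDIT node.log | grep -c "download started"   # audits started
grep GET_AUDIT node.log | grep -c downloaded           # audits finished
grep GET_AUDIT node.log | grep -c "download failed"    # audits failed outright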

I already tried matching the counts of started and downloaded audits; that’s in post 38 above.
Currently saltlake is at
Status: WARNING >> Audit[52.9% DQ|0.0% Susp]
root@raspberrypi1:/data/data1# grep -i audit node.log | grep -i started | wc
619 8047 136535
root@raspberrypi1:/data/data1# grep -i audit node.log | grep -i downloaded | wc
619 7428 132821

I think that, to improve knowledge and make progress, a node has to die…
cd data9
root@raspberrypi1:/data/data9# python ~/storj_earnings/earnings.py
August 2020 (Version: 9.3.0) [snapshot: 2020-08-20 17:26:12Z]
TYPE PRICE DISK BANDWIDTH PAYOUT
Upload Ingress -not paid- 4.74 GB
Upload Repair Ingress -not paid- 27.32 GB
Download Egress 20 USD / TB 1.58 GB 0.03 USD
Download Repair Egress 10 USD / TB 2.32 MB 0.00 USD
Download Audit Egress 10 USD / TB 17.15 KB 0.00 USD
Disk Current Storage -not paid- 31.01 GB
Disk Average Month Storage 1.50 USD / TBm 5.08 GBm 0.01 USD
Disk Usage Storage -not paid- 3.71 TBh
_______________________________________________________________________________________________________+
Total 5.08 GBm 33.64 GB 0.04 USD
Estimated total by end of month 7.98 GBm 52.87 GB 0.06 USD

Payout and held amount by satellite:
SATELLITE MONTH JOINED HELD TOTAL EARNED HELD% HELD PAID
us-central-1 1 2020-08-09 0.00 USD 0.0022 USD 75% 0.0017 USD 0.0006 USD
Status: 11% Vetted >> Audit[0.0% DQ|0.0% Susp]
europe-west-1 1 2020-08-09 0.00 USD 0.0015 USD 75% 0.0012 USD 0.0004 USD
Status: 8% Vetted >> Audit[0.0% DQ|0.0% Susp]
europe-north-1 1 2020-08-09 0.00 USD 0.0004 USD 75% 0.0003 USD 0.0001 USD
Status: 14% Vetted >> Audit[0.0% DQ|0.0% Susp]
asia-east-1 1 2020-08-09 0.00 USD 0.0018 USD 75% 0.0014 USD 0.0005 USD
Status: 9% Vetted >> Audit[0.0% DQ|0.0% Susp]
saltlake 1 2020-08-09 0.00 USD 0.0332 USD 75% 0.0249 USD 0.0083 USD
Status: 18% Vetted >> Audit[0.0% DQ|0.0% Susp]
stefan-benten 1 2020-08-09 0.00 USD 0.0001 USD 75% 0.0001 USD 0.0000 USD
Status: 5% Vetted >> Audit[0.0% DQ|0.0% Susp]
_____________________________________________________________________________________________________+
TOTAL 0.00 USD 0.0393 USD 0.0295 USD 0.0098 USD

Direct logs to node.log, set debug and restart node …

root@raspberrypi1:/data/data9# docker stop -t 300 storagenode9
storagenode9
root@raspberrypi1:/data/data9# docker rm storagenode9
storagenode9
root@raspberrypi1:/data/data9# ~/go9.sh

ooo new version :
2020-08-20T17:29:09.452Z DEBUG version Running on allowed version. {“Version”: “1.10.1”}

cd storage/blobs
find . | wc
34163 34163 3582430
# find . -type f -exec dd if=/dev/zero of={} bs=1 count=1 seek=512 \; #lol

first error seen

2020-08-20T18:02:59.392Z INFO piecestore download started {“Piece ID”: “RLQP6ZWP5A67JF5XDCBY7IOTRUJ2C44HL7SRCRSJSVUSIUEBAWAA”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET”}
2020-08-20T18:02:59.397Z ERROR piecestore download failed {“Piece ID”: “RLQP6ZWP5A67JF5XDCBY7IOTRUJ2C44HL7SRCRSJSVUSIUEBAWAA”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET”, “error”: “requested more data than available, requesting=512 available=1”, “errorVerbose”: “requested more data than available, requesting=512 available=1\n\tstorj.io/common/rpc/rpcstatus.Errorf:87\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:545\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51”}

I happen to have a recent node which lost as much as 5% of its files, because of a dying disk (or dying enclosure, not sure yet, tests are on-going).

Switched everything to a new disk, but rsync could copy only 95% of the files (that is roughly 40000 files out of ~42100 files - yes, more than 2000 files lost!).
(also, the config.yaml was only binary gibberish and I lost half of the databases after the copy was “complete”… Luckily these are recoverable)

3 satellites did notice something was off, as its Audit score went down to 80% for a couple of days, but it seems it is slowly going back up now, as it’s between 88% and 93% on those satellites.
Edit: In fact one of them went back down a bit recently: to 85%. And all audit failures I see since the “rsync” operation are “file does not exist”, as expected…

I’m surprised this node is surviving this incident for the moment. We’ll see how it goes…


Once an audit of a file fails, I guess the satellite is never going to audit this file on this node again, and it registers the fact that this node doesn’t have it anymore, right? If that’s the case, I guess the node will have less and less chance of failing future audits, and it should be okay.

Negative. The same file can get audited multiple times. The chance of hitting the same file again is almost 0 for your storage node: you are simply holding too many pieces to hit it again.

What is slowly increasing your audit score is the repair job. The audit system notices the lost pieces, but we can tolerate it. We trigger repair at 35 (currently 52) because we expect a low error rate, exactly for the reasons you have mentioned. The repair job will notice that you can’t deliver the piece, and while it is at it, it will replace that piece. That way the missing pieces get replaced over time, which also means fewer audit errors. As soon as the repair job has touched all pieces you should get back to a 100% audit score.

Lessons learned

Truncated files do not cause audit failures

Here is the error in the log

2020-08-21T04:11:44.264Z INFO piecestore download started {“Piece ID”: “RSUABYEMSEBEUWYHF77AM65ZR5QLCR476VSKQTPHA7WJKTTA3ZIA”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET_AUDIT”}
2020-08-21T04:11:44.287Z ERROR piecestore download failed {“Piece ID”: “RSUABYEMSEBEUWYHF77AM65ZR5QLCR476VSKQTPHA7WJKTTA3ZIA”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET_AUDIT”, “error”: “requested more data than available, requesting=512 available=1”, “errorVerbose”: “requested more data than available, requesting=512 available=1\n\tstorj.io/common/rpc/rpcstatus.Errorf:87\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:545\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51”}

and here is the result
us-central-1 1 2020-08-09 0.00 USD 0.0022 USD 75% 0.0017 USD 0.0006 USD
Status: 11% Vetted >> Audit[0.0% DQ|0.0% Susp]
europe-west-1 1 2020-08-09 0.00 USD 0.0016 USD 75% 0.0012 USD 0.0004 USD
Status: 9% Vetted >> Audit[0.0% DQ|29.0% Susp]
europe-north-1 1 2020-08-09 0.00 USD 0.0004 USD 75% 0.0003 USD 0.0001 USD
Status: 14% Vetted >> Audit[0.0% DQ|22.3% Susp]
asia-east-1 1 2020-08-09 0.00 USD 0.0019 USD 75% 0.0014 USD 0.0005 USD
Status: 10% Vetted >> Audit[0.0% DQ|50.1% Susp]
saltlake 1 2020-08-09 0.00 USD 0.0335 USD 75% 0.0252 USD 0.0084 USD
Status: 18% Vetted >> Audit[0.0% DQ|0.0% Susp]
stefan-benten 1 2020-08-09 0.00 USD 0.0001 USD 75% 0.0001 USD 0.0000 USD
Status: 5% Vetted >> Audit[0.0% DQ|0.0% Susp]

Surprising that completely obliterating the data results in only suspension!

Using the wrong dbs copied from another node

2020-08-21T06:23:03.258Z ERROR orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs rpc client error when receiving new order settlements {“error”: “order: failed to receive settlement response: only specified storage node can settle order”, “errorVerbose”: “order: failed to receive settlement response: only specified storage node can settle order\n\tstorj.io/storj/storagenode/orders.(*Service).settle:330\n\tstorj.io/storj/storagenode/orders.(*Service).Settle:221\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrders.func2:200\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}

Ah. Is that so.

So, what we’re saying here is that as long as a “segment” (made of 80 pieces, if I’m not mistaken) does not fall under the repair threshold (52 pieces currently), the satellite just “remembers” any audit errors as unhealthy/missing pieces, but may still audit them (in case they become available again on the nodes’ side, who knows).

On the other hand, once a segment falls below the repair threshold, all missing pieces get repaired elsewhere, and from that point on the initially missing or corrupted pieces will not get audited anymore, as they’ve been repaired on other nodes; and as a side effect, any old corrupted pieces would eventually get removed from the nodes thanks to garbage collection.

Is this correct? :slight_smile: