Krawi
August 29, 2020, 6:08pm
1
Hi Operators
One of my nodes is getting failed audits from the saltlake satellite, but they don't show up in the logs.
A query via the API shows me:
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7017,
  "successCount": 7012,
  "alpha": 18.33982158510172,
  "beta": 1.6601784148982586,
  "unknownAlpha": 19.99999999999995,
  "unknownBeta": 0,
  "score": 0.916991079255087,
  "unknownScore": 1
}
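(For reference, I pull these numbers from the node dashboard API, with something like:
docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | jq .id,.audit
jq is only used for formatting the output.)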
I switched to external logs with logrotate yesterday and have gotten 2 failed audits since then.
When I search for audits from saltlake with:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT
the result is over 40 successful audits, each with a matching pair of "download started" and "downloaded" lines (no missing lines).
But a search specifically for failed audits with:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep failed
shows nothing.
In Grafana I see some uptime errors for saltlake.
Are there perhaps connection issues between the node and saltlake?
(The node is running on a VPS with a Gigabit connection in Germany,
ID: 12d9NBMWj25DpWG4nN2BWfXLy7Jidb83QjdGWBn6DP9mabvmtkf)
traceroute results from node to saltlake:
traceroute to saltlake.tardigrade.io (34.94.235.192), 30 hops max, 60 byte packets
1 ip-173-212-238-3.static.contabo.net (173.212.238.3) 0.592 ms 0.554 ms 0.593 ms
2 * * GIGA-HOSTING.edge7.Frankfurt1.Level3.net (195.16.162.233) 6.486 ms
3 * * *
4 72.14.242.122 (72.14.242.122) 9.672 ms 6.931 ms 10.144 ms
5 108.170.251.144 (108.170.251.144) 19.850 ms 108.170.252.19 (108.170.252.19) 7.176 ms 108.170.252.65 (108.170.252.65) 18.436 ms
6 108.170.228.9 (108.170.228.9) 18.423 ms 108.170.229.168 (108.170.229.168) 18.863 ms 108.170.251.144 (108.170.251.144) 6.822 ms
7 209.85.242.79 (209.85.242.79) 9.429 ms 9.453 ms 108.170.228.9 (108.170.228.9) 15.599 ms
8 142.250.60.102 (142.250.60.102) 93.886 ms 209.85.253.185 (209.85.253.185) 16.133 ms 16.237 ms
9 209.85.242.252 (209.85.242.252) 93.923 ms 142.250.60.102 (142.250.60.102) 94.125 ms 216.239.48.9 (216.239.48.9) 103.411 ms
10 216.239.48.9 (216.239.48.9) 100.782 ms 100.538 ms *
11 * * *
12 108.170.231.153 (108.170.231.153) 155.637 ms 209.85.251.183 (209.85.251.183) 156.341 ms 209.85.249.51 (209.85.249.51) 122.227 ms
13 108.170.234.24 (108.170.234.24) 155.756 ms 209.85.251.129 (209.85.251.129) 154.211 ms 108.170.234.125 (108.170.234.125) 154.086 ms
14 108.170.234.39 (108.170.234.39) 154.125 ms 209.85.245.172 (209.85.245.172) 156.754 ms 108.170.234.26 (108.170.234.26) 148.569 ms
15 108.170.247.161 (108.170.247.161) 153.568 ms 74.125.252.196 (74.125.252.196) 153.856 ms 153.492 ms
16 * 209.85.254.66 (209.85.254.66) 153.836 ms *
17 * * *
18 * * *
19 * * *
20 * * *
21 * 192.235.94.34.bc.googleusercontent.com (34.94.235.192) 153.520 ms *
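If it helps, I could also collect per-hop loss statistics with mtr (assuming it is installed), e.g.:
mtr -rwc 100 saltlake.tardigrade.io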
Any ideas how to solve this?
Those 5 failed audits are probably lost in logs that were not redirected. Look out for them in the redirected log.
Krawi
August 29, 2020, 6:33pm
3
The redirection was yesterday; the count of failed audits was 3 when I switched.
Try searching just for GET_AUDIT and failed together.
Krawi
August 29, 2020, 7:45pm
5
The logfile has altogether 6 failed entries, all belonging to GET or GET_REPAIR requests.
Alexey
August 29, 2020, 7:54pm
6
Your node could fail an audit because of a timeout.
The timeout for audits is 5 minutes. So add 5 minutes to the download start and look at what you find there. You can also take a look at: Guide to debug my storage node
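For example, a quick (unofficial) check for GET_AUDIT downloads from saltlake that have a "download started" line but no matching "downloaded" line could look like this:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "download started" \
  | grep -o '"Piece ID": "[^"]*"' | cut -d'"' -f4 \
  | while read id; do
      grep "$id" /var/storj/storagenode.log | grep -q downloaded || echo "$id: started but never finished"
    done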
Krawi
August 29, 2020, 8:42pm
7
OK, I can add 5 minutes to the download start, but all audits from saltlake are logged as successful. There isn't an unresolved audit, so which time should I take as the start time?
Here are all lines with audits from saltlake:
https://pastebin.com/DGL8WN3u
And all failed requests:
https://pastebin.com/xK8Xqeqh
Krawi
August 29, 2020, 9:30pm
8
Hmm, the last 2 failed requests are GET_REPAIR from saltlake; with some linebreaks added they look like this:
2020-08-29T03:16:04.253Z ERROR piecestore download failed
{"Piece ID": "PQBFFOIZBHYP6GZAZPQQNBWZX4SRDD6I464K5XSDYL6ZUUE4HUAQ",
"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
"Action": "GET_REPAIR",
"error": "pieces error: PieceHeader framing field claims impossible size of 3328 bytes",
"errorVerbose": "pieces error: PieceHeader framing field claims impossible size of 3328 bytes
\tstorj.io/storj/storagenode/pieces.(*Reader).GetPieceHeader:275
\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:455
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:528
\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004
\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107
\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56
\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111
\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62
\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99
\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2020-08-29T04:32:23.001Z ERROR piecestore download failed
{"Piece ID": "43Y2FE5ZCP4KHLLPCWHQ6KEZLE2MU2T3FGVXFQFN5ENDYD2LWHWA",
"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
"Action": "GET_REPAIR",
"error": "pieces error: PieceHeader framing field claims impossible size of 36768 bytes",
"errorVerbose": "pieces error: PieceHeader framing field claims impossible size of 36768 bytes
\tstorj.io/storj/storagenode/pieces.(*Reader).GetPieceHeader:275
\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:455
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:528
\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004
\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107
\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56
\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111
\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62
\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99
\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
Alexey
August 29, 2020, 11:10pm
9
GET and GET_REPAIR have nothing to do with failed audits. Only failed GET_AUDIT requests have an influence on the audit score.
Krawi
September 7, 2020, 10:26am
10
Today there was another failed audit, but again it didn't show up in the logs:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "download started" | wc -l
Number of lines = 11
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "downloaded" | wc -l
Number of lines = 11
Since the number of lines is equal, every audit was successful.
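To make sure the counts really pair up (and don't just happen to be equal), the piece IDs on both sides could also be compared, roughly like this (bash):
diff <(grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "download started" | grep -o '"Piece ID": "[^"]*"' | sort) \
     <(grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep downloaded | grep -o '"Piece ID": "[^"]*"' | sort)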
I will change the log level from info to debug and wait for the next failed audit.
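(If I'm not mistaken, that is just the log.level setting in config.yaml, changed from info to debug, plus a restart of the container.)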
I don’t think setting debug will have any effect.
The pieces aren't checked for corruption when they are sent from the storagenode. So if you send bad data, there is no log entry for it.
I didn’t want storagenode9 to get suspended, as this doesn’t help with the question of storagenode1. So I’ve put the data back. Just not the data it will be expecting.
I think it is conclusive …
If the blobs are corrupt then audits pass at the storagenode end
2020-08-21T17:50:08.567Z INFO piecestore download started {"Piece ID": "W2D7LKOATVEZPNSI7EDPUJTHXIC4YEES6PUBVRPXT6RYOBUJN35A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-0…
I don’t find that nodeID in the logs. It doesn’t look like there was any audit error in the last few days.
Krawi
September 9, 2020, 10:24am
13
Next steps done so far:
Enabled fsck at every boot and rebooted. (But without logs from fsck I can't say anything about possible problems that were repaired.)
Made a simple script that logs the audit counters via the API and set up a cronjob to run it every 5 minutes (a sketch of it is below).
Now I'm waiting for the next failed audit.
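A sketch of the script and the cron entry (the file name is just what I picked; the API call is the one I post further down):
#!/bin/sh
# /var/storj/audit-check.sh - append a timestamp and the saltlake audit counters to a log file
date -u +%Y-%m-%dT%H:%M:%SZ >> /var/storj/saltlake-audits.log
docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | jq .id,.audit >> /var/storj/saltlake-audits.log
date -u +%Y-%m-%dT%H:%M:%SZ >> /var/storj/saltlake-audits.log
And in the crontab:
*/5 * * * * /var/storj/audit-check.sh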
Krawi
September 15, 2020, 10:24am
14
Next one: today 2 of 11 audits from saltlake failed (according to the API) between 07:50 and 07:55 (UTC), but the log shows only 6 GET_REPAIR requests (all successful) from saltlake during that time.
Is anybody from STORJ interested in seeing the logs?
We don’t have failed audits in the API.
Krawi
September 15, 2020, 10:50am
16
https://forum.storj.io/t/script-for-audits-stat-by-satellites/988
I have changed it to check only for saltlake audits:
docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | jq .id,.audit >>/var/storj/saltlake-audits.log
Result:
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7563,
  "successCount": 7552,
  "alpha": 18.185474,
  "beta": 1.814526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.9092737,
  "unknownScore": 1
}
The difference between totalCount and successCount is the number of unsuccessful audits, now 11.
My Grafana dashboard shows the same number; the STORJ dashboard shows a score of 90.9% (0.9092737 × 100).
That part is not correct.
You should watch the beta and unknownBeta values. They increase when you fail an audit, and they are also used for calculating your score.
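For example, the numbers posted above fit score = alpha / (alpha + beta): 18.185474 / (18.185474 + 1.814526) = 18.185474 / 20 = 0.9092737, which is exactly the score the API reports.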
Krawi
September 15, 2020, 11:38am
19
Good to know. But the beta value has increased, so something has happened.
From my own log:
2020-09-15T07:50:01Z
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7557,
  "successCount": 7548,
  "alpha": 19.999974,
  "beta": 2.6e-05,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.9999987000000001,
  "unknownScore": 1
}
2020-09-15T07:50:02Z
2020-09-15T07:55:01Z
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7563,
  "successCount": 7552,
  "alpha": 18.185474,
  "beta": 1.814526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.9092737,
  "unknownScore": 1
}
2020-09-15T07:55:03Z
What is your nodeID? I don’t see anything for the nodeID you posted earlier. Please double check the ID.
Krawi
September 15, 2020, 11:45am
21
Node ID: 12d9NBMWj25DpWG4nN2BWfXLy7Jidb83QjdGWBn6DP9mabvmtkf
(Copied and pasted from the dashboard using the button to the right of the ID.)