Krawi
August 29, 2020, 6:08pm
1
Hi Operators
One of my nodes is getting failed audits from the saltlake satellite, but they don't show up in the logs.
A query via the API shows me:
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7017,
  "successCount": 7012,
  "alpha": 18.33982158510172,
  "beta": 1.6601784148982586,
  "unknownAlpha": 19.99999999999995,
  "unknownBeta": 0,
  "score": 0.916991079255087,
  "unknownScore": 1
}
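(For reference, I pull these numbers from the node dashboard API, with something like:
docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | jq .id,.audit
jq is only used for formatting the output.)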
I switched to external logs with logrotate yesterday and have gotten 2 failed audits since then.
When I search for audits from saltlake with:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT
the result is over 40 successful audits, each with a matching pair of "download started" and "downloaded" lines (no missing lines).
But a search specifically for failed audits with:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep failed
shows nothing.
In Grafana I see some uptime errors for saltlake.
Are there perhaps connection issues between the node and saltlake?
(The node is running on a VPS with a Gigabit connection in Germany,
ID: 12d9NBMWj25DpWG4nN2BWfXLy7Jidb83QjdGWBn6DP9mabvmtkf)
traceroute results from node to saltlake:
traceroute to saltlake.tardigrade.io (34.94.235.192), 30 hops max, 60 byte packets
1 ip-173-212-238-3.static.contabo.net (173.212.238.3) 0.592 ms 0.554 ms 0.593 ms
2 * * GIGA-HOSTING.edge7.Frankfurt1.Level3.net (195.16.162.233) 6.486 ms
3 * * *
4 72.14.242.122 (72.14.242.122) 9.672 ms 6.931 ms 10.144 ms
5 108.170.251.144 (108.170.251.144) 19.850 ms 108.170.252.19 (108.170.252.19) 7.176 ms 108.170.252.65 (108.170.252.65) 18.436 ms
6 108.170.228.9 (108.170.228.9) 18.423 ms 108.170.229.168 (108.170.229.168) 18.863 ms 108.170.251.144 (108.170.251.144) 6.822 ms
7 209.85.242.79 (209.85.242.79) 9.429 ms 9.453 ms 108.170.228.9 (108.170.228.9) 15.599 ms
8 142.250.60.102 (142.250.60.102) 93.886 ms 209.85.253.185 (209.85.253.185) 16.133 ms 16.237 ms
9 209.85.242.252 (209.85.242.252) 93.923 ms 142.250.60.102 (142.250.60.102) 94.125 ms 216.239.48.9 (216.239.48.9) 103.411 ms
10 216.239.48.9 (216.239.48.9) 100.782 ms 100.538 ms *
11 * * *
12 108.170.231.153 (108.170.231.153) 155.637 ms 209.85.251.183 (209.85.251.183) 156.341 ms 209.85.249.51 (209.85.249.51) 122.227 ms
13 108.170.234.24 (108.170.234.24) 155.756 ms 209.85.251.129 (209.85.251.129) 154.211 ms 108.170.234.125 (108.170.234.125) 154.086 ms
14 108.170.234.39 (108.170.234.39) 154.125 ms 209.85.245.172 (209.85.245.172) 156.754 ms 108.170.234.26 (108.170.234.26) 148.569 ms
15 108.170.247.161 (108.170.247.161) 153.568 ms 74.125.252.196 (74.125.252.196) 153.856 ms 153.492 ms
16 * 209.85.254.66 (209.85.254.66) 153.836 ms *
17 * * *
18 * * *
19 * * *
20 * * *
21 * 192.235.94.34.bc.googleusercontent.com (34.94.235.192) 153.520 ms *
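If it helps, I could also collect per-hop loss statistics with mtr (assuming it is installed), e.g.:
mtr -rwc 100 saltlake.tardigrade.io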
Any ideas how to solve this?
Those 5 failed audits are probably lost in logs that were not redirected. Look out for them in the redirected log.
Krawi
August 29, 2020, 6:33pm
3
The redirection was yesterday; the count of failed audits was 3 when I switched.
Try searching just for GET_AUDIT and failed together.
Krawi
August 29, 2020, 7:45pm
5
The logfile has altogether 6 failed entries, all belonging to GET or GET_REPAIR requests.
Alexey
August 29, 2020, 7:54pm
6
Your node could fail an audit because of a timeout.
The timeout for audits is 5 minutes. So add 5 minutes to the download start and look at what you find there. You can also take a look at: Guide to debug my storage node
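For example, a quick (unofficial) check for GET_AUDIT downloads from saltlake that have a "download started" line but no matching "downloaded" line could look like this:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "download started" \
  | grep -o '"Piece ID": "[^"]*"' | cut -d'"' -f4 \
  | while read id; do
      grep "$id" /var/storj/storagenode.log | grep -q downloaded || echo "$id: started but never finished"
    done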
Krawi
August 29, 2020, 8:42pm
7
OK, I can add 5 minutes to the download start, but all audits from saltlake are logged as successful. There isn't an unresolved audit, so which time should I take as the start time?
Here are all lines with audits from saltlake:
https://pastebin.com/DGL8WN3u
And all failed requests:
https://pastebin.com/xK8Xqeqh
Krawi
August 29, 2020, 9:30pm
8
Hmm, the last 2 failed requests are GET_REPAIR from saltlake; with some linebreaks added they look like this:
2020-08-29T03:16:04.253Z ERROR piecestore download failed
{"Piece ID": "PQBFFOIZBHYP6GZAZPQQNBWZX4SRDD6I464K5XSDYL6ZUUE4HUAQ",
"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
"Action": "GET_REPAIR",
"error": "pieces error: PieceHeader framing field claims impossible size of 3328 bytes",
"errorVerbose": "pieces error: PieceHeader framing field claims impossible size of 3328 bytes
\tstorj.io/storj/storagenode/pieces.(*Reader).GetPieceHeader:275
\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:455
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:528
\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004
\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107
\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56
\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111
\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62
\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99
\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2020-08-29T04:32:23.001Z ERROR piecestore download failed
{"Piece ID": "43Y2FE5ZCP4KHLLPCWHQ6KEZLE2MU2T3FGVXFQFN5ENDYD2LWHWA",
"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE",
"Action": "GET_REPAIR",
"error": "pieces error: PieceHeader framing field claims impossible size of 36768 bytes",
"errorVerbose": "pieces error: PieceHeader framing field claims impossible size of 36768 bytes
\tstorj.io/storj/storagenode/pieces.(*Reader).GetPieceHeader:275
\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:455
\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:528
\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004
\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107
\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56
\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111
\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62
\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99
\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
Alexey
August 29, 2020, 11:10pm
9
GET and GET_REPAIR have nothing to do with failed audits. Only failed GET_AUDIT requests have an influence on the audit score.
Krawi
September 7, 2020, 10:26am
10
Today there was another failed audit, but again it didn't show up in the logs:
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "download started" | wc -l
Number of lines = 11
grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "downloaded" | wc -l
Number of lines = 11
Since the number of lines is equal, every audit was successful.
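To make sure the counts really pair up (and don't just happen to be equal), the piece IDs on both sides could also be compared, roughly like this (bash):
diff <(grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep "download started" | grep -o '"Piece ID": "[^"]*"' | sort) \
     <(grep GET_AUDIT /var/storj/storagenode.log | grep 1wFT | grep downloaded | grep -o '"Piece ID": "[^"]*"' | sort)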
I will change the log level from info to debug and wait for the next failed audit.
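(If I'm not mistaken, that is just the log.level setting in config.yaml, changed from info to debug, plus a restart of the container.)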
I don’t think setting debug will have any effect.
The pieces aren't checked for corruption when they are sent from the storagenode. So if you send bad data, there is no log entry for it.
I didn’t want storagenode9 to get suspended, as this doesn’t help with the question of storagenode1. So I’ve put the data back. Just not the data it will be expecting.
I think it is conclusive …
If the blobs are corrupt then audits pass at the storagenode end
2020-08-21T17:50:08.567Z INFO piecestore download started {"Piece ID": "W2D7LKOATVEZPNSI7EDPUJTHXIC4YEES6PUBVRPXT6RYOBUJN35A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-0…
I don’t find that nodeID in the logs. It doesn’t look like there was any audit error in the last few days.
Krawi
September 9, 2020, 10:24am
13
Next steps done so far:
Enabled fsck at every boot and rebooted. (But without logs from fsck I can't say anything about possible problems that were repaired.)
Made a simple script that logs the audit counters via the API and set up a cronjob to run it every 5 minutes (a sketch of it is below).
Now I'm waiting for the next failed audit.
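A sketch of the script and the cron entry (the file name is just what I picked; the API call is the one I post further down):
#!/bin/sh
# /var/storj/audit-check.sh - append a timestamp and the saltlake audit counters to a log file
date -u +%Y-%m-%dT%H:%M:%SZ >> /var/storj/saltlake-audits.log
docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | jq .id,.audit >> /var/storj/saltlake-audits.log
date -u +%Y-%m-%dT%H:%M:%SZ >> /var/storj/saltlake-audits.log
And in the crontab:
*/5 * * * * /var/storj/audit-check.sh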
Krawi
September 15, 2020, 10:24am
14
Next one: today 2 of 11 audits from saltlake failed (according to the API) between 07:50 and 07:55 (UTC), but the log shows only 6 GET_REPAIR requests (all successful) from saltlake during that time.
Is anybody from STORJ interested in seeing the logs?
We don’t have failed audits in the API.
Krawi
September 15, 2020, 10:50am
16
https://forum.storj.io/t/script-for-audits-stat-by-satellites/988
I have changed it to check only for saltlake audits:
docker exec -i storagenode wget -qO - localhost:14002/api/sno/satellite/1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | jq .id,.audit >>/var/storj/saltlake-audits.log
Result:
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7563,
  "successCount": 7552,
  "alpha": 18.185474,
  "beta": 1.814526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.9092737,
  "unknownScore": 1
}
The difference between totalCount and successCount is the number of unsuccessful audits, now 11.
My Grafana dashboard shows the same number; the STORJ dashboard shows a score of 90.9% (0.9092737 × 100).
That part is not correct.
You should watch the beta and unknownBeta values. They increase when you fail an audit, and they are also used for calculating your score.
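For example, the numbers posted above fit score = alpha / (alpha + beta): 18.185474 / (18.185474 + 1.814526) = 18.185474 / 20 = 0.9092737, which is exactly the score the API reports.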
Krawi
September 15, 2020, 11:38am
19
Good to know. But the beta value has increased, so something has happened.
From my own log:
2020-09-15T07:50:01Z
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7557,
  "successCount": 7548,
  "alpha": 19.999974,
  "beta": 2.6e-05,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.9999987000000001,
  "unknownScore": 1
}
2020-09-15T07:50:02Z
2020-09-15T07:55:01Z
"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 7563,
  "successCount": 7552,
  "alpha": 18.185474,
  "beta": 1.814526,
  "unknownAlpha": 20,
  "unknownBeta": 0,
  "score": 0.9092737,
  "unknownScore": 1
}
2020-09-15T07:55:03Z
What is your nodeID? I don’t see anything for the nodeID you posted earlier. Please double check the ID.
Krawi
September 15, 2020, 11:45am
21
Node ID: 12d9NBMWj25DpWG4nN2BWfXLy7Jidb83QjdGWBn6DP9mabvmtkf
(Copied and pasted from the dashboard using the button to the right of the ID.)