Disqualified for unknown reason

Hi,
My node 1ThiQRJKrtNdedXG8DSpKBoAEhBnSUgFRgwCd3NW1ogqiz1wrn has been running for about two months and was suddenly disqualified yesterday (Saturday) on us-central and then this morning (Sunday, 8am GMT) on asia-east. The audit checks have always been at about 85% (no idea why they are so low) and uptime is above 99%.
There is no way that data has been corrupted at rest on my HDD, as I'm using a ZFS mirror and a scrub shows no errors either. My internet connection has been up for 32 days without hiccups.
What should I check from here?

Regards
Alex

Audit checks should be at 100%. If it is lower, that means you lost data, and that leads to disqualification.

As I said, there is absolutely no way that I lost data!

More accurately: if the audit score drops below 60%, it leads to disqualification on that particular satellite.

@Alestrix What does the dashboard show? Try searching the logs for “file does not exist”.
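For example, a rough sketch (the container name storagenode matches the commands used later in this thread, adjust if yours differs; the 2>&1 is needed because docker logs sends the container's stderr to the host's stderr):

docker logs storagenode 2>&1 | grep -i "file does not exist" | tail -n 50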

Logs only date back to January 1st (I have no permanent logs set up and watchtower updated the container back then) and there are no occurrences of "not exist". The dashboard does not give any useful info.

The audit score is used for disqualification. Unfortunately, you can only see it with scripts: Script for Audits stat by satellites
The dashboard shows the audit checks for the node's lifetime. The low audit checks suggest that you could have lost data a while ago, maybe at the start.

Audits are requested on random pieces of data which your node is supposed to have. If it can't give the correct hash, it is placed into containment mode and will not receive any new data until it answers the request for that same piece; it will be asked three more times. If the node is still not able to answer with the correct hash, the audit is considered failed.
The audit score falls very fast if your node fails several audits in a row. This usually means that a significant part of the data is corrupted, lost or inaccessible, and the result is disqualification.
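For a rough feel of how fast it falls: in the API output below, score appears to be alpha / (alpha + beta) (e.g. 12.307 / (12.307 + 7.406) ≈ 0.624), and a failed audit shifts weight from alpha to beta. A minimal sketch, assuming a forgetting factor of 0.95 and an audit weight of 1 (these parameters are assumptions based on the published reputation model, not confirmed in this thread):

awk 'BEGIN {
  lambda = 0.95; w = 1.0;   # assumed forgetting factor and audit weight
  alpha = 20; beta = 0;     # a fully healthy node: score = 1.0
  for (n = 1; n <= 12; n++) {
    alpha = lambda * alpha;       # failed audit: alpha only decays...
    beta  = lambda * beta + w;    # ...while beta decays and then gains the weight
    printf "failed audits in a row: %2d   score: %.3f\n", n, alpha / (alpha + beta)
  }
}'

Under these assumptions the score drops below the 0.6 disqualification threshold after roughly ten consecutive failed audits.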

Have you replaced the -v with the --mount option in your docker run command?
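For context, the difference roughly looks like this (paths, image tag and the elided flags are placeholders, not taken from this thread). With -v, a mistyped or not-yet-mounted host path is silently created as an empty directory, so the node starts with no data, which the satellites treat as data loss; --mount type=bind refuses to start if the source path does not exist:

# old style: docker silently creates /mnt/storj/storagenode if it is missing
docker run -d ... -v /mnt/storj/storagenode:/app/config ... storjlabs/storagenode:beta

# recommended: fails loudly if the source path is missing or not mounted yet
docker run -d ... --mount type=bind,source=/mnt/storj/storagenode,destination=/app/config ... storjlabs/storagenode:beta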

These are the results from the German jury:

for sat in $(docker exec -i storagenode wget -qO - localhost:14002/api/dashboard | jq .data.satellites[].id -r); do docker exec -i storagenode wget -qO - localhost:14002/api/satellite/$sat | jq .data.id,.data.audit,.data.uptime; done

"118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"
{
  "totalCount": 85,
  "successCount": 71,
  "alpha": 12.307473200731943,
  "beta": 7.405954134609243,
  "score": 0.6243193023400735
}
{
  "totalCount": 1140,
  "successCount": 1128,
  "alpha": 98.86614084050596,
  "beta": 1.1328231877341282,
  "score": 0.9886716507641597
}
"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"
{
  "totalCount": 120,
  "successCount": 102,
  "alpha": 11.71558933697745,
  "beta": 8.244084561827265,
  "score": 0.5869629632415506
}
{
  "totalCount": 1171,
  "successCount": 1164,
  "alpha": 99.34717212123873,
  "beta": 0.652069233150552,
  "score": 0.9934792581991729
}
"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"
{
  "totalCount": 127,
  "successCount": 108,
  "alpha": 11.765790542733754,
  "beta": 8.206048236795521,
  "score": 0.589119042698935
}
{
  "totalCount": 1176,
  "successCount": 1171,
  "alpha": 99.99129538047383,
  "beta": 0.007983155099024787,
  "score": 0.9999201678730494
}
"12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"
{
  "totalCount": 79,
  "successCount": 68,
  "alpha": 14.518973507025914,
  "beta": 5.150719005273801,
  "score": 0.7381393226125427
}
{
  "totalCount": 1133,
  "successCount": 1120,
  "alpha": 99.8033305517938,
  "beta": 0.1955579684560842,
  "score": 0.9980443985793254
}

Though I would still like to know what the error was and when it happened. There was no log message about missing files.

Please answer this question.

Unfortunately, you have logs only since the last update.
However, the audit score on all satellites is very low. This means that your storage either lost data or your node lost access to that data. There are no other known reasons.
When the audit score on a satellite falls below 0.6, the node is disqualified on that satellite.

Yes, it’s --mount. When I set up the node that was already the recommended parameter.

In the satellite logs I see a lot of "Verify: download timeout (contained)". The last one was for piece ID G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ, but there are lots of other piece IDs.

You will soon get disqualified on the other satellites as well, I guess.

Thanks @littleskunk for the info. What could be the reason for that?

I am also seeing quite a few of these
piecestore failed to add order {"error": "ordersdb error: database is locked"
and these
rpc client error when receiveing new order settlements {"error": "order: failed to receive settlement response: context deadline exceeded"
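In case the databases themselves are damaged, a rough way to check them is with the sqlite3 CLI (the storage path is an assumption; the node should be stopped first so the files aren't in use):

docker stop -t 300 storagenode
for db in /mnt/storj/storagenode/storage/*.db; do
  echo "== $db"
  sqlite3 "$db" "PRAGMA integrity_check;"   # prints "ok" if the database is intact
done
docker start storagenode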

The only log entries that contain the given piece id look like this:
2020-01-04T17:22:11.601Z INFO piecestore download started {"Piece ID": "G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2020-01-04T18:13:15.323Z INFO piecestore download started {"Piece ID": "G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2020-01-05T05:51:15.520Z INFO piecestore download started {"Piece ID": "G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2020-01-05T08:04:21.269Z INFO piecestore download started {"Piece ID": "G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2020-01-05T11:56:57.866Z INFO piecestore download started {"Piece ID": "G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
Not sure how to see whether that download succeeded, as the next lines in the log all seem unrelated.

The timeout for audits is 5 minutes. So add 5 minutes to the download start and look at what you find there. You can also take a look at: Guide to debug my storage node
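Something like this should show the outcome for that piece; on success the node normally logs a "downloaded" line and on failure a "download failed" line for the same piece ID (assuming the same container name as in the commands above):

docker logs storagenode 2>&1 | grep G57VNDIZNYFV2OSYB2OCF5LXUX7IALXI2TXVWNECJCPLDN7LXTEQ | grep -E "downloaded|download failed"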

There's unfortunately nothing helpful at the 5-minute mark (I checked three of the occurrences above). I'll save the logs for now and will do more digging (and configure log persistence) when I'm back home in about a week.

PS: Could the satellite(s) have a problem with my node not using the default port? I'm advertising port 9876 in the ADDRESS environment variable, and that's also where my node is reachable (it always shows as online as well).

No. The satellite has a problem because you don’t send the data within the 5 minute window.

My guess is that one of your drives is failing slowly. That's why you're missing data; there's no other reason to lose data unless you had some crashes and corrupted data stores.


Drive failure or file system corruption can be ruled out. This cannot happen on ZFS without being noticed (and triggering the respective warnings). And as this is a mirror, one failed drive wouldn't cause data loss either.
I’ll do more log file and statistics digging when I’m back from skiing :slight_smile:.
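For reference, that is roughly how it gets verified on ZFS (the pool name tank is a placeholder):

zpool status -v tank   # shows READ/WRITE/CKSUM error counters and any files with permanent errors
zpool scrub tank       # re-reads every block and verifies it against its checksum
zpool status tank      # check the result once the scrub has finished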

I got the same problem: my node was disqualified for an unknown reason.
I run docker and use an NFS share as storage. When my node got disqualified, I found that the storj container was using 3 GB+ of RAM, so that is probably somehow related to this problem too, as under normal conditions the container uses only a few hundred MB of RAM.
But as I see it, the storj node can no longer be stopped normally and has to be removed with docker rm --force. Is there some way to repair the node DB with a data recheck?

So I restarted the node and after some time found the first download failure in the logs.

Could it be that the storj node is very sensitive to storage latency, and when there is increased load on the network storage it marks uploads as failed because writing the data takes too long? That could explain this memory leak once uploads start to fail.

The only known reason for disqualification is missing or inaccessible data.

Neither NFS nor SMB is supported; they are not compatible with SQLite. The only compatible network protocol is iSCSI. However, any network-attached storage will have higher latency than locally attached storage, so your node will lose the race for pieces more often.
Consider using locally connected drives instead.

To stop the node you should use a timeout. In the case of a network-connected drive it should be 2-3 times greater:

docker stop -t 600 storagenode

@Alexey can you please elaborate on "NFS is not compatible with SQLite"? By itself, that statement doesn't make any technical sense to me [EDIT: consider this deleted], but StorJ might use SQLite in an unusual way that perhaps emphasizes some of NFS's shortcomings. I'm always eager to learn.

EDIT: Apparently it does make major sense due to some locking mechanisms. I’ll do more reading…