Suspension and Audit Warning

I just happened to take a look at my dashboard today and noticed the following:

I’m not aware of anything changing on my end, so I’m not sure what has caused the audit score to drop on those two. Is there anything I should check/do to rectify those?

Audit failures can only happen if you are missing data, or if the requested data isn’t delivered within 5 minutes, 5 times. Search your logs for lines that contain both “GET_AUDIT” and “failed”.
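For instance, if the logs had already been redirected to a file, that search would just be a plain grep; the path below is only a placeholder for wherever the log file actually lives:

grep "GET_AUDIT" /path/to/storagenode.log | grep "failed"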

Do you know how to access the logs from docker? I know the command:

docker logs -f storagenode

But that just scrolls the current log info (like a tail -f would). Is there a place where I can access the file to scroll through the historical data?

Is this a Linux node? Unless you redirect logs to a file (I recommend you do this, but don’t redirect until we have searched your current logs), there is no file to browse. You can grep the output of the docker logs command easily, though.

--edit-- Corrected command as per below

docker logs storagenode 2>&1 | grep "GET_AUDIT" | grep "failed"

On Windows you can simply enable the Linux toolbox and run the same command :slight_smile:
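A couple of related sketches, assuming the container is named storagenode as in the commands above; the log.output option name is from memory, so please verify it against the docs before relying on it:

# Page through the historical output instead of tailing it
docker logs storagenode 2>&1 | less

# Or limit it to a recent window first
docker logs --since 24h storagenode 2>&1 | less

# For later, once the current logs have been searched: one way to redirect is to
# add  log.output: "/app/config/node.log"  to config.yaml (option name from memory),
# then restart the container so new lines go to that file
docker restart storagenode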


Ya, it’s a Linux node. After we get through this, I’ll update to redirect the logs.

At any rate, running the above:

[storj@nas ~]# grep "GET_AUDIT" docker logs storagenode | grep "failed"
grep: docker: No such file or directory
grep: logs: No such file or directory
grep: storagenode: No such file or directory

I tried changing around the command and see:

[storj@nas ~]# docker logs storagenode 2>&1 | grep "GET_AUDIT" | grep "failed"

2020-10-18T11:15:46.317Z ERROR piecestore download failed {"Piece ID": "FC6ORW5TNM6WEZNSYCDH3DP22SEML7WTCS376TUS4DCOJB2AU5BQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:505\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

2020-10-18T13:16:50.712Z ERROR piecestore download failed {"Piece ID": "KPVFLEOCCLAJKC77JAUWEGHYBHINNZOKYTV7OELMPH2SFSGSGNZQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:505\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Which looks like there are two entries of “file does not exist”.

So is there any way to correct this? Like somehow have my node go grab the missing files?
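As an aside, a rough way to see how many distinct pieces are affected, rather than how many failed audit attempts got logged (same assumption as above that the container is named storagenode):

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed | grep -oE '"Piece ID": "[^"]+"' | sort -u | wc -l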

Right. Sorry, your command is correct. I wasn’t thinking and gave you the command based on redirected logs :man_facepalming:

There is no way to have these files restored; once they are gone, they are gone. If it is a small number of files, your node should survive. Did you make any changes recently? You should definitely check your filesystem for errors.
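A sketch of the usual checks, assuming Linux software RAID (mdadm) and a conventional filesystem; the device and mount point are placeholders, and a hardware RAID controller or NAS appliance will have its own tools instead:

docker stop storagenode   # stop the node while the volume is checked

cat /proc/mdstat          # RAID state, if this is mdadm software RAID

umount /mnt/storj         # placeholder mount point for the storage volume
fsck -n /dev/md0          # read-only check; replace /dev/md0 with your device
mount /mnt/storj          # remount (assumes an fstab entry)

docker start storagenode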

I had a disk fail (I’m running on a RAID10), but got it replaced. I also had an unexpected power loss.

Either way, I’ll keep an eye on things. Sounds like there isn’t anything I can do at this point other than just wait it out.

So in the case where the file was not found, does something get notified so that in the future it doesn’t keep asking me for that file?

What would be cool, and I’m not sure if it would be possible, is if some of my earnings could be used to pay to get the missing pieces from someone that has them. Kind of a self-healing setup, where my node grabs the missing data from another node. Since the data should exist on other nodes, and having consistency amongst nodes with those pieces would be good, some repair system like that would be a neat feature.

Eventually, when there are enough missing pieces (the number available falls below the repair threshold), the file will be rebuilt, and at that point the satellite will no longer try to audit that missing piece on your node. Until then your node could be audited for that piece again, but statistically that is fairly unlikely. You would have to be missing a fair number of pieces to be DQ’d.
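To put rough numbers on that, purely for illustration (these are not the network’s actual Reed-Solomon settings):

# a segment is split into n pieces, any k of which can rebuild it;
# repair kicks in once the number of healthy pieces drops to the threshold r
k=29; n=80; r=35
echo "pieces that can go missing before repair starts:  $((n - r))"
echo "pieces the satellite needs to rebuild the segment: $k"

With these made-up numbers a single segment could lose dozens of pieces before repair even starts, which is why a couple of missing pieces on one node rarely matters.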

This has been talked about a lot in the past, but the network repair process already takes care of this (eventually). My loose understanding of how erasure coding works is that the satellite would have to collect the other pieces anyway to reinstate yours, so the computational power/bandwidth would be the same either way. So it makes more sense to do this at the point where the number of available pieces drops below the repair threshold.

The system is designed so that a few file errors here and there can be tolerated and your node shouldn’t be DQ’d. If you are missing enough pieces that you fail enough audits to get DQ’d, your node should no longer be trusted anyway. So while on the surface it may seem advantageous to let nodes repair missing pieces on their own, it actually doesn’t make sense to do it that way.

Here is a good blog post if you are interested in a quick overview of how erasure coding is used on the network.