Missing Piece before the wipe? - Critical audit alert

I have always stopped this node with stop -t 300, except when letting watchtower do it.
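For reference, the full command (assuming the default container name storagenode):

# Give the node up to 300 seconds to finish cleanly before the container is killed
docker stop -t 300 storagenode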

My network connection has been flawless since the wipe, so I suspect that watchtower is causing the missing pieces, but because I don’t dump my log files anywhere else yet, I can’t see when the download was previously attempted.
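Until I set up proper log shipping, something like this would at least keep a copy outside the container (just a sketch; the path and any rotation are up to you):

# Save the current container log to a dated, compressed file so the history
# survives the container being re-created (e.g. by watchtower)
docker logs storagenode 2>&1 | gzip > /var/log/storagenode-$(date +%F).log.gz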

storj@fractal:~$ sudo docker logs storagenode 2>&1 | grep R3DUH
2019-07-22T05:16:51.381Z        INFO    piecestore      download started        {"Piece ID": "R3DUH5HHLKGU5QO7BBJT6SLUXD77JVUL54GI7J77IH7TSBJQ3TBA", "SatelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT"}
2019-07-22T05:16:51.382Z        INFO    piecestore      download failed {"Piece ID": "R3DUH5HHLKGU5QO7BBJT6SLUXD77JVUL54GI7J77IH7TSBJQ3TBA", "SatelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/r3/duh5hhlkgu5qo7bbjt6sluxd77jvul54gi7j77ih7tsbjq3tba: no such file or directory"}

I can’t figure out what caused the high load. The critical audit alert appeared together with high IOwait. I can’t tell what the server was doing; maybe it was getting DoS’d from outside, or the storj database was busy. Traffic didn’t increase, and the number of PUT/GETs started each minute held steady.
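Next time it spikes I’ll try to catch the culprit with something like this (a sketch; iostat and pidstat come from the sysstat package):

# Per-device utilisation and wait times, refreshed every 5 seconds
iostat -dx 5
# Per-process disk I/O, to see whether it is the storagenode or something else
pidstat -d 5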

The r3 folder does not exist:
config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/r3/duh5hhlkgu5qo7bbjt6sluxd77jvul54gi7j77ih7tsbjq3tba


Hi, since yesterday the volume of data and used space has stopped growing.

But I still haven’t hit my bandwidth limit, and I’m far from reaching my total allocated size.

I see some entries like this in my log file: ERROR rpc error: code = NotFound desc = open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/j6/dq5orrp6zwddrigik3jhns5n32pziwynp4mkraix56cwpqphfa: no such file or directory

But only a few of them.

And I have this one too: ERROR: 2019/07/24 05:03:51 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing

Can anyone help me?

Regards

What OS are you using, and could you post the run command you are using?

The no such file or directory errors are serious: they mean you are missing data. Make sure you are using the --mount syntax for your mounts and not -v. See https://documentation.storj.io/ for the full syntax; it’s different from the -v syntax.
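The difference matters because -v silently creates an empty directory if the source path does not exist, while --mount refuses to start the container. Roughly like this (the paths here are placeholders; the documentation has the complete run command):

# -v style: a typo or missing directory silently gives the node an empty folder
-v /mnt/storj/identity:/app/identity
-v /mnt/storj/storage:/app/config

# --mount style: docker errors out if the source path does not exist
--mount type=bind,source="/mnt/storj/identity",destination=/app/identity
--mount type=bind,source="/mnt/storj/storage",destination=/app/config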


regarding the missing pieces on sat 121…

Although my node hasn’t been offline, forcefully restarted, or subjected to similar harmful things, I get the occasional unrecoverable audit failure because of “lost” files:

download failed {"Piece ID": "7WWAHA5Y4P7F5GU5IXLNLJC2GNZR52IEYPYBNY2XTN4UMBCYD7JA", "SatelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7w/waha5y4p7f5gu5ixlnljc2gnzr52ieypybny2xtn4umbcyd7ja: no such file or directory"}

This seems to be out of my control and the failure rate is very low.

But my questions are:
As I understand it from chats and other threads, soon even a single unrecoverable audit failure will lead to a node being disqualified from that satellite. Is there no way to recover from a failed audit, or could the client periodically scan whether it still has all the files it is supposed to have?
Will my node survive the (hopefully) few files that are missing, or will the satellites keep asking for them?

Of course it shouldn’t happen that I lose any file, and I honestly don’t know why I would have lost it, but that’s just what happened since the last upgrade. The file (or even its directory) definitely doesn’t exist on my HDD, and I’m quite sure it didn’t just delete itself.
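In the meantime, a rough self-check like this can list which audited pieces are actually gone from disk (just a sketch: it assumes the default container name, GNU grep for -oP, and that the storage directory is mounted at /mnt/storj on the host):

# Pull the blob paths out of failed audit lines and test whether they exist on the host
docker logs storagenode 2>&1 \
  | grep GET_AUDIT | grep failed \
  | grep -oP 'open config/\K[^:]+' \
  | while read -r p; do
      [ -f "/mnt/storj/$p" ] || echo "missing: $p"
    done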

Not sure if they have an update for that, but I’ve not seen any unrecoverable failures since 16.1 yet.

Oh thanks a lot. Didn’t know that this was a known issue. Then I’ll just wait.
(Searching for whether the piece was ever uploaded is going to be a problem, since I flush my logs regularly; maybe I could find it in the DB, but I haven’t looked into that.)

update:


I got another one in v17. The missing piece must have occurred before v17, because I have no history of an earlier attempt to download it.

2019-08-19T12:02:40.041138454Z 2019-08-19T12:02:40.040Z INFO piecestore download started {"Piece ID": "42HAXJHIKGAJEH33TVFHQGOT5CAWNLSIHWOS5HKOKLTVT7UGS4GA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-08-19T12:02:40.042278991Z 2019-08-19T12:02:40.042Z INFO piecestore download failed {"Piece ID": "42HAXJHIKGAJEH33TVFHQGOT5CAWNLSIHWOS5HKOKLTVT7UGS4GA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/42/haxjhikgajeh33tvfhqgot5cawnlsihwos5hkokltvt7ugs4ga: no such file or directory"}

Added a link to this thread to the issue.

I keep getting 1 failed audit, and I don’t think it’s an issue with my storage.

Running this command

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed | grep open | grep 118

I get

2019-08-19T08:16:56.392Z        INFO    piecestore      download failed {"Piece ID": "MGJLK54AOITWVJBAHXF3LCUNDAXTD7RC226WFBUFYYESM72LB2IQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/mg/jlk54aoitwvjbahxf3lcundaxtd7rc226wfbufyyesm72lb2iq: no such file or directory"}

Running

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed | grep open

I get the same thing.
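To double-check whether it’s only satellite 118, I can also count the failures per satellite like this (a sketch; the -oP flag assumes GNU grep):

# Count failed GET_AUDIT downloads per satellite ID
docker logs storagenode 2>&1 \
  | grep GET_AUDIT | grep 'download failed' \
  | grep -oP '"SatelliteID": "\K[^"]+' \
  | sort | uniq -c | sort -rn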

Thanks

The developers are investigating that issue; it looks like a bug, but nothing is wrong on your end. I will update when I have more info 🙂

I have received 3 unrecoverable audits and 7 recoverable audits since the last update, and they are all from satellite 118. Is that a problem with my node, Houston? 😨

Yes. This looks like a problem with your storage.
Do you use NFS or BTRFS?
Have you moved or reconfigured your node recently?
Have you replaced the -v option with --mount?

Do you have a full log?

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed | grep open | grep 118

HFS+. I started the node 24 hours after the network wipe and changed from -v to --mount two days after that. As of today I have almost 20,000 successful audits; the seven unrecoverable ones still remain in that status, and they are only from satellite 118. Yes, I have full logs, info level only (not debug); they are big, as I use the --follow argument.

+1 - I have also noticed increased audit failures. I was wondering if it has to do with the timeout setting when watchtower stops/restarts the docker container?

Have you changed it? We recently updated the watchtower parameters; please update them: https://documentation.storj.io/getting-started/setup-a-storage-node#automatic-updates
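The key change is a longer stop timeout, so watchtower gives the node time to shut down cleanly before updating it. Roughly like this (check the linked documentation for the exact current command):

docker run -d --restart=always --name watchtower \
    -v /var/run/docker.sock:/var/run/docker.sock \
    storjlabs/watchtower storagenode watchtower --stop-timeout 300s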

Thanks @Alexey - no, I hadn’t noticed the watchtower setting had changed. I have updated it now. The audit failures are for missing files/directories. Is this the bug discussed above, or an actual issue? Less than 1% are failed unrecoverable audits.

It could be a bug. We need full logs from before and after the wipe; nobody has provided them to us yet.