I have always stopped this node with stop -t 300, except when allowing watchtower to do it.
My node has been running cleanly since the wipe, so I suspect that watchtower is causing the missing pieces, but because I don’t dump my log files anywhere else yet, I can’t see when the download was attempted.
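One way to keep log history around so this can be checked later (the path and schedule here are just examples, not an official recommendation) is to periodically append the container log to a file outside Docker:

```shell
# Append the last hour of the container's log to a persistent file
# (example path; run from cron, e.g. hourly, so entries survive the
# container being recreated on update).
docker logs --since 1h storagenode >> /var/log/storagenode.log 2>&1
```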
storj@fractal:~$ sudo docker logs storagenode 2>&1 | grep R3DUH
2019-07-22T05:16:51.381Z INFO piecestore download started {"Piece ID": "R3DUH5HHLKGU5QO7BBJT6SLUXD77JVUL54GI7J77IH7TSBJQ3TBA", "SatelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT"}
2019-07-22T05:16:51.382Z INFO piecestore download failed {"Piece ID": "R3DUH5HHLKGU5QO7BBJT6SLUXD77JVUL54GI7J77IH7TSBJQ3TBA", "SatelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/r3/duh5hhlkgu5qo7bbjt6sluxd77jvul54gi7j77ih7tsbjq3tba: no such file or directory"}
I can’t figure out what caused the high load. The critical audit alert appeared along with high iowait. I can’t tell what the server was doing, other than maybe it was being DoS’d from outside, or the Storj database was busy. Traffic didn’t increase, and the number of PUTs/GETs started per minute held steady.
The r3 folder does not exist:
config/storage/blobs/qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa/r3/duh5hhlkgu5qo7bbjt6sluxd77jvul54gi7j77ih7tsbjq3tba
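For what it’s worth, the on-disk path in that error can be derived from the Piece ID itself: judging by the log line above, the first two characters (lowercased) become the subfolder and the lowercased remainder becomes the filename. A quick sketch to build the expected path and check whether a piece is actually on disk (the satellite folder name is copied from the log; the mapping is inferred, not from official docs):

```shell
# Derive the expected blob path for a piece ID (mapping inferred from the
# audit log line above: 2-char lowercase prefix folder + lowercase rest).
piece_id="R3DUH5HHLKGU5QO7BBJT6SLUXD77JVUL54GI7J77IH7TSBJQ3TBA"
sat_dir="qstuylguhrn2ozjv4h2c6xpxykd622gtgurhql2k7k75wqaaaaaa"  # from the log

lower=$(printf '%s' "$piece_id" | tr '[:upper:]' '[:lower:]')
prefix=$(printf '%s' "$lower" | cut -c1-2)
rest=$(printf '%s' "$lower" | cut -c3-)
path="config/storage/blobs/$sat_dir/$prefix/$rest"
echo "$path"
```

Then check on the host with `ls -l "$path"` (relative to your storage mount point) to confirm whether the file really is gone.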
Hi, since yesterday the volume of data and used space stopped growing.
But I still haven’t hit my bandwidth limit, and I’m far from reaching my total allocated size.
I have some entries like this in my log file: ERROR rpc error: code = NotFound desc = open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/j6/dq5orrp6zwddrigik3jhns5n32pziwynp4mkraix56cwpqphfa: no such file or directory
But just a few of them.
And I have this one too: ERROR: 2019/07/24 05:03:51 pickfirstBalancer: failed to NewSubConn: rpc error: code = Canceled desc = grpc: the client connection is closing
What OS are you using, and could you post the run command you are using?
The “no such file or directory” errors are serious: they mean you are missing data. Make sure you are using the --mount syntax for your mounts and not -v. See https://documentation.storj.io/ for the full syntax; it’s different from the -v syntax.
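To illustrate the difference (paths, names and image tag here are placeholders; follow the documentation link for the full run command): with -v, Docker silently creates an empty directory if the host path is missing, so the node can come up with an empty storage folder and appear to have lost every piece, whereas --mount with type=bind refuses to start.

```shell
# Fragile: if /mnt/storj/storagenode is missing (disk not mounted yet),
# -v silently creates an empty directory and the node starts "empty":
#   docker run -d --name storagenode \
#     -v /mnt/storj/storagenode:/app/config \
#     storjlabs/storagenode:latest

# Safer: --mount type=bind fails fast if the source path does not exist.
docker run -d --name storagenode \
  --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/storj/storagenode,destination=/app/config \
  storjlabs/storagenode:latest
```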
Although my node hasn’t been offline, forcefully restarted or similar harmful things, I get the occasional unrecoverable audit failure because of “lost” files:
download failed {"Piece ID": "7WWAHA5Y4P7F5GU5IXLNLJC2GNZR52IEYPYBNY2XTN4UMBCYD7JA", "SatelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7w/waha5y4p7f5gu5ixlnljc2gnzr52ieypybny2xtn4umbcyd7ja: no such file or directory"}
This seems to be out of my control and the failure rate is very low.
But my questions are:
As I understand it from chats and other threads, soon even one unrecoverable audit failure will lead to a node being disqualified from that satellite. Is there no way to recover from a failed audit, or to have the node periodically scan whether it still has all the files it is supposed to have?
Will my node survive the (hopefully) few files that are missing, or will the satellites keep asking for them?
Of course I shouldn’t lose any file, and I honestly don’t know why I lost this one, but that’s just what has happened since the last upgrade. The file (and even its directory) definitely doesn’t exist on my HDD, and I’m quite sure it didn’t just delete itself.
Oh thanks a lot. Didn’t know that this was a known issue. Then I’ll just wait.
(Searching for whether it was ever uploaded is going to be a problem, since I flush my logs regularly; maybe I could find it in the DB, but I haven’t looked into that.)
I have received 3 unrecoverable audits and 7 recoverable audits since the last update, and they are all from satellite 118. Is that a problem with my node, Houston? :fearful:
Yes. This looks like a problem with your storage.
Do you use NFS or BTRFS?
Have you moved or reconfigured your node recently?
Have you replaced the -v option with --mount?
HFS+. I started the node 24 hours after the network wipe, and two days after that I changed from -v to --mount. As of today I have almost 20,000 successful audits, and the seven unrecoverable ones still remain in that status; they are only from satellite 118. Yes, I have full logs (info level only, not debug); they are big, as I use the --follow argument.
Thanks @Alexey - no, I hadn’t noticed the watchtower setting; I have updated that now. The audit failures are for missing files/directories. Is this the bug discussed above, or an actual issue? Less than 1% of my audits are failed unrecoverable ones.