Missing Piece before the wipe? - Critical audit alert

Yep - I’ve removed containers a number of times, so I have lost logs. What’s the best way to retain these between restarts or when recreating the container?

This question is answered here: https://documentation.storj.io/resources/frequently-asked-questions#how-do-i-redirect-my-logs-to-a-file :slight_smile:

I really suggest you read this tutorial :slight_smile:
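In practice, that FAQ boils down to pointing the node’s log at a file inside the mounted config directory, so the log lives on the host and survives container recreation. A minimal sketch, assuming the log.output option from the linked FAQ and illustrative host paths:

# stop the node gracefully, then add this line to config.yaml in your mounted storage directory
# (the path is the in-container path, so the file actually lands on the host):
#   log.output: "/app/config/node.log"
docker stop -t 300 storagenode
docker rm storagenode
# ... start the node again with your usual docker run command ...

# the log now persists across restarts and container recreation:
tail -f /mnt/storagenode/node.log    # illustrative path inside your mounted directory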


Thanks - will do it now.

@BrightSilence has them

I just sent you a link on Rocket.Chat to logs going back to June 14th, well before the wipe. I don’t have all logs, but I doubt anyone does. I only got one audit for a piece I never received (at least since June 14th); I mentioned the missing piece ID and my node ID as well. Let me know if there is more I can give you to help with this.


Node ID is 129qUGoVijekDhhj6MVEwcQCrMQsv6DZYoAUay66c9MyzM65qjp
Yesterday I noticed 1 unrecoverable audit.

stat

========== AUDIT =============
Successful: 736
Recoverable failed: 1
Unrecoverable failed: 1
Success Rate Min: 99.729%
Success Rate Max: 99.864%
========== DOWNLOAD ==========
Successful: 0
Failed: 9
Success Rate: 0.000%
========== UPLOAD ============
Successful: 28
Rejected: 0
Failed: 6
Acceptance Rate: 100.000%
Success Rate: 82.353%
========== REPAIR DOWNLOAD ===
Successful: 0
Failed: 0
Success Rate: 0.000%
========== REPAIR UPLOAD =====
Successful: 0
Failed: 0
Success Rate: 0.000%
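(For context on those percentages: if I read the successrate script right, the Min rate counts both failure types against you and the Max rate counts only the unrecoverable ones, which matches the numbers above: 736 / (736 + 1 + 1) ≈ 99.729% and 736 / (736 + 1) ≈ 99.864%.)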

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed | grep open | grep 118

2019-08-23T21:39:18.551Z        INFO    piecestore      download failed {"Piece ID": "HBEQWRHL4IFCPIBBZU2M4ZFLBZZYWOJ7X6U33CGOJITVXDU5WTPQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/hb/eqwrhl4ifcpibbzu2m4zflbzzywoj7x6u33cgojitvxdu5wtpq: no such file or directory"}

I run my node in Docker, inside an unprivileged LXC container on a Proxmox host.
The node was recreated recently with --mount.
All settings follow the instructions here: https://documentation.storj.io/setup/storage-node
Is my node fine?

Hello @vanees,
Welcome to the forum!
I have added your information to the issue.
Do you have full logs from before July 15?

Unfortunately, I don’t.

I’ve hit the same issue during the last few days:

Node ID: 12LcaN37vw6V3HSm7c9zrLQEWBgDFTzmm6FFihWQoLrmfNtbygq

root@docker02:/home/odmin# ./AuditsSatellites.sh
Sat ID: 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW
Unrecoverable Failed Audits: 1
Recoverable Failed Audits: 1
Successful Audits: 4179
Sat ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 3324
Sat ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 3036
Sat ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 4076

This node has all logs since the last update:
docker logs storagenode 2>&1 | more

Trying to find the issue:

docker logs storagenode 2>&1 | grep GET_AUDIT | grep 'download failed' | awk '{print $8 $10}' | sort | uniq -c | sort -g -r

  1 "PD7HXPQG2QTXN3FZU2LWA4LP57DPZ4VT5DSJTSESVP5WCF64TBWA","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  1 "B57MQUMGJT6WOC3ZD4LKPNXZ53NL3JHZ4WLCBKCBZPA2XVMZFPIA","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",

docker logs storagenode 2>&1 | grep "PD7HXPQG2QTXN3FZU2LWA4LP57DPZ4VT5DSJTSESVP5WCF64TBWA"

2019-08-24T01:00:28.602Z INFO piecestore download started {"Piece ID": "PD7HXPQG2QTXN3FZU2LWA4LP57DPZ4VT5DSJTSESVP5WCF64TBWA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-08-24T01:00:28.602Z INFO piecestore download failed {"Piece ID": "PD7HXPQG2QTXN3FZU2LWA4LP57DPZ4VT5DSJTSESVP5WCF64TBWA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/pd/7hxpqg2qtxn3fzu2lwa4lp57dpz4vt5dsjtsesvp5wcf64tbwa: no such file or directory"}
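To check whether the node ever received that piece in the first place (within the retained log window), you can also grep the piece ID for an upload entry; a sketch, assuming uploads show up with Action "PUT" in the same log format as above:

docker logs storagenode 2>&1 | grep "PD7HXPQG2QTXN3FZU2LWA4LP57DPZ4VT5DSJTSESVP5WCF64TBWA" | grep '"PUT"'
# no output would mean there is no upload record for this piece in the retained logs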

Another example, from another node in a different location:

Node ID: 12EEcfwBgnh6K5t1CtSHDyoauLaJ4nAsWPs9Z2zNJ7tvzdZBhhU

root@docker03:~/odmin# ./AuditsSatellites.sh
Sat ID: 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW
Unrecoverable Failed Audits: 2
Recoverable Failed Audits: 4
Successful Audits: 4110
Sat ID: 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 3731
Sat ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 2697
Sat ID: 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs
Unrecoverable Failed Audits: 0
Recoverable Failed Audits: 0
Successful Audits: 3592

This node has all logs since the last update:
docker logs storagenode 2>&1 | more

Trying to find the issue:
docker logs storagenode 2>&1 | grep GET_AUDIT | grep 'download failed' | awk '{print $8 $10}' | sort | uniq -c | sort -g -r

  1 "RILONCFWNK64LDTYLCOIO63H43TYLLORIKBJGWGBI5TRXS675KIQ","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  1 "JL7G2BUWPH2CBSX4E2G37JJCRKXK2XAV2BO2LVUOQ2HYNCAZLD6Q","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  1 "H2RKSGPC237AVS3UWHQRFLIG3KJAJXUQ5F76JIREJHVILY36JI4A","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  1 "BCSTGVRAHLBR52542WYVQKNSBNNUPPAI5HDAZKQRBQHETFQPP7KA","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  1 "76D2Z6UU5ETJP6HYO647YIMB4OOH4X6MOYEBWYQZFQLZ453YYGGQ","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  1 "672V3EWD62DN4H6XNCNW3X3U7JP6RIWHBWJZI75FHUVZLHTIK5TA","118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",

docker logs storagenode 2>&1 | grep "H2RKSGPC237AVS3UWHQRFLIG3KJAJXUQ5F76JIREJHVILY36JI4A"

2019-08-17T14:43:17.332Z INFO piecestore download started {"Piece ID": "H2RKSGPC237AVS3UWHQRFLIG3KJAJXUQ5F76JIREJHVILY36JI4A", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-08-17T14:43:17.340Z INFO piecestore download failed {"Piece ID": "H2RKSGPC237AVS3UWHQRFLIG3KJAJXUQ5F76JIREJHVILY36JI4A", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/h2/rksgpc237avs3uwhqrflig3kjajxuq5f76jirejhvily36ji4a: no such file or directory"}

docker logs storagenode 2>&1 | grep "BCSTGVRAHLBR52542WYVQKNSBNNUPPAI5HDAZKQRBQHETFQPP7KA"

2019-08-22T00:43:50.098Z INFO piecestore download started {"Piece ID": "BCSTGVRAHLBR52542WYVQKNSBNNUPPAI5HDAZKQRBQHETFQPP7KA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-08-22T00:43:50.099Z INFO piecestore download failed {"Piece ID": "BCSTGVRAHLBR52542WYVQKNSBNNUPPAI5HDAZKQRBQHETFQPP7KA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = open config/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/bc/stgvrahlbr52542wyvqknsbnnuppai5hdazkqrbqhetfqpp7ka: no such file or directory"}

Also, let’s look at the recoverable audit:

docker logs storagenode 2>&1 | grep "RILONCFWNK64LDTYLCOIO63H43TYLLORIKBJGWGBI5TRXS675KIQ"

2019-08-19T05:18:07.343Z INFO piecestore download started {"Piece ID": "RILONCFWNK64LDTYLCOIO63H43TYLLORIKBJGWGBI5TRXS675KIQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-08-19T05:18:07.363Z INFO piecestore download failed {"Piece ID": "RILONCFWNK64LDTYLCOIO63H43TYLLORIKBJGWGBI5TRXS675KIQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "piecestore: piecestore protocol: rpc error: code = DeadlineExceeded desc = context deadline exceeded", "errorVerbose": "piecestore: piecestore protocol: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func4:539\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:562\n\tstorj.io/storj/pkg/pb._Piecestore_Download_Handler:727\n\tstorj.io/storj/pkg/server.(*Server).logOnErrorStreamInterceptor:23\n\tgoogle.golang.org/grpc.(*Server).processStreamingRPC:1134\n\tgoogle.golang.org/grpc.(*Server).handleStream:1185\n\tgoogle.golang.org/grpc.(*Server).serveStreams.func1.1:703"}

Neither node has issues like this:

docker logs storagenode 2>&1 | grep -B 10 -A 10 "serial number is already used: infodb: database is locked" | tail

So I think satellite 118 is asking the nodes for data that does not exist.
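One way to double-check that is to look for the blob directly on the host. Judging by the error paths above, a piece is stored under the satellite’s blobs folder in a subdirectory named after the first two characters of the lower-cased piece ID, with the rest as the file name. A sketch with illustrative host paths:

# config/storage inside the container corresponds to the mounted storage directory on the host
ls -l /mnt/storj/storage/blobs/abforhuxbzyd35blusvrifvdwmfx4hmocsva4vmpp3rgqaaaaaaa/pd/
# or search the whole blobs tree for the remainder of the piece ID (wildcard in case of a file extension)
find /mnt/storj/storage/blobs -iname '7hxpqg2qtxn3fzu2lwa4lp57dpz4vt5dsjtsesvp5wcf64tbwa*'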


2 posts were split to a new topic: Missing link to the documentation

@Alexey, it looks like I spoke too soon. The issue with one failed audit repeating has come back, possibly while I was putting the node into shutdown mode to remove the node.log entry from the config file (see the other thread regarding the node.log file not updating, which I wasn’t able to replicate a second time, so I removed the config entry).

I had no failed audits prior to this change (I saved a copy of the full node.log file before removing the setting from the config).

Six hours later, during a quick log check, I found 1 failed audit from satellite 118U. I’ve checked the new logs and found no upload entry for this piece ID; there is only a failed audit entry.

sudo docker logs storagenode 2>&1 | grep GET_AUDIT | grep "download failed"
2019-09-15T11:10:58.406Z        INFO    piecestore      download failed {"Piece ID": "ZM3HYYURHZWNRSKGL46FGIOZ4GHDSPVHTW3GAEIVSPKJ5F5S7IKA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}
 sudo docker logs storagenode 2>&1 | grep "ZM3HYYURHZWNRSKGL46FGIOZ4GHDSPVHTW3GAEIVSPKJ5F5S7IKA"
2019-09-15T11:10:58.405Z        INFO    piecestore      download started        {"Piece ID": "ZM3HYYURHZWNRSKGL46FGIOZ4GHDSPVHTW3GAEIVSPKJ5F5S7IKA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-09-15T11:10:58.406Z        INFO    piecestore      download failed {"Piece ID": "ZM3HYYURHZWNRSKGL46FGIOZ4GHDSPVHTW3GAEIVSPKJ5F5S7IKA", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}

My audit score was 1 for this satellite; due to this new issue it’s now down to 0.69 :frowning:

I have a feeling that either the docker stop command is not gracefully shutting down the node, or the satellite thinks a piece was uploaded to my node even though it never actually received confirmation of the upload.

Can one of the data engineers investigate?

When shutting down your node, use

docker stop -t 300 storagenode

The 300 is the number of seconds Docker will wait for your node to shut down gracefully.
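The same flag exists for docker restart, so it is worth using for manual restarts and updates as well; if you use Watchtower or docker-compose, they have their own stop-timeout settings that may also need raising (check the docs for your versions):

# give the node up to 5 minutes to finish in-flight transfers before Docker kills it
docker restart -t 300 storagenode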

@Sasha, please look into this post to prevent this situation in the future.


Hi @Odmin, I use the Docker UI to stop the container; Watchtower, however, is still running. It takes roughly between 5 and 60 seconds to stop the storagenode container.

When the Docker UI sends a stop command to the container, how does the satellite uploading data to the node know the storagenode is about to go offline? If it doesn’t know and the node simply goes offline, then setting the shutdown timeout to 10, 30, 300 or 500 seconds wouldn’t make a difference, would it?

I’m happy to try to re-create this and break it by stopping and starting the node, but I would need the devs to implement some additional logging to capture the exact cause and behavior of what happens.

Note: Using Synology NAS with Docker.

It does seem very odd that the advice here is “use a higher timeout when stopping the container.” If the storagenode software is correctly written, you should even be able to yank the power cable without this happening: the node should not return success to the uploader/satellite until the data has been properly committed to storage (fsync(), for example, across all relevant files, as well as any database writes necessary to catalog the presence of the piece).

If this is correctly done, it should not matter whether the container is terminated gracefully or not – either the piece is there, or the uploader should never have received confirmation that the upload to this node was successful.

So…

  • If storagenode handles this situation properly, the advice to use -t 300 isn’t helpful at all and is actually harmful: it disguises the real issue by incorrectly placing blame on the node operator and discourages operators/developers from looking for the real problem.
  • If storagenode does not handle this situation properly, this is a serious bug that should be fixed ASAP.

Either way, it’s not something the node operator should be responsible for.

This would be like saying that if a DB server returns success for a transaction commit, the DBA then stops the server mid-process, and the transaction turns out to be lost, it’s the DBA’s fault – no, the DB server software should not have returned success until the changes were guaranteed to have been persisted in some form.


Yes, that is why I am wondering about this myself.
Even in a power-outage scenario, the UPS will kick in and is configured to immediately start shutting down the NAS to ensure work is saved, but this shutdown may not be handled correctly by the storagenode Docker container. I really am wondering whether this could be the culprit behind the “data never uploaded while the satellite thinks it was” scenario.

What is odd, though, is that in my recent example the logs don’t show an actual request to upload the piece ID, only an audit request for it.

Hi @Sasha
The satellite handles node shutdown correctly, but Docker’s default stop timeout is only 10 seconds for any container. Under heavy load that isn’t enough: the container can be killed during an upload (PUT), leaving the upload in an undefined state. I changed this default timeout, haven’t had the issue since, and shared that information in the post linked above.

I also ran into a situation with the UPS shutdown signal, which can likewise be fixed by extending the Docker default timeout.

Thanks @Odmin, but the question was: does it handle a storagenode going offline abruptly? @cdhowie made a good point. Think about worst-case scenarios where an SNO shuts down the node within 10 seconds, power is lost, or a LAN/internet connectivity issue occurs, etc.

In such cases, does the satellite behave as expected? Or are we onto something here, and could this be the cause of a satellite trying to audit a piece ID that was never successfully uploaded to the storage node in the first place?

Hi @Sasha
I’m talking only about abnormal container termination (a container kill), and that is the only case I was addressing.
All other scenarios are handled successfully by the storagenode and the satellite; you can check it yourself and share your results :wink: (I have already tested it myself and shared the solution with you.)