Failed audit on new satellite according to API but nothing in logs

rml52 · April 21, 2020, 6:12pm

Mine is a Dell PowerEdge R510 server with SAS drives.

xyphos10 · April 21, 2020, 7:14pm

Mine is connected via SATA.

rml52 · April 21, 2020, 8:11pm

So with my limited knowledge of logs the first one was downloaded successfully to my node. The second one was not downloaded to my node because I lost the race and context was cancelled. Correct?

Then both were audited once. Should the second one have been audited as I didn’t actually download it?

Are there more searches or results you need?
Is this a satellite issue then?

littleskunk · April 21, 2020, 8:43pm

Was not downloaded successfully. The success message is missing.

The second upload was canceled, but that doesn’t mean you lost the race. Note: there is a false negative rate on canceled uploads. If an upload was successful, you have won the race. If the upload was canceled, there is still a small chance that you won the race at the last second but didn’t receive the confirmation. Depending on when the upload was canceled, the storage node will keep the piece and wait for the next garbage collection to find out whether it won the race or not.

I can’t really do much here. From that perspective I don’t need anything else. I have dropped the ball and hope that someone else can dig deeper.

No. Your storage node has returned an error and we can’t blame the satellite for that. If someone wants to dig deeper I would suggest looking into the storage node. Why is it not printing out the audit error? That should be a simple improvement to get a better understanding of what is going on here. The real question is why is the database locked and how can we prevent that. (My last words before unsubscribing from this topic)

rml52 · April 21, 2020, 8:56pm

@Alexey Are you available to help me troubleshoot my node and find out why it (and other people’s nodes) has failed audits on the new satellite when, to my knowledge, it hasn’t failed any before?

Do you have log search commands or anything you would like me to run?
From reading the replies, the dashboard can be misleading about actual failed audits, but the satellite logs show my node and a couple of others reporting that a database is locked.
I have not rebooted or touched my node since update to 1.1.1
To my knowledge, without knowing good log searches, I have not failed any other audits on other sats.
When I run successrate.sh it says 100% no issues.

Alexey · April 21, 2020, 9:52pm

I do not have any besides the ones you mentioned.

I suggest changing the log level from info to debug in config.yaml and restarting the storagenode (do not recreate the container, just restart it):

docker restart -t 300 storagenode

In that case if it would happen again, then we could have more info.
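
For reference, the edit could look roughly like the sketch below. It assumes the setting appears in config.yaml as a line reading `log.level: info`; the helper name `set_debug_level` is made up for this example, so check your own file before running anything like it.

```shell
#!/bin/sh
# Hypothetical helper: switch "log.level: info" to "log.level: debug"
# in the given config.yaml. Assumes the key is written as "log.level".
set_debug_level() {
    cfg="$1"
    # Write to a temp file first so a failed sed does not truncate the config.
    sed 's/^log\.level:[[:space:]]*info[[:space:]]*$/log.level: debug/' "$cfg" > "$cfg.tmp" \
        && mv "$cfg.tmp" "$cfg"
}
```

Call it with the path to your own config.yaml as the only argument, then restart the node with the command above. Switching back is the same edit in reverse.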

P.S. I do not see the “database is locked” error for any database except orders.db.

rml52 · April 21, 2020, 10:59pm

So I looked around in logs the best I could.
I grep’d all audits from the new sat and found the two skunk mentioned, also the only two that didn’t finish.
I grep’d the time those happened and see no abnormal log entry around that time
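
A rough way to do that search in one pass is sketched below. It assumes each log line carries the action `GET_AUDIT`, a `"Piece ID": "..."` field, and the messages `download started` and `downloaded`; the function name `unfinished_audits` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: list piece IDs whose audit download started
# but never logged a matching "downloaded" line.
# Assumes log lines contain GET_AUDIT and a "Piece ID": "..." field.
unfinished_audits() {
    awk '
        /GET_AUDIT/ && match($0, /"Piece ID": *"[^"]*"/) {
            # Cut the piece ID out of the matched field.
            id = substr($0, RSTART, RLENGTH)
            sub(/^"Piece ID": *"/, "", id)
            sub(/"$/, "", id)
            if ($0 ~ /download started/) started[id] = 1
            else if ($0 ~ /downloaded/) finished[id] = 1
        }
        END {
            for (id in started) if (!(id in finished)) print id
        }
    ' "$1" | sort
}
```

With the node log saved to a file (for example `docker logs storagenode > node.log 2>&1`), `unfinished_audits node.log` prints one piece ID per line.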

I grep’d the word “locked”, and only see ordersdb and once a bandwidthdb. I did not find any log about any other DB being locked.
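
For anyone who wants to repeat the “locked” search, here is a small sketch that counts the errors per database. It assumes the error text names the database as a word ending in `db` (for example `ordersdb error: database is locked`) and that your `grep` supports `-o`; the name `locked_by_db` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: count "database is locked" errors per database name.
# Assumes the error text names the database as a word ending in "db".
locked_by_db() {
    grep 'database is locked' "$1" \
        | grep -oE '[a-z_]+db' \
        | sort | uniq -c | sort -rn
}
```

Run against a saved log file, it prints a count followed by the database name, most frequent first.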

What will the debug logs show that could help us with this?
Is it ok to leave it in debug for extended time (day or 2 or more) to see if it happens again?
the first one happened at 2020-04-20T16:56:08.306
the second one happened at 2020-04-21T00:56:17.938

Seeing that I have 212 other successful audits, it wouldn’t make me think that any of my DBs are locked or have issues - so was my node just too busy to respond?
Does a download or upload for a customer take a DB lock while it does its thing, and was this a process timing issue?

Alexey · April 22, 2020, 8:44am

It looks more like your drive is too slow to finish working with a database, so the database is still locked when the audit is requested.
This usually happens when the storage is a network-attached drive, is connected over USB 2.0, or is an SMR drive.

rml52 · April 22, 2020, 11:32am

Are the time stamps local time or UTC?
I only have 2 VMs on the server - Storj and a Windows 10 File Share
I totally could have been doing something with a file at 16:56 EST but I would have been long asleep at 00:56 EST.
Well doesn’t sound like there is anything I can do about it.
I will see if it happens again and if so turn on debug.

nerdatwork · April 22, 2020, 12:05pm

^ You are correct, it’s UTC.
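
To see what a log timestamp means in local time, something like the sketch below can help. It assumes GNU `date` (for `-d`) and US Eastern as the local zone; the helper name `utc_to_eastern` is made up for the example. Since US Eastern is on daylight time (UTC-4) in April, both audits above fall on April 20 locally, one in the early afternoon and one in the evening.

```shell
#!/bin/sh
# Hypothetical helper: print a UTC log timestamp in US Eastern time.
# Assumes GNU date (for -d) and US Eastern as the local zone.
utc_to_eastern() {
    # POSIX TZ rule for US Eastern, so no zoneinfo lookup is needed.
    TZ='EST5EDT,M3.2.0,M11.1.0' date -d "$1 UTC" '+%Y-%m-%d %H:%M:%S %Z'
}
```

Usage: `utc_to_eastern '2020-04-20 16:56:08'`.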

xyphos10 · April 23, 2020, 3:45pm

Hopefully the update comes with better logging, because I just got a suspension email. Upon review it is the same situation, where an audit started but did not complete. I checked and the piece does exist on my hard drive. The hard drive is not SMR, but it is connected via USB 3.0.

nerdatwork · April 23, 2020, 4:01pm

What are your HDD usage and CPU usage in %?

xyphos10 · April 23, 2020, 4:14pm

[screenshot: IO usage]

[screenshot: CPU usage]

nerdatwork · April 23, 2020, 5:25pm

Check this KB

Look for “Can I share an external disk drive or other sources?”