Help determining disqualification reason please

marcaccioc · August 24, 2020, 2:49am

I’ve been running a node for 6 months (on Windows) and thought all was well until I got DSQ from Saltlake last night.
I am at 100% for all satellites except, obviously, Saltlake. I have seen Saltlake go to 85% on the dashboard before, but it always went back up. I could never understand what was happening.
I have searched the logs for GET_AUDIT and failed - nothing returned.
I have manually looked through the log and found a database issue: “ordersdb error: database disk image is malformed”. From what I’ve read though, that shouldn’t cause DSQ. Also, it is still happening after the Saltlake DSQ, and the other satellites remain at 100%.
Even my audit success rates were doing well, all above 80%.
Any suggestion as to what I can look for to determine the cause of DSQ?
From what I’ve read, it is now better to start over since Saltlake is the main satellite. Is there any point in keeping the node running on the other satellites for 6 more months and then doing a Graceful Exit?
If I do start over, I will need to figure out why this happened beforehand though.
Is it true that I lose all the held amount for Saltlake?
What happens to the 5TB of data stored? Does it eventually move to trash so the space can be recovered?
thanks

littleskunk · August 24, 2020, 5:53am

You can also get DQed for not responding to audit requests. In your logfile you would see an audit started message and, 5 minutes later, maybe a canceled message. Even if there is no canceled message, you can simply count the number of started vs. success messages. You will also notice that you get the same pieceID 3 times. That is a very good indicator that you are unable to deliver the data in time, for whatever reason.
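The counting described above could be sketched roughly like this in Python. The log line shape, the `"Piece ID"` field, and the `download started` / `downloaded` message texts are assumptions for illustration; check them against a few real lines from your own log and adjust the strings before trusting the numbers.

```python
import re
from collections import Counter

# Assumed (hypothetical) log line shape; adjust to match your own log:
# 2020-08-23T21:14:05.123Z INFO piecestore download started {"Piece ID": "ABC", "Action": "GET_AUDIT"}
PIECE_ID = re.compile(r'"Piece ID":\s*"([^"]+)"')

def summarize_audits(lines):
    """Count audit starts vs. successes and list piece IDs requested more than once."""
    started = Counter()    # piece ID -> number of "started" messages
    succeeded = Counter()  # piece ID -> number of success messages
    for line in lines:
        if "GET_AUDIT" not in line:
            continue  # only audit traffic is of interest here
        match = PIECE_ID.search(line)
        if not match:
            continue
        piece = match.group(1)
        if "download started" in line:
            started[piece] += 1
        elif "downloaded" in line:
            succeeded[piece] += 1
    return {
        "started": sum(started.values()),
        "succeeded": sum(succeeded.values()),
        # The same piece ID showing up several times suggests retried audits.
        "repeated": {p: n for p, n in started.items() if n > 1},
        # Audits that were started but never logged as successful.
        "never_succeeded": sorted(p for p in started if p not in succeeded),
    }
```

To run it against a real file, open the log with `open(path, encoding="utf-8", errors="replace")` and pass the file object to `summarize_audits`. A gap between `started` and `succeeded`, or any entry in `repeated`, is the pattern to look into.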

BrightSilence · August 24, 2020, 8:42am

I agree with littleskunk. Start your investigation there and try to find out what went wrong. I’m thinking IO congestion may have played a part.

As for starting over, that really depends on a few factors. Either way I would keep this node as it is healthy on all other satellites. If you don’t have a lot of free space to share, the remaining satellites will fill that up soon enough and there really is no reason to start a new node at all.

If you do have a lot of free space, you may want to get back into business with Saltlake. In that case I recommend starting a second node (of course only after you figured out what went wrong). And at least wait until that second node is vetted on all satellites (with the exception of stefan-benten, which is currently not seeing much action). Once it is vetted you can reduce the allocated size of your original node to the minimum of 500GB, so all the new data goes to the new node. You can then run graceful exit on the old one at any time if you want or just keep it up. If they share the same HDD, I recommend eventually exiting the old node and reducing it back to just one. If they run on their own HDD, you should probably just keep them both running.

marcaccioc · August 24, 2020, 4:45pm

Thanks for both of your replies.
I’ve been unable to find an audit that doesn’t complete in the logs.
I did find a couple other errors - Download Failed about once per hour and Upload failed about 10 times per day.
The Upload errors say “Context Deadline Exceeded” and usually come about 30 minutes after the upload was started.
The Download errors say “An existing connection was forcibly closed by the remote host.” These are usually tried again and are successful.
Also found “Failed to Add Bandwidth - bandwidthdb database locked” error messages.
and found “Failed to Add Order - ordersdb database locked” error messages.
So clearly my node was not as healthy as I thought it was.
Would these database issues cause DSQ? They are still happening after saltlake DSQ.
thanks.

BrightSilence · August 24, 2020, 5:08pm

Not directly, but they could be a sign of system slow down, which can cause DQ if you respond too slowly to audits as well. I would expect there to be some logging of that though. What HDD model are you using and how is it connected?
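Separately from the slowdown question, the earlier “database disk image is malformed” error can be checked with SQLite’s built-in `PRAGMA integrity_check`. Below is a minimal sketch using Python’s standard `sqlite3` module. The database file names (for example `orders.db` or `bandwidth.db`) and their location are assumptions here, and it is probably safest to run this with the node stopped so the check does not add to the locking problem.

```python
import sqlite3
from pathlib import Path

def check_sqlite_integrity(db_path):
    """Return the messages from PRAGMA integrity_check; ['ok'] means no problem was found."""
    # Open read-only through a file: URI so the check cannot modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return [f"could not open: {exc}"]
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        # A badly damaged file can fail before the check produces any rows.
        return [str(exc)]
    finally:
        conn.close()
```

For example, `check_sqlite_integrity(r"D:\storagenode\orders.db")` (a hypothetical path) returns `['ok']` for a healthy file and one or more error messages otherwise.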

marcaccioc · August 24, 2020, 7:05pm

I have several drives in a pool, all connected by SATA. I believe one of the drives is an SMR drive, so maybe that is the issue? However, all the audits I looked at in the logs appeared to be responded to within 1 second. I only looked at the audits in the day leading up to DSQ.
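Rather than eyeballing individual audits, the response time for every audit in the log could be computed with a short script like the one below. This is only a sketch: it assumes each line starts with a timestamp such as `2020-08-23T21:14:05.123Z`, and that the `"Piece ID"` field and the `download started` / `downloaded` message texts look as shown, so adjust those to match the actual log.

```python
import re
from datetime import datetime

PIECE_ID = re.compile(r'"Piece ID":\s*"([^"]+)"')
# Assumed timestamp format at the start of each line, e.g. 2020-08-23T21:14:05.123Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

def audit_durations(lines):
    """Return ([(piece_id, seconds), ...], [piece IDs started but never finished])."""
    pending = {}    # piece ID -> timestamp of the most recent "started" message
    durations = []
    for line in lines:
        if "GET_AUDIT" not in line:
            continue
        match = PIECE_ID.search(line)
        if not match:
            continue
        try:
            stamp = datetime.strptime(line.split(None, 1)[0], TS_FORMAT)
        except ValueError:
            continue  # timestamp not in the assumed format; skip the line
        piece = match.group(1)
        if "download started" in line:
            pending[piece] = stamp
        elif "downloaded" in line and piece in pending:
            elapsed = (stamp - pending.pop(piece)).total_seconds()
            durations.append((piece, elapsed))
    return durations, sorted(pending)
```

Sorting the first result by duration shows the slowest audits, and the second result lists audits that started but never logged a success, which covers the whole log instead of just the last day.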

BrightSilence · August 24, 2020, 7:49pm

I’m not sure why you’re not seeing the immediate cause of the audit failures, but the database locked errors you see tend to happen on IO constrained systems. What kind of pool are we talking about here?

Either way, you’re likely a lot better off by running separate nodes on individual HDDs. Other users have had success with that method to reduce the load on an SMR disk by spreading the amount of work it has to deal with. With pools the SMR disk can slow down the entire pool and basically cause the other HDDs to have to wait for the slow one.