The error code is:
Your node has been disqualified on 12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo. If you have any questions regarding this please check our Node Operators thread on Storj forum.
I have had a node running for a few months, slowly accumulating data, up to about 700GB. Two days ago I found that the NVMe drive my node was running on had failed, leaving the node responsive but reporting as ‘offline’, and (apparently) also failing audits. The actual data is on a mirrored ZFS drive and should be completely intact.
My ban is permanent, but only on one validator? What’s my best move here? I’m not very enthusiastic about starting over when a single day or two of downtime carries such a risk of losing all earnings.
I think I have only a dollar or two owed to me, but more important is the time spent building reputation.
Disqualification is due to data loss or corruption, not just “a single day or two of downtime”. If the only issue were downtime, your node would have many days to resolve the problem: suspension of new ingress is the first penalty, and disqualification follows only after 30 days if the downtime was over 12 days.
- As long as you are sure the data is now corruption free then you can continue to run the node and it will work as normal on the other satellites (validators).
- If you are unsure of the data quality then running Graceful Exit (GE) now could recover the held back amount rather than letting the node be disqualified on the remaining satellites.
Why would only a single satellite disqualify me? Four of the satellites have audit scores of 100%, one (asia-east-1) has a score of 96.32%, and the last (us2) has a score of 59.87%. These have stayed static since I moved my node to a new hard drive. So I think I can assume that the data is good: other satellites surely must have audited it since, and my scores haven’t changed. us2 must have done a lot of auditing during the day my node was in a broken state (responding but failing audits), while asia-east-1 only did a little? I don’t understand what governs the rate of audits. I assume that it’s random, but seeing more audit activity from us2 does not look random.
I’m certainly an edge case. This illustrates a dissimilar treatment of similar error states:
1. No response at all
2. Responding “I don’t have the data”
Since the state my node was in was only temporary, despite responding that I didn’t have the data, it should be treated more like 1, which as you stated has a 12-day grace period.
I think the easiest fix would be to have the software shut itself down when it can’t find data it’s being audited for, since something is obviously wrong. It will still be disqualified if it truly lost the data, just after a longer grace period that Storj has already judged to be acceptable for offline nodes (and acceptable for the health of the network).
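To make the suggestion concrete, here is a minimal sketch of a "fail closed" check, written as a standalone Python helper. It is not how the Storj storagenode software works; the sentinel file name `.storage-ok` and both function names are hypothetical, chosen only for illustration. The idea is that a process which cannot read a known file from its storage volume exits, so it looks offline instead of answering audits from a broken disk.

```python
import os
import sys


def storage_is_healthy(storage_dir: str, sentinel_name: str = ".storage-ok") -> bool:
    """Return True only if a known sentinel file in the storage dir can be read.

    The sentinel name is a made-up convention for this sketch.
    """
    sentinel = os.path.join(storage_dir, sentinel_name)
    try:
        # Actually read from the file instead of only checking that it exists,
        # so an I/O error on the underlying device also counts as unhealthy.
        with open(sentinel, "rb") as fh:
            fh.read(1)
        return True
    except OSError:
        return False


def guard_or_exit(storage_dir: str) -> None:
    """Stop the process instead of serving requests from a broken volume."""
    if not storage_is_healthy(storage_dir):
        sys.exit(f"storage check failed for {storage_dir}; refusing to serve")
```

A wrapper script could call `guard_or_exit` before starting the node and again on a timer. A node that really lost its data would still end up disqualified, only through the offline path with its longer grace period.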
i think this is an issue that should be fixed…
that one can time out into DQ is what i would consider a software failure.
not saying it can help you @aswang but it might help others in the future.
a hardware watchdog can avoid such issues, which reminds me i should really get mine turned back on when i reboot next time, and i should really get around to getting my OS on a mirrored SSD
not sure if that satellite has much data, it kinda sucks to lack a satellite… but it might not mean much short term at least… so eventually you might want to start a new node to replace it…
yeah i got no data from us2, might change one day… but at least from 13 months ago and until today i have had basically no traffic or data stored from the satellite you have gotten DQ on…
which is also most likely why it got DQ so fast… not much data to audit and so a few failed audits matter a lot.
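A toy calculation shows the intuition. This uses a plain pass ratio, which is an assumption made for illustration and not Storj's actual reputation formula; the audit counts are made up.

```python
def success_ratio(passed: int, failed: int) -> float:
    """Fraction of audits passed; a toy stand-in for a real reputation score."""
    total = passed + failed
    if total == 0:
        return 1.0  # no audits yet: treat as a clean record
    return passed / total


# The same 20 failed audits against very different audit volumes (made-up numbers).
busy_satellite = success_ratio(passed=1980, failed=20)
quiet_satellite = success_ratio(passed=30, failed=20)

print(f"busy:  {busy_satellite:.2%}")   # busy:  99.00%
print(f"quiet: {quiet_satellite:.2%}")  # quiet: 60.00%
```

Under this simplified model, a satellite that rarely audits the node has a short history, so the same number of failures drags its score down much further.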
i would just keep running the node… i see no reason to presently do anything.
Thanks, and good point. us2 represents a very small percentage of my data. I might just ignore it and soldier on.