Your Node Is Disqualified on the *** Satellite

Hi,

Earlier today I received an email asking me to upgrade my node software. The email stated:
You are currently on version 1.3.3 which is out of date and we don’t want your reputation impacted

So I upgraded my node using the same steps I have used for the past year:
docker stop storagenode
docker pull storjlabs/storagenode:beta
docker rm storagenode
docker run -d --restart unless-stopped … storagenode

The dashboard went from showing 276.6 GB egress / 0.9 TB ingress (2.6 TB used, 417.1 GB free)
to zeros everywhere. I didn’t think much of it.

Just now I received two emails with the same content:

Your Node Is Disqualified on the us-central-1 Satellite
Your Node Is Disqualified on the asia-east-1 Satellite
Your Node is Disqualified
Node ID: 1SetACRNrLyKNWGUYWFNXqi7FkKr4hEWZ3EvLjWJ67AeYabRAA
Your Storage Node on the us-central-1 satellite has been disqualified and can no longer host data on the network.
To store data on this satellite again, you’ll need to create a new node.

What went wrong here? What should I do next? Any advice on how to get qualified again? I have been hosting for 11 months and xx days.

Thanks for the help.

Unfortunately, disqualification is permanent. A couple of things could have happened, but the most likely is that you made a mistake somewhere in the run command and the node was looking for the stored data in the wrong place. I’m afraid this most likely means you’ll be disqualified on the remaining satellites as well. Perhaps you can prevent that if you stop the node, find out where it placed the new data, and copy that new blobs folder into the blobs folder where it should have been (a rough sketch of that copy is below). That could help the node survive on the other satellites.
If you want to get data from all satellites though, you’d have to start a new node. You could run this new node in conjunction with the old one and eventually run graceful exit on the old one when the new node is vetted, but that is up to you.
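To illustrate the blobs copy mentioned above, here is a minimal sketch. Both paths are hypothetical placeholders; use the location the misconfigured node actually wrote to and the location of your original data:

docker stop -t 300 storagenode
# /mnt/wrong-location is where the misconfigured node wrote new data (hypothetical),
# /mnt/storj/storagenode is the original storage directory (hypothetical).
rsync -a /mnt/wrong-location/storage/blobs/ /mnt/storj/storagenode/storage/blobs/
# Then start the node again, pointed at the original storage directory.

Whether this is enough to save the node depends on how many audits it has already failed.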

No suspension before getting disqualified? Scary that a node can go from being fine to permanently disqualified in such a short period of time with no warning.


Suspension only happens if the errors aren’t missing-file errors. If your node has lost files, there is no reason to trust it again to keep files safe. Yes, it’s harsh, but there is no way for the satellite to differentiate between a misconfigured node and a node that has actually lost all its data.

I bet the storage node operator would be willing and able to differentiate between the two possibilities if they received a suspension email alerting them of the problem. It might save the storj team some data repair cost if the node operator just fixed the configuration issue.

If I ever accidentally configure my node in such a way, I would hope that my node or somebody at Storj would somehow alert me so I don’t lose everything. I know a storagenode will refuse to start up if something as seemingly minor as the time/date is set wrong. Perhaps they could also make the node fail to start if the data is missing or inaccessible. At least then the operator might be alerted via email or the dashboard that their node is offline, and they could investigate and fix it.


The storagenode can’t determine whether it had data or not if the storage is inaccessible for any reason.
In that case it will fail to start.
If the storage path is accessible but the data is missing, it will start from scratch.

Since a node that does not see existing data assumes it’s new, I suppose there would need to be a way to inform the node that it’s not new and therefore should not start from scratch. It might require extra input from the user or some additional communication with the satellites.
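One way to approximate that today, outside of the node software itself, is a small wrapper that refuses to start the container unless a marker file created at setup time is present in the storage path. This is only a sketch, assuming the data lives at /mnt/storj/storagenode (a hypothetical path) and the container is named storagenode; it is not a built-in feature:

# Create the marker once, at setup time, inside the storage directory:
touch /mnt/storj/storagenode/.storage-is-mounted

# Use this check instead of starting the container directly:
if [ -f /mnt/storj/storagenode/.storage-is-mounted ]; then
    docker start storagenode
else
    echo "Storage path looks unmounted or empty; not starting the node." >&2
fi

If the disk is not mounted, the marker is not visible and the node simply stays offline, which is a recoverable state, unlike failing audits.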

Hi,

Thanks for the help. I just checked and can assure you the data is still there, on a RAID 1 setup. It seems like the Storj software had a problem loading the data.

Before the restart:

Last Contact ONLINE
Uptime       11h6m55s
                   Available       Used     Egress     Ingress
     Bandwidth           N/A     2.6 GB     1.0 GB      1.5 GB (since May 1)
          Disk      657.0 GB     1.8 TB

After the restart:

Last Contact ONLINE
Uptime       2m28s

                   Available       Used       Egress     Ingress
     Bandwidth           N/A     1.2 TB     276.6 GB      0.9 TB (since May 1)
          Disk      386.8 GB     2.6 TB

As you can see, the Used/Available figures are all filled in again like they were before yesterday’s update (except for the bandwidth section), but only a few hours after the update and after already being disqualified on:
Your Node Is Disqualified on the asia-east-1 Satellite (10 hours ago)
Your Node Is Disqualified on the us-central-1 Satellite (10 hours ago)
Your Node Is Disqualified on the europe-west-1 Satellite (9 hours ago)
Your Node Is Disqualified on the saltlake Satellite (3 hours ago)

Are there any remaining satellites, or am I disqualified on all of the existing ones?
I did not lose any data, and I think this should be treated as the node being offline for about 11 hours.
As you can see, the node did see data, just fewer TB than before the restart.

Thanks for the help.

I think there are 6 satellites total. The two not in your list are stefan-benten and europe-north-1.

Thanks mark.
I still don’t understand why I would be disqualified, because my node did show 1.8 TB of data before the restart. So I assume it was somehow missing about 0.8 TB for roughly 11 hours. I’ll keep it running and see if any more emails pop up. Could the Storj team perhaps look into re-qualifying nodes when something like this happens?

T: "2020-05-23T16:23:35.958Z" error: "file does not exist; piecestore: (Node ID: 1SetACRNrLyKNWGUYWFNXqi7FkKr4hEWZ3EvLjWJ67AeYabRAA, Piece ID: 67DWUDBTNVSUZQNJXC7FMPQL446HITHFAR3XUBVP2DDAKVW53DOA): file does not exist"

This should help you avoid making the same mistake. For some reason your storage node is missing enough pieces to get DQed.
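To see how widespread those errors are on your own node, you can grep the logs. A rough example for a docker-based node follows; the container name storagenode is an assumption, and the exact log wording can differ between versions:

docker logs storagenode 2>&1 | grep -c "file does not exist"
docker logs storagenode 2>&1 | grep GET_AUDIT | grep "file does not exist"

The first command counts all missing-file errors; the second narrows them down to failed audits.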


Hello @h0w,
Welcome to the forum!

Please show your full docker run command, with any personal information removed.
Also, please tell me: what is your OS?
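For reference (this is not the OP’s actual command), a docker-based node’s run command generally looks roughly like the following; every value here is a placeholder. The part that matters for this thread is the source= path of the second --mount, which must point at the directory that actually contains the storage/blobs folder:

docker run -d --restart unless-stopped \
    -p 28967:28967 -p 127.0.0.1:14002:14002 \
    -e WALLET="0x..." \
    -e EMAIL="you@example.com" \
    -e ADDRESS="your.external.address:28967" \
    -e STORAGE="2TB" \
    --mount type=bind,source=/mnt/storj/identity/storagenode,destination=/app/identity \
    --mount type=bind,source=/mnt/storj/storagenode,destination=/app/config \
    --name storagenode storjlabs/storagenode:beta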

Thanks all, you were right: my node was running on a partially mounted data folder, meaning it was missing 1 TB of data for about 12 hours. I will shut it down in the near future, as the node has been DQ’ed on almost all satellites and is receiving almost no traffic after running for a year. Maybe I’ll start over as suggested, though I don’t think the disqualification is justified, because I still have all the data that was stored so far. My node has very good uptime and an unmetered 1 Gbps fiber connection, so it’s kind of an unfortunate waste.

Last Contact ONLINE
Uptime       414h59m13s

               Available        Used     Egress     Ingress
 Bandwidth           N/A     15.4 GB     7.2 GB      8.2 GB (since Jun 1)
      Disk      542.0 GB      2.5 TB

Thanks for all the help!

Unfortunately the satellite has no way of differentiating between you losing data and the node looking in the wrong path. It sucks that this happened, but try to use it as a learning experience. Don’t use the root of a mounted HDD as the storage location. If you use a subfolder instead, the node won’t start if the mount is not available (see the sketch below). I hope this helps with your next node, and that you decide to give it another chance.
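A minimal sketch of that layout, assuming the disk is mounted at /mnt/storj (a hypothetical mount point): when the disk is not mounted, the subfolder does not exist, so the container fails to start instead of quietly filling up an empty mount point.

# The disk is mounted at /mnt/storj; node data is kept one level down:
mkdir -p /mnt/storj/storagenode
# In the run command, bind the subfolder, not the mount point itself:
--mount type=bind,source=/mnt/storj/storagenode,destination=/app/config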

The instance started on a 2 TB snapshot from a while ago, which was mounted at the wrong mount point.

Wouldn’t it be better for the node to re-sync the missing data from other nodes in that case, and provide the equivalent amount of bandwidth for free in return in the future? I would think that makes more sense than disqualification.

There are 2 problems with that.

  1. Repair is costly: not only do other nodes need to be paid for the egress traffic, but the satellite would have to recreate the pieces you lost. Who will pay for that?
  2. Your node lost a significant amount of data… why should the satellite trust it to keep data safe in the future with that track record?

I find the whole trust argument amusing, because in the end the satellite actually will start to trust his node again as soon as he creates a new identity, even though the hardware, the operator, and the previous track record have not necessarily changed. Sure, he’d have to go through vetting again and he’d lose his held-back amount, but it would be nice if he could rejoin with the old data under his old identity. I suspect much of the old data could still be used, as long as the satellites have not needed to repair it yet. Why not just restart the vetting on his old node/identity, let the satellites restart the timer on the held-back amount, and deduct whatever held-back amount was needed to repair pieces that fell below the repair threshold during his initial downtime and re-vetting period? Maybe pause payments during the re-vetting period as well. On the other hand, the current system is probably easier for Storj, since it’s already implemented.

Like you said, the new node would be vetted first. If preexisting issues weren’t fixed, the node would never get through that vetting phase and the risk of data loss is minimized by this new node having much less data.
The reason the satellite can’t ever accept the old node anymore is that it audits only a small fraction of the data on each node: enough to determine whether the node lost data, but not nearly enough to know which data was lost. So the satellite has no idea which pieces to repair. The only safe way to proceed is to assume everything is lost and repair everything once it falls below the repair threshold.
Additionally, there have to be real consequences to losing data. If there aren’t, node operators become careless.

That said, you can vote for this idea.

There have been several suggestions in that conversation for how to prevent the node from being disqualified in the first place if a mount point isn’t available. Hopefully they will take a feature like that into consideration. I even posted an example there of a script you could use to automatically kill your node on any audit failure (a rough sketch of the idea is below). That one is very aggressive and probably shouldn’t be used as-is, but perhaps it can be adapted into something more usable.
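For illustration only, a watchdog of that kind could look something like the following; the container name and the log phrases it greps for are assumptions, and as said it stops the node on any single failed audit, which is far too aggressive to use as-is:

#!/bin/sh
# Hypothetical watchdog: stop the node as soon as a failed audit appears in recent logs.
while true; do
    if docker logs --since 2m storagenode 2>&1 | grep GET_AUDIT | grep -q "file does not exist"; then
        echo "Failed audit detected, stopping storagenode"
        docker stop -t 300 storagenode
        exit 1
    fi
    sleep 60
done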
