Got Disqualified from Saltlake

My node finally got disqualified on three satellites because of audit timeouts.
And I can confirm: there was nothing in the logs that would have warned me about the ongoing issue. The dashboard doesn’t help either; reputation is not updated in real time, so by the time you see it, it’s too late.


Which satellites? My guess is two of them are Saltlake and Europe-North?

Saltlake, EU1, Europe-North-1.
Interestingly, the node kept registering successful audits from them right up until disqualification.
AP1 and both Americas satellites are affected too, but only down to ~96%.

I think I misread this last time I responded. I thought you asked about when a node operator had deleted a piece by accident. If a customer has deleted a piece the audit worker should already check metadata to see if the segment still exists and hasn’t expired. So it should already stop auditing the same file. But even if that isn’t the case, if not enough nodes respond with the correct piece to recreate it (so less than 29 nodes respond with a correct piece) the audit failure won’t count against your score. This was recently implemented to prevent issues when deleted or expired pieces are audited incorrectly. So in the scenario you describe there should already be 2 systems in place to prevent the node from being impacted. If the fallback triggers for some reason, you will see a failed audit in the log though, but it won’t count against your audit score.
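
If it helps to picture it, the two safeguards described above could be layered roughly like this. This is a minimal illustrative sketch, not the actual satellite audit-worker code; the function and constant names are invented, though the 29-piece threshold comes from the explanation above.

```python
# Illustrative sketch only: names are invented, not the real satellite code.
MIN_CORRECT_PIECES = 29  # below this, an audit failure is not counted

def should_count_audit_failure(segment_exists, expired, correct_responses):
    """Return True only if a failed audit should hurt the node's score."""
    # Safeguard 1: the metadata check skips deleted or expired segments,
    # so they stop being audited at all.
    if not segment_exists or expired:
        return False
    # Safeguard 2: if fewer than 29 nodes returned a correct piece, the
    # segment cannot be recreated, so the failure is not held against you
    # (it still shows up as a failed audit in the node's log).
    if correct_responses < MIN_CORRECT_PIECES:
        return False
    return True
```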

$ cat /mnt/x/storagenode3/storagenode.log | grep 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | grep -E "GET_AUDIT" | jq -R '. | split("\t") | (.[4] | fromjson) as $body | {SatelliteID: $body."Satellite ID", ($body."Piece ID"): {(.[0]): .[3]}}' | jq -s 'reduce .[] as $item ({}; . * $item)'
    "2021-08-22T22:42:40.601Z": "download started",
    "2021-08-22T22:50:28.975Z": "download started",
    "2021-08-22T23:04:59.672Z": "download canceled",
    "2021-08-22T23:11:19.397Z": "download started",
    "2021-08-22T23:15:09.613Z": "downloaded",
    "2021-08-22T23:42:01.821Z": "download started",
    "2021-08-23T00:04:21.554Z": "downloaded",
    "2021-08-23T00:30:52.619Z": "download started",
    "2021-08-23T00:35:27.385Z": "downloaded",
    "2021-08-23T00:37:23.375Z": "download started",
    "2021-08-23T01:32:05.631Z": "download started",
    "2021-08-23T01:34:09.934Z": "download started",
    "2021-08-23T01:38:56.373Z": "downloaded",
    "2021-08-23T01:56:29.461Z": "download started",
    "2021-08-23T02:09:35.964Z": "download started",
    "2021-08-23T02:10:28.193Z": "download started",
    "2021-08-23T02:19:04.072Z": "download started",
    "2021-08-23T02:37:11.049Z": "download started",
    "2021-08-23T02:40:59.384Z": "downloaded",
    "2021-08-23T03:07:47.499Z": "downloaded",
    "2021-08-23T03:13:37.528Z": "downloaded",
    "2021-08-23T03:25:25.152Z": "downloaded",
    "2021-08-23T03:31:02.822Z": "downloaded",
    "2021-08-23T03:52:26.973Z": "downloaded",
    "2021-08-23T04:01:59.479Z": "downloaded",
    "2021-08-23T04:17:16.239Z": "downloaded"

In this format it’s now obvious that there was a problem: the interval between the “GET_AUDIT” “download started” and “downloaded” entries should not be greater than 5 minutes for each piece.
And here is where the problem started:

    "2021-08-21T20:38:47.156Z": "download started",
    "2021-08-21T20:38:49.832Z": "downloaded"
    "2021-08-21T20:56:08.230Z": "download started",
    "2021-08-21T20:57:07.722Z": "downloaded"
    "2021-08-21T20:59:45.667Z": "download started",
    "2021-08-21T21:04:56.621Z": "downloaded"
    "2021-08-21T21:28:50.623Z": "download started",
    "2021-08-21T21:28:54.659Z": "downloaded"

Instead of 2 seconds, it served the piece after 4 minutes, and later the delays grew longer and longer.
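
That interval check is easy to script. Here is a rough sketch in Python, assuming log timestamps in the format shown above; the 5-minute limit mirrors the audit timeout discussed in this thread:

```python
# Rough sketch: pair each "download started" with the following "downloaded"
# and flag audits slower than the 5-minute timeout.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=5)

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def slow_audits(events):
    """events: list of (timestamp, status) tuples in log order."""
    slow = []
    started = None
    for ts, status in events:
        if status == "download started":
            started = parse(ts)
        elif status == "downloaded" and started is not None:
            if parse(ts) - started > TIMEOUT:
                slow.append((started, parse(ts)))
            started = None
    return slow

# Timestamps taken from the excerpt above.
events = [
    ("2021-08-21T20:38:47.156Z", "download started"),
    ("2021-08-21T20:38:49.832Z", "downloaded"),   # ~2 s: fine
    ("2021-08-21T20:59:45.667Z", "download started"),
    ("2021-08-21T21:04:56.621Z", "downloaded"),   # ~5 min 11 s: too slow
]
print(len(slow_audits(events)))  # one pair exceeds the timeout
```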

I figured the repair worker would only return errors it’s certain about; it seems it’s even more specific than that. So it’s good to see that verified. This shouldn’t be an issue for the solution we’ve been discussing then.

That’s interesting, so at least in your case the node was significantly slowing down before it stopped working. I wonder if it could use some of its own telemetry to detect something is wrong before it gets completely unresponsive. Something like if the average response time in the last 5 minutes is more than 10x the normal response time, kill the node to protect itself. We can’t really be sure that behavior is always like this though. It’s possible some other nodes pretty much become unresponsive instantly.
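
To make that telemetry idea concrete, here is a speculative sketch of such a watchdog: it compares a short rolling average of response times against a slow-moving baseline and signals a shutdown when the recent average exceeds 10x the baseline. The class name, window size, and smoothing factor are all invented for illustration:

```python
# Speculative sketch of a self-protection watchdog; not real storagenode code.
from collections import deque

class ResponseWatchdog:
    def __init__(self, window=100, factor=10.0):
        self.recent = deque(maxlen=window)  # short-term sample window
        self.baseline = None                # long-term average, seconds
        self.factor = factor

    def record(self, seconds):
        """Record one response time; return True if the node should shut down."""
        self.recent.append(seconds)
        avg = sum(self.recent) / len(self.recent)
        if self.baseline is None:
            self.baseline = avg
        else:
            # Slow-moving baseline so a brief spike doesn't poison it.
            self.baseline = 0.99 * self.baseline + 0.01 * seconds
        return avg > self.factor * self.baseline

wd = ResponseWatchdog(window=5)
for _ in range(50):
    wd.record(0.05)       # healthy: ~50 ms responses, watchdog stays quiet
print(wd.record(30.0))    # disk starts hanging: responses jump to 30 s
```

A real implementation would of course need tuning so that a single slow request never kills an otherwise healthy node.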

It has not stopped working. The hanging went away on its own, the same way it started.
I checked the disk and there was minor filesystem corruption.
I was forced to keep using NTFS after migrating to Ubuntu (I don’t have the free space to convert the disk).
So it’s kind of expected. I just hoped that the latest NTFS driver would be better than I remembered from the last time I tried to use NTFS on Linux under relatively high disk load (it was torrents back then).
But nothing has changed: NTFS is still a second-class citizen on Linux.

The good thing is that now I know the storagenode could detect that timeout and shut itself down, since it actually continued to run normally; the disk just became temporarily slow to respond.
Now we need a change (a pull request) for that.


Tell me about it… I still have a 2nd gen Drobo connected to my Synology. It technically supports ext3, but it already had a thin-provisioned NTFS volume, and you can’t just use external tools to convert that. The official method would require wiping the entire thing and adding the ext3 volume. So I’m still running it on NTFS over USB 2, with SATA II connections internally. To be honest, I’m surprised I haven’t run into such slow response issues on that node yet. I guess it might be because it’s relatively small. But even on a 1.75TB node, the garbage collection runs sometimes take more than 20 hours, and they actually cause quite a bit of IO wait on the host system, which kind of sucks. So yeah, I feel your pain on NTFS on Linux.

Ok so good, killing the node would definitely have helped in your scenario. Though not sure if it would have in all cases. I’m personally not versed enough in go to write a PR for this, but I think we have 2 pretty decent suggestions workshopped in this topic.

  1. Have the node crash itself to protect it when response times drop significantly from the norm
  2. Adjust containment to suspend quickly and disqualify more slowly, without letting the node get away with not delivering the data for a specific audit
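
For suggestion 2, the containment state machine might look something like this. The thresholds are made up purely for illustration and are not from any Storj design:

```python
# Invented illustration of suggestion 2: containment first suspends a node
# after a few consecutive audit timeouts (recoverable), and only
# disqualifies after many more. The node stays "contained" for the same
# piece throughout, so it cannot dodge that specific audit.
SUSPEND_AFTER = 3      # consecutive timeouts before suspension (made up)
DISQUALIFY_AFTER = 30  # consecutive timeouts before disqualification (made up)

def containment_state(consecutive_timeouts):
    if consecutive_timeouts >= DISQUALIFY_AFTER:
        return "disqualified"  # permanent
    if consecutive_timeouts >= SUSPEND_AFTER:
        return "suspended"     # recoverable once the node answers again
    return "contained"         # still being re-asked for the same piece
```

A node that recovers while suspended, like the one in this thread, would then climb back out instead of being disqualified.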

I could create separate suggestions for those two if you think that might be helpful? Collect what we’ve discussed here in a more organized manner. Then people can vote and maybe it would be a better place to find these suggestions for devs than hidden away in this rather large topic right now. I might have some time to do this later today.


It’s a 7TB node…
I might need two or three days of downtime to play the shrink / resize / move game to convert this disk…

In my case, it would really heal the node: reduce the load just enough for the disk to breathe. And it would be able to get out of suspend mode quickly as soon as the disk became responsive again.

Please go for it if you have the time. It might be even better to create a blueprint on GitHub and make a PR; that usually attracts the devs’ attention more quickly :slight_smile:


I also use an NTFS hard disk on Ubuntu! So it seems to be an NTFS issue?

YES. Please use only filesystems native to the OS in use.
For Linux that is ext4; for Windows, NTFS and maybe ReFS (see Big Windows Node file system latency and 100% activity (NTFS vs ReFS)).


Well, I didn’t find anything about that in the documentation. I have been using NTFS on Ubuntu from the very beginning, maybe for 2 years, until now. I cannot easily convert to another filesystem, because I have no empty hard disk to swap the data onto.


That was a good idea. Done now, here: Blueprint: Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive by ReneS
I didn’t think we had the node crashing itself worked out enough to write a blueprint for it, but I did include the suggestion under further considerations.

Also created a suggestion here so people can vote. Refining Audit Containment to Prevent Node Disqualification for Being Temporarily Unresponsive


The documentation gets updated when we have additional information. Even NTFS usage makes a difference: your node was affected only after 2 years on NTFS in Linux, while my 2-year-old node was affected within a week of migrating to Ubuntu.

Using stuff across platforms or brands without strict certification processes can be terrifying at the best of times.

Complex structures, machines, or code usually don’t combine well without years of people’s lives being spent on making sure things are 100% compatible.

Just to report back: My node came back to life for Europe-North this evening and is no longer disqualified on it:
Suspension 100 %
Audit 100 %
Online 95.87 %

So it seems that someone at Storj had a :heart: :heart: :heart:
Thank you!

Hopefully it will get un-disqualified on Saltlake as well, I keep my fingers crossed.

Luckily I had not yet wiped the data, so I am also posting this for other affected SNOs: maybe keep waiting and hoping.


The team decided to reinstate nodes that were disqualified because of timeouts between when the containment bug was introduced (~2021-07-24T00:00:00Z) and when the fix was deployed (2021-08-23T00:00:00Z).
The reinstatement has been applied everywhere except Saltlake.
The Saltlake satellite has resisted applying the fix, but we will try again on Monday.
If your node still has its data, it can recover its reputation; if the data is lost, the node will remain permanently disqualified.


Why not Saltlake :sob::sob::sob:


:open_mouth: :open_mouth: :open_mouth: :open_mouth: :open_mouth: :open_mouth: :open_mouth: :open_mouth: