Got disqualified from Saltlake

The constructive argument: the suspension for timeouts is too easy to exploit, and it will be exploited.
I won't describe in detail how to exploit the suspension for timeouts, so as not to make it happen or make the loophole easier to find. The details are not needed to make the suspension-for-timeouts idea better: we should not allow audits to be skipped for any reason. If you can improve the suspension idea without skipping audits, or make skipping them costly, you're welcome to.

Looks like we’re getting somewhere good.

At the time it was first mentioned, repair affecting scores kind of caught me by surprise. I don't think there is a blueprint about it, so I'm not entirely sure how it impacts the scores. How does repair deal with timeouts? I guess it would be fine if a timeout is counted as an unknown failure, but if it's counted as a straight-up audit failure without going into containment, that would be a bit problematic.


I guess a suspended node should not be paid. So there is no incentive to game audits. Instead there is an incentive to look after the node on a regular basis. On the other hand a node in suspension should only slowly move towards disqualification to give a SNO time to fix issues.

What just came to my mind: if a node gets repeatedly audited for the same piece, how would it be affected if the data gets deleted by the customer? In that case the node would not be able to present the piece for a good reason. It would be necessary to make sure the node doesn't get stuck in suspension or get punished further in such a case.

2 Likes

It only audits the same piece 10x if it times out. When there is a definitive fail response like file not found, it’s just an instant audit failure and moves on to the next one right away.
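Roughly, the decision flow could be sketched like this (a minimal sketch in Go with made-up names, not the actual satellite code):

package main

import "fmt"

// AuditOutcome is a simplified classification of a single audit attempt.
type AuditOutcome int

const (
	OutcomeSuccess AuditOutcome = iota
	OutcomeTimeout
	OutcomeFileNotFound
)

// Per the discussion above: only timeouts cause the same piece to be retried.
const maxReverifyAttempts = 10

// handleAudit sketches how a timeout keeps the node in containment for
// repeated reverification of the same piece, while a definitive
// "file not found" fails the audit immediately.
func handleAudit(outcome AuditOutcome, reverifyCount int) string {
	switch outcome {
	case OutcomeSuccess:
		return "audit passed, release from containment"
	case OutcomeFileNotFound:
		return "definitive failure: counts against the audit score immediately"
	case OutcomeTimeout:
		if reverifyCount+1 >= maxReverifyAttempts {
			return "too many timeouts on the same piece: treated as an audit failure"
		}
		return "stay in containment, reverify the same piece later"
	default:
		return "unknown outcome"
	}
}

func main() {
	fmt.Println(handleAudit(OutcomeTimeout, 3))
	fmt.Println(handleAudit(OutcomeFileNotFound, 0))
}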

If a piece is deleted by the customer, it should be removed from the audit queue too, so it's unlikely to be audited once it's marked as to-be-deleted (or already deleted).

1 Like

I just wanted to raise awareness of the scenario where a node cannot provide a piece, gets put into suspension mode, and then the piece gets deleted by the customer. In that case the ongoing suspension, leaving suspension mode, or disqualification should not depend on those specific pieces, as the node will never be able to provide them anymore.

We are discussing disqualification for audit timeouts. If a deleted piece were audited, the node would return "file not found" and immediately fail the audit.
Suspension for missing files is even more dangerous for the network than suspension for timeouts.

My node finally got disqualified on three satellites because of audit timeouts.
And I can confirm: there was nothing in the logs that would have warned me about the ongoing issue. The dashboard doesn't help either; the reputation is not updated in real time, so by the time you see it, it's too late.

1 Like

Which satellites? My guess is that two of them are Saltlake and Europe-North?

Saltlake, EU1, and Europe-North-1.
Interestingly, the node kept registering successful audits from them right up until disqualification.
AP1 and both Americas satellites are affected too, but are only down to ~96%.

1 Like

I think I misread this last time I responded. I thought you asked about when a node operator had deleted a piece by accident. If a customer has deleted a piece the audit worker should already check metadata to see if the segment still exists and hasn’t expired. So it should already stop auditing the same file. But even if that isn’t the case, if not enough nodes respond with the correct piece to recreate it (so less than 29 nodes respond with a correct piece) the audit failure won’t count against your score. This was recently implemented to prevent issues when deleted or expired pieces are audited incorrectly. So in the scenario you describe there should already be 2 systems in place to prevent the node from being impacted. If the fallback triggers for some reason, you will see a failed audit in the log though, but it won’t count against your audit score.
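Putting the two safeguards together, a rough sketch could look like this (hypothetical names, with the 29-piece threshold taken from the explanation above; this is not the real satellite code):

package main

import "fmt"

// Minimum number of correct pieces needed to recreate a segment, as
// mentioned above.
const minPiecesToRecreate = 29

type SegmentStatus struct {
	Deleted bool
	Expired bool
}

// shouldCountAuditFailure returns false whenever one of the two safeguards
// applies: the segment is gone from metadata, or too few nodes returned a
// correct piece to recreate it, so the failure is not held against the node.
func shouldCountAuditFailure(seg SegmentStatus, correctPieces int) bool {
	if seg.Deleted || seg.Expired {
		// safeguard 1: the segment no longer exists, skip the audit entirely
		return false
	}
	if correctPieces < minPiecesToRecreate {
		// safeguard 2: the segment can't be recreated, don't penalize the node
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldCountAuditFailure(SegmentStatus{Deleted: true}, 40)) // false
	fmt.Println(shouldCountAuditFailure(SegmentStatus{}, 20))              // false
	fmt.Println(shouldCountAuditFailure(SegmentStatus{}, 40))              // true
}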

1 Like
$ cat /mnt/x/storagenode3/storagenode.log | grep 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE | grep -E "GET_AUDIT" | jq -R '. | split("\t") | (.[4] | fromjson) as $body | {SatelliteID: $body."Satellite ID", ($body."Piece ID"): {(.[0]): .[3]}}' | jq -s 'reduce .[] as $item ({}; . * $item)'
...
  "EY6AAGG33LEZK3O7NIM2S3R5VEPXRBIYCKPWYSV5LDWNMHFYF7EA": {
    "2021-08-22T22:42:40.601Z": "download started",
    "2021-08-22T22:50:28.975Z": "download started",
    "2021-08-22T23:04:59.672Z": "download canceled",
    "2021-08-22T23:11:19.397Z": "download started",
    "2021-08-22T23:15:09.613Z": "downloaded",
    "2021-08-22T23:42:01.821Z": "download started",
    "2021-08-23T00:04:21.554Z": "downloaded",
    "2021-08-23T00:30:52.619Z": "download started",
    "2021-08-23T00:35:27.385Z": "downloaded",
    "2021-08-23T00:37:23.375Z": "download started",
    "2021-08-23T01:32:05.631Z": "download started",
    "2021-08-23T01:34:09.934Z": "download started",
    "2021-08-23T01:38:56.373Z": "downloaded",
    "2021-08-23T01:56:29.461Z": "download started",
    "2021-08-23T02:09:35.964Z": "download started",
    "2021-08-23T02:10:28.193Z": "download started",
    "2021-08-23T02:19:04.072Z": "download started",
    "2021-08-23T02:37:11.049Z": "download started",
    "2021-08-23T02:40:59.384Z": "downloaded",
    "2021-08-23T03:07:47.499Z": "downloaded",
    "2021-08-23T03:13:37.528Z": "downloaded",
    "2021-08-23T03:25:25.152Z": "downloaded",
    "2021-08-23T03:31:02.822Z": "downloaded",
    "2021-08-23T03:52:26.973Z": "downloaded",
    "2021-08-23T04:01:59.479Z": "downloaded",
    "2021-08-23T04:17:16.239Z": "downloaded"
  }

In this format it's now obvious that there was a problem: the interval between the GET_AUDIT "download started" and "downloaded" entries should not be greater than 5 minutes for each piece.
And here is where the problem started:

  "LOK6MJKTW7OFFC76SHFV5M7LHGEZSJOG4TQHEO7XYDX2EUOYQ4XA": {
    "2021-08-21T20:38:47.156Z": "download started",
    "2021-08-21T20:38:49.832Z": "downloaded"
  },
  "6VLHZ7JZVYDRB3K7ZM55DIRUPOOBSELFX6NXRVTYIALOV4LUL32Q": {
    "2021-08-21T20:56:08.230Z": "download started",
    "2021-08-21T20:57:07.722Z": "downloaded"
  },
  "TIEF7N6K7ZQUYMUUHJBM4J6QIB44GPKUTCEAR52TGUYF332YGZTQ": {
    "2021-08-21T20:59:45.667Z": "download started",
    "2021-08-21T21:04:56.621Z": "downloaded"
  },
  "A2PLV7FUBDMJDGZP3MY4XYUS36X7OVMSLRXEBZMS2BLCMPBJOEDQ": {
    "2021-08-21T21:28:50.623Z": "download started",
    "2021-08-21T21:28:54.659Z": "downloaded"
  },

Instead of 2 seconds, it served the piece only after more than 4 minutes, and later it took longer and longer.
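If anyone wants to check their own logs for the same pattern, here is a rough sketch in Go, assuming the same tab-separated log format as the jq one-liner above. It reads the log from stdin and flags GET_AUDIT pieces whose gap between "download started" and "downloaded" exceeds 5 minutes:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

// logBody is the JSON payload at the end of each log line; the field names
// match the keys used by the jq command above.
type logBody struct {
	PieceID string `json:"Piece ID"`
	Action  string `json:"Action"`
}

func main() {
	const limit = 5 * time.Minute
	started := map[string]time.Time{} // Piece ID -> last "download started"

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		fields := strings.Split(scanner.Text(), "\t")
		if len(fields) < 5 {
			continue
		}
		var body logBody
		if err := json.Unmarshal([]byte(fields[4]), &body); err != nil || body.Action != "GET_AUDIT" {
			continue
		}
		ts, err := time.Parse(time.RFC3339Nano, fields[0])
		if err != nil {
			continue
		}
		switch fields[3] {
		case "download started":
			started[body.PieceID] = ts
		case "downloaded":
			if start, ok := started[body.PieceID]; ok {
				if gap := ts.Sub(start); gap > limit {
					fmt.Printf("slow audit: piece %s took %s\n", body.PieceID, gap)
				}
				delete(started, body.PieceID)
			}
		}
	}
}

You could pipe the log into it, e.g. cat /mnt/x/storagenode3/storagenode.log | go run slow_audits.go (slow_audits.go being whatever you name the file).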

I figured the repair worker would only return errors it's certain about; it seems to be even more specific than that. It's good to see that verified. This shouldn't be an issue for the solution we've been discussing, then.

That’s interesting, so at least in your case the node was significantly slowing down before it stopped working. I wonder if it could use some of its own telemetry to detect something is wrong before it gets completely unresponsive. Something like if the average response time in the last 5 minutes is more than 10x the normal response time, kill the node to protect itself. We can’t really be sure that behavior is always like this though. It’s possible some other nodes pretty much become unresponsive instantly.
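Something along these lines, as a toy sketch; the window and the 10x threshold are made-up values, and this is not a patch against the actual storagenode code:

package main

import (
	"fmt"
	"time"
)

// watchdog keeps a slowly-updated "normal" latency and a short window of
// recent samples.
type watchdog struct {
	baseline time.Duration
	recent   []time.Duration
}

// observe records one response time and reports whether the node should
// shut itself down because the recent average exceeds 10x the baseline.
func (w *watchdog) observe(d time.Duration) bool {
	if w.baseline == 0 {
		w.baseline = d // first sample seeds the baseline
	}

	// keep a short window of recent samples (stand-in for "the last 5 minutes")
	w.recent = append(w.recent, d)
	if len(w.recent) > 50 {
		w.recent = w.recent[1:]
	}
	var sum time.Duration
	for _, r := range w.recent {
		sum += r
	}
	avg := sum / time.Duration(len(w.recent))

	tooSlow := avg > 10*w.baseline

	// update the baseline slowly (a simple exponential moving average)
	w.baseline = (w.baseline*99 + d) / 100

	return tooSlow
}

func main() {
	w := &watchdog{}
	for _, d := range []time.Duration{
		20 * time.Millisecond, 25 * time.Millisecond, 22 * time.Millisecond,
	} {
		w.observe(d)
	}
	// a burst like the multi-minute audit responses in the log above
	fmt.Println("shut down?", w.observe(4*time.Minute))
}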

It hadn't stopped working. The hanging went away on its own, the same way it started.
I checked the disk and there was some minor filesystem corruption.
I'm forced to keep using NTFS after the migration to Ubuntu (I don't have enough free space to convert the disk).
So it's kind of expected. I had just hoped that the latest NTFS driver would be better than I remember from the last time I tried to use NTFS on Linux under a relatively high disk load (it was torrents back then).
But nothing has changed. NTFS is still a second-class citizen on Linux.

The good thing is that now I know the storagenode could detect that timeout and shut itself down, since it actually continued to run normally; the disk just became temporarily slow to respond.
Now we need a change (a Pull Request) for that.

1 Like

Tell me about it… I still have a 2nd gen Drobo connected to my Synology. It technically supports ext3, but it already had a thin provisioned NTFS volume and you can’t just use external tools to convert that. The official method would require wiping the entire thing and adding the ext3 volume. So I’m still running it on NTFS over USB2, with SATA II connections internally. To be honest, I’m surprised I never ran into such slow response issues on that node yet. I guess it might be because it is relatively small. But even on a 1.75TB node, the garbage collection runs take more than 20 hours sometimes. And it’s actually causing quite a bit of IO wait on the host system, which kind of sucks. So yeah, I feel your pain on NTFS on Linux.

Ok so good, killing the node would definitely have helped in your scenario. Though not sure if it would have in all cases. I’m personally not versed enough in go to write a PR for this, but I think we have 2 pretty decent suggestions workshopped in this topic.

  1. Have the node crash itself to protect it when response times drop significantly from the norm
  2. Adjust containment to quickly suspend, and more slowly disqualify, without letting the node get out of delivering data for a specific audit
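For suggestion 2, what I have in mind is something like feeding the same audit results into two scores with different memory. A toy sketch with made-up lambdas and threshold (not the satellite's real reputation code):

package main

import "fmt"

// Two reputation scores fed by the same audit results.
type scores struct {
	suspension       float64 // fast-moving: drops quickly on timeouts
	disqualification float64 // slow-moving: gives the SNO time to fix things
}

// update applies one audit result (true = passed) to both scores using an
// exponential-decay form: score = lambda*score + (1-lambda)*result.
func (s *scores) update(passed bool) {
	v := 0.0
	if passed {
		v = 1.0
	}
	const fastLambda, slowLambda = 0.7, 0.98 // made-up lambdas for illustration
	s.suspension = fastLambda*s.suspension + (1-fastLambda)*v
	s.disqualification = slowLambda*s.disqualification + (1-slowLambda)*v
}

func main() {
	s := &scores{suspension: 1, disqualification: 1}
	for i := 0; i < 10; i++ { // ten timed-out (failed) audits in a row
		s.update(false)
	}
	fmt.Printf("suspension score:       %.2f (already below a 0.6 threshold -> suspend)\n", s.suspension)
	fmt.Printf("disqualification score: %.2f (still well above 0.6 -> time left to fix the node)\n", s.disqualification)
}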

I could create separate suggestions for those two if you think that might be helpful? Collect what we’ve discussed here in a more organized manner. Then people can vote and maybe it would be a better place to find these suggestions for devs than hidden away in this rather large topic right now. I might have some time to do this later today.

1 Like

It's a 7TB node…
I'd probably need two or three days of downtime to play the shrink / resize / move game to convert this disk…

In my case, it would really heal the node - reduce the load just enough for the disk to breathe. And it would be able to quickly get out of suspend mode as soon as the disk became responsive again.

Please go for it if you have the time. It might be even better to create a blueprint on GitHub and submit it as a PR; that usually attracts the devs' attention more quickly :slight_smile:

1 Like

I also use an NTFS hard disk on Ubuntu! So it seems to be an NTFS issue?

YES. Please use only filesystems native to the OS in use.
For Linux that's ext4; for Windows, NTFS and maybe ReFS (see Big Windows Node file system latency and 100% activity (NTFS vs ReFS)).

1 Like