Reputation loss during 1 week of downtime

twl · August 5, 2020, 5:37am

I had to take one of my nodes offline in order to robocopy its blobs to a new hard drive. Unfortunately, it has only managed to copy 1.8TB since Sunday, so it will take another 4-5 days to complete.

Will this sort of prolonged downtime have an impact on my node’s reputation in any way? I am aware that downtime DQ is not in effect.

Would it be smarter to start the node up again right now and just deal with the copying taking even longer? I am afraid the next robocopy run will take forever as well… My plan is to let the current robocopy finish, then run a second one (since I started the first one while the node was still up).

Toyoo · August 5, 2020, 6:54am

I think someone here reported being offline for 5-6 days and getting disqualified.

Copying while being online will be faster if you stop accepting new data, e.g. by reducing the allocated disk space to below your current amount of data.

twl · August 5, 2020, 6:58am

I already did that a while ago, since my SMR HDD was busy all the time.
Currently, without the node running, I see read rates of less than 10 MB/s, often even below 1 MB/s.

Alexey · August 5, 2020, 7:57am

This is not true.
Your node can be disqualified only for failed audits right now. Your node can fail audits if it lost data or access to it.

twl · August 5, 2020, 8:52am

So it can’t fail audits while it is offline?

Alexey · August 5, 2020, 8:54am

Not at the moment:

To fail an audit, your node has to:

Be online and answer the audit request
Fail to provide the correct piece for the audit 4 times:
1. The first timeout puts the storagenode into containment mode.
2. The node is asked for the same piece three more times.
3. If it still can’t provide the requested piece, the audit is treated as failed.
Do the same a few times in a row
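
The containment steps above can be sketched as a small counter. This is only an illustration of the description in this post, not the satellite's actual code: `PendingAudit`, `record_attempt` and `MAX_ATTEMPTS` are made-up names, and the sketch assumes that a response which does not time out also contains the correct piece.

```python
from dataclasses import dataclass

# First audit plus three re-audits of the same piece, as described above.
MAX_ATTEMPTS = 4


@dataclass
class PendingAudit:
    """Hypothetical record of one piece that a node keeps timing out on."""
    piece_id: str
    timeouts: int = 0


def record_attempt(pending: PendingAudit, timed_out: bool) -> str:
    """Return the node's state for this piece after one audit attempt."""
    if not timed_out:
        # Assumption: the piece that was returned is also correct.
        return "passed"
    pending.timeouts += 1
    if pending.timeouts >= MAX_ATTEMPTS:
        return "failed"
    return "contained"
```

A node that times out four times in a row on the same piece goes through "contained", "contained", "contained" and then "failed". A single failed audit is not yet a disqualification; that takes several of them.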

SGC · August 5, 2020, 9:50am

copying storagenode files takes a long time because it’s a ton of IO, i think i got a ratio of 1tb to 500k files

you can run the node while copying tho… ofc this will put more load on the hdd you are copying from.
and extend the total copy time…

NOTE : THE NODE HAS TO BE OFFLINE FOR THE FINAL COPY / SYNC

i usually run rsync while the node is running, then run it a few times so it’s all up to date… and when it finishes the rsync in like 30min or less, then i shutdown the node and run a final rsync with the delete parameter so both folders are an exact match… then i ofc verify the number of files and total space…
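
A rough Python sketch of the routine described above, for anyone who wants to script it. The function names, the 30 minute threshold as a parameter, and the example paths are assumptions for illustration. Only `rsync -a` and `rsync --delete` are real rsync options. Stopping the node before the final pass is left out and has to be done by hand.

```python
import subprocess
import time


def rsync_command(src, dst, final=False):
    """Build the rsync call; the final pass adds --delete for an exact match."""
    cmd = ["rsync", "-a"]
    if final:
        cmd.append("--delete")
    # Trailing slashes: copy the contents of src into dst.
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd


def sync_until_quick(src, dst, quick_seconds=30 * 60,
                     run=subprocess.run, clock=time.monotonic):
    """Repeat rsync while the node is online until one pass is quick enough.

    Returns the number of passes that were needed.
    """
    passes = 0
    while True:
        start = clock()
        run(rsync_command(src, dst), check=True)
        passes += 1
        if clock() - start <= quick_seconds:
            return passes
```

Typical use would be `sync_until_quick("/old/storage", "/new/storage")` with the node still running, then stopping the node and running `subprocess.run(rsync_command("/old/storage", "/new/storage", final=True), check=True)` once. Both paths are placeholders.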

not always easy tho… because zfs… so a couple of times i just trusted it was fine
and then simply spin up the new node, monitor the logs for problems… and if it seems good for a day i will delete the old folder.

ofc i’m sure everybody has their own process in regard to this… oh yeah and then i ofc scrub both pools before beginning and scrub the target pool when the copy is pretty much complete… to make sure there isn’t any errors

twl · August 5, 2020, 11:05am

It mostly takes a long time because of the slow HDD. My other HDD managed a stable 150 MB/s (that’s the limitation of my old USB 3.0 enclosure) while copying a node.

As I said, I’m worried it won’t ever finish, so I considered it wiser to just keep it offline and let the poor drive finish copying.
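
To put the rates mentioned in this thread side by side, here is a back-of-the-envelope helper. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly steady rate, which a drive busy with hundreds of thousands of small files will not deliver. `copy_hours` is a made-up name.

```python
def copy_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy size_tb terabytes at a steady rate (decimal units)."""
    size_bytes = size_tb * 1e12
    rate_bytes_per_s = rate_mb_per_s * 1e6
    return size_bytes / rate_bytes_per_s / 3600
```

With these assumptions, `copy_hours(1.8, 10)` comes out at about 50 hours and `copy_hours(1.8, 150)` at a bit over 3 hours, so 1.8 TB taking several days corresponds to an average below 10 MB/s.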

BrightSilence · August 5, 2020, 1:20pm

Correct me if I’m wrong, but I think this conflates 2 different things.

If the node responds with “I don’t have the file” or “here it is”, but it’s corrupt, it will count as a failed audit right away and count against the audit score. Which will eventually disqualify the node if it drops too low.
If it times out or returns some other error, it will be retried 3 more times and then count as a failed audit for the suspension score. Which will eventually suspend the node if it drops too low. Your node can recover from suspension if the issue is solved.

Alexey · August 5, 2020, 7:24pm

The suspension is applied only for answering with an “unknown” error, i.e.:

not a timeout (that is handled by containment mode);
not “file not found” (this is immediately counted as failed);
not “wrong hash” (this is immediately counted as failed).
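
As a compact summary of how each kind of audit response is treated according to this thread, here is a small lookup sketch. The enum and the wording of the consequences are made up for illustration and are not taken from the satellite code. The "offline" entry reflects the statement earlier in the thread that downtime is not penalised at the moment.

```python
from enum import Enum


class Response(Enum):
    CORRECT = "correct data"
    FILE_NOT_FOUND = "file not found"
    WRONG_HASH = "wrong hash"
    TIMEOUT = "timeout"
    OFFLINE = "offline"
    UNKNOWN = "unknown error"


# What each response leads to, as described by the posters in this thread.
CONSEQUENCE = {
    Response.CORRECT: "audit success",
    Response.FILE_NOT_FOUND: "audit failure",
    Response.WRONG_HASH: "audit failure",
    Response.TIMEOUT: "containment",
    Response.OFFLINE: "no penalty at the moment",
    Response.UNKNOWN: "counts toward suspension",
}


def consequence(response: Response) -> str:
    """Look up how a single audit response is treated."""
    return CONSEQUENCE[response]
```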

github.com/storj/storj/blob/d654ab5fa0b8f5cb17a621ebb506be45799aa0ff/docs/blueprints/audit-suspend.md

# Storagenode "Suspension" State Blueprint

## Introduction

Currently, when a storagenode is audited for an erasure share, there are five possible outcomes:

1. Success: The node responds with the correct data
2. Failure: The node responds with incorrect data
3. Offline: The node cannot be contacted
4. Contained: The node can be contacted, but the connection times out before all the data can be received by the satellite
5. Unknown: The node responds with any other error

Only cases 1 and 2 directly affect a node's audit reputation, which can cause disqualification.

When the [downtime tracking service](./storage-node-downtime-tracking.md) is fully implemented, case 3 can indirectly cause a disqualification.

Case 4 can also indirectly cause disqualification, since a node placed in containment mode will be re-audited at some point with the same 5 potential outcomes.

Case 5 is the only situation where there is currently no potential penalty for responding to an audit with some type of error. Fortunately, having this case has allowed us to find, diagnose, and fix several problems with storagenodes, increasing network durability. Unfortunately, it allows us to perceive nodes that consistently respond to audits with unknown errors as "healthy", giving us an inflated view of durability.

This file has been truncated.

BrightSilence · August 5, 2020, 7:49pm

You’re right of course, my bad. So a timeout eventually counts against the audit score and can thus lead to disqualification.

Alexey · August 5, 2020, 7:52pm

Exactly. But not from the first timeout, only after four attempts.
This is why I always say that suspension in the current implementation can’t be used instead of containment or disqualification. It’s a third option.

Toyoo · August 5, 2020, 9:10pm

Why do you say that “I think” is untrue?

BrightSilence · August 5, 2020, 9:26pm

You may think it, someone may have claimed that, but it isn’t true. There is currently no disqualification for downtime.

Toyoo · August 5, 2020, 9:33pm

Then I’m sad Alexey can’t say it in a more polite way than accusing me of not thinking.

BrightSilence · August 5, 2020, 9:36pm

I am absolutely 100% sure you’re misinterpreting that. He was simply saying that what was claimed isn’t true. I’m pretty sure neither of you is speaking your native language (neither am I). It’s best to assume the best intentions in those scenarios. Especially here, where people’s intention is merely to help each other out.

Alexey · August 6, 2020, 5:21am

I am sorry if I offended you, I had no such intention.
I said that other people’s claims of being disqualified due to downtime are false.

twl · August 6, 2020, 5:57am

I didn’t read it that way. He is a polite guy.

Toyoo · August 6, 2020, 8:45pm

Sorry I reacted this way. I guess I had a rough day.

twl · August 17, 2020, 5:57am

Update:

I brought the node back online only 2 days ago and so far, everything’s running fine.