Reputation loss during 1 week of downtime

I had to take one of my nodes offline in order to robocopy its blobs to a new hard drive. Unfortunately, it has only managed to copy 1.8 TB since Sunday, so it will take another 4-5 days to complete.

Will this sort of prolonged downtime have any impact on my node’s reputation? I am aware that downtime disqualification is not in effect.

Would it be smarter to start the node up again right now and just accept that the copying will take even longer? I am afraid the next robocopy run will take forever as well… My plan is to let the current robocopy finish, then run a second pass (since I started the first one while the node was still up).
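For reference, the two passes I have in mind would look roughly like this (drive letters and paths are placeholders, and the flags are just one reasonable choice, not something from the docs):

    :: first pass, can run while the node is still up; /E copies all subfolders,
    :: /MT uses several copy threads, /R:1 /W:1 avoids long retry stalls
    robocopy D:\storagenode\storage E:\storagenode\storage /E /MT:8 /R:1 /W:1 /NFL /NDL /NP

    :: second pass with the node stopped; /MIR only copies what changed and
    :: removes files on the target that the node deleted in the meantime
    robocopy D:\storagenode\storage E:\storagenode\storage /MIR /MT:8 /R:1 /W:1 /NFL /NDL /NP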

I think someone here reported being offline for 5-6 days and getting disqualified.

Copying while the node is online will be faster if you stop accepting new data, e.g. by reducing the allocated disk space to below the amount of data you currently store.
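If you run the node in Docker, that just means re-creating the container with a STORAGE value below what you already hold; a rough sketch, where every value and path is a placeholder for your usual parameters (on the Windows GUI install I believe the equivalent is lowering storage.allocated-disk-space in config.yaml and restarting the service):

    # stop and remove the container, then re-create it with an allocation
    # smaller than the data already stored, so the node stops taking new pieces
    docker stop -t 300 storagenode
    docker rm storagenode
    docker run -d --restart unless-stopped --stop-timeout 300 \
        -p 28967:28967 -p 14002:14002 \
        -e WALLET="0x0000000000000000000000000000000000000000" \
        -e EMAIL="you@example.com" \
        -e ADDRESS="your.ddns.example.com:28967" \
        -e STORAGE="1.5TB" \
        --mount type=bind,source=/path/to/identity,destination=/app/identity \
        --mount type=bind,source=/path/to/storage,destination=/app/config \
        --name storagenode storjlabs/storagenode:latest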

I already did that a while ago, since my SMR HDD was busy all the time :slight_smile:
Currently, even without the node running, I see read rates of less than 10 MB/s, often even below 1 MB/s.

This is not true.
Your node can be disqualified only for failed audits right now, and it can fail audits only if it has lost data or lost access to it.

So it can’t fail audits while it is offline?

Not at the moment:

To fail an audit, your node must:

  1. Be online and answer the audit request.
  2. Fail to provide the correct piece for the audit four times:
    1. The first timeout puts the storage node into containment mode.
    2. The node is then asked for the same piece three more times.
    3. If it still cannot provide the requested piece, the audit is treated as failed.
  3. Do the same several times in a row.

copying storagenode files takes a long time because it’s a ton of IO; i think i saw a ratio of about 500k files per TB

you can run the node while copying tho… ofc this will put more load on the hdd you are copying from.
and extend the total copy time…

NOTE : THE NODE HAS TO BE OFFLINE FOR THE FINAL COPY / SYNC

i usually run rsync while the node is running, then run it a few more times so it’s all up to date… and when a pass finishes in like 30 min or less, i shut down the node and run a final rsync with the --delete parameter so both folders are an exact match… then i ofc verify the number of files and total space…
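roughly like this, just as an example… paths are made up, and the trailing slashes matter so rsync syncs the folder contents instead of nesting a copy:

    # first pass(es) while the node is still running; repeat until a pass
    # only has recent changes left and finishes quickly
    rsync -aH --info=progress2 /oldpool/storagenode/ /newpool/storagenode/
    rsync -aH --info=progress2 /oldpool/storagenode/ /newpool/storagenode/

    # final pass with the node stopped; --delete removes anything on the
    # target that the node deleted since the earlier passes
    docker stop -t 300 storagenode
    rsync -aH --delete --info=progress2 /oldpool/storagenode/ /newpool/storagenode/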

not always easy tho… because zfs… so a couple of times i just trusted it was fine :smiley:
and then simply spin up the new node, monitor the logs for problems… and if it seems good for a day i will delete the old folder.

ofc i’m sure everybody has their own process in regard to this… oh yeah, and i ofc scrub both pools before beginning and scrub the target pool when the copy is pretty much complete… to make sure there aren’t any errors
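for completeness, the checks i mean are basically these… pool names and paths are just examples:

    # scrub source and target pools and check the progress / results
    zpool scrub oldpool
    zpool scrub newpool
    zpool status -v oldpool newpool

    # sanity check: file count and used space should roughly match on both copies
    find /oldpool/storagenode -type f | wc -l
    find /newpool/storagenode -type f | wc -l
    du -sh /oldpool/storagenode /newpool/storagenode

    # after starting the new node, watch the logs for errors for a while
    docker logs --tail 100 -f storagenode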

It mostly takes a long time because of the slow HDD. My other HDD managed a stable 150 MB/s (that’s the limitation of my old USB 3.0 enclosure) while copying a node.

As I said, I’m worried it won’t ever finish, so I considered it wiser to just keep it offline and let the poor drive finish copying.

Correct me if I’m wrong, but I think this conflates two different things.

  • If the node responds with “I don’t have the file”, or responds with a piece that turns out to be corrupt, that counts as a failed audit right away and counts against the audit score, which will eventually disqualify the node if it drops too low.
  • If the request times out or fails with some other error, it will be retried 3 more times and then counted as a failed audit against the suspension score, which will eventually suspend the node if it drops too low. Your node can recover from suspension once the issue is solved.

Suspension is applied only when the node answers with an “unknown” error, i.e.:

  • not a timeout (that is handled by containment mode);
  • not “file not found” (this is immediately counted as a failed audit);
  • not a “wrong hash” (this is also immediately counted as a failed audit).

You’re right of course, my bad. So a timeout eventually counts against the audit score and thus towards disqualification.

Exactly. But not from the first timeout, only after four attempts.
This is why I always say that suspension in the current implementation can’t be used instead of containment or disqualification. It’s a third option.

Why do you say it is untrue I think?

You may think it, someone may have claimed that, but it isn’t true. There is currently no disqualification for downtime.

Then I’m sad Alexey can’t say it in a more polite way than accusing me of not thinking.

I am absolutely 100% sure you’re misinterpreting that. He was simply saying that what was claimed isn’t true. I’m pretty sure neither of you is speaking your native language (neither am I), so it’s best to assume good intentions in these situations, especially here, where people’s intention is merely to help each other out.

I am sorry if I offended you, I had no such intention.
I said that other people’s claims of being disqualified due to downtime are false.

I didn’t read it that way. He is a polite guy :slight_smile:

Sorry I reacted this way. I guess I had a rough day.

Update:

I brought the node back online only 2 days ago, and so far everything’s running fine.