I had to take one of my nodes offline in order to robocopy its blobs to a new hard drive. Unfortunately, it has only managed to copy 1.8 TB since Sunday, so it will take another 4-5 days to complete.
Will this sort of prolonged downtime have an impact on my node’s reputation in any way? I am aware that downtime DQ is not in effect.
Would it be smarter to start the node up again right now and just deal with the copying taking even longer? I am afraid the next robocopy run will take forever as well. My plan is to let the current robocopy finish, then run a second one (since I started the first one while the node was still up).
I think someone here reported being offline for 5-6 days and getting disqualified.
Copying while being online will be faster if you stop accepting new data, e.g. by reducing the allocated disk space to below your current amount of data.
I already did that a while ago, since my SMR HDD was busy all the time.
Currently, without the node running, I see read rates of less than 10 MB/s, often even below 1 MB/s.
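To put those read rates in perspective, here is a rough back-of-the-envelope sketch. It assumes a perfectly steady transfer rate and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function name `copy_time_hours` is made up for illustration, and real copies of many small files will be less steady than this.

```python
def copy_time_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy size_tb terabytes at a steady rate (decimal units)."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    size_bytes = size_tb * 1e12          # 1 TB = 10^12 bytes (assumption: decimal units)
    rate_bytes_per_s = rate_mb_per_s * 1e6
    return size_bytes / rate_bytes_per_s / 3600


# Compare the two read rates mentioned above, per terabyte copied.
for rate in (1, 10):
    hours = copy_time_hours(1.0, rate)
    print(f"{rate:>3} MB/s: {hours:6.1f} h ({hours / 24:.1f} days) per TB")
```

At a steady 10 MB/s this works out to a bit over a day per terabyte, and at 1 MB/s to well over a week per terabyte.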
Copying storagenode files takes a long time because it’s a ton of IO. I think I got a ratio of about 500k files per 1 TB.
You can run the node while copying, though. Of course, this puts more load on the HDD you are copying from
and extends the total copy time.
NOTE : THE NODE HAS TO BE OFFLINE FOR THE FINAL COPY / SYNC
I usually run rsync while the node is running, then repeat it a few times until everything is up to date. Once a pass finishes in about 30 minutes or less, I shut down the node and run a final rsync with the delete parameter so both folders are an exact match. Then I verify the number of files and the total space used.
That’s not always easy, though, because of ZFS, so a couple of times I just trusted it was fine.
Then I simply spin up the new node and monitor the logs for problems. If it looks good for a day, I delete the old folder.
I’m sure everybody has their own process for this. Oh yeah, I also scrub both pools before beginning and scrub the target pool when the copy is pretty much complete, to make sure there aren’t any errors.
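For the "verify the number of files and total space" step, something like the following could work. It is only a minimal sketch using Python's standard library: the helper names `tree_stats` and `trees_match` are hypothetical, it compares logical file sizes rather than on-disk usage (which can differ between filesystems), and it does not compare file contents.

```python
import os


def tree_stats(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for all regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so they are not counted as data files
            count += 1
            total += os.path.getsize(path)  # logical size, not blocks on disk
    return count, total


def trees_match(src: str, dst: str) -> bool:
    """True if both trees hold the same number of files and the same byte total."""
    return tree_stats(src) == tree_stats(dst)


# Example call (both paths are placeholders, not real node paths):
# print(trees_match("/mnt/old/storage", "/mnt/new/storage"))
```

Run it only after the final sync with the node stopped, otherwise the source keeps changing underneath the count. With several hundred thousand files per terabyte, the walk itself will take a while on a slow drive.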
It mostly takes a long time because of the slow HDD. My other HDD managed a stable 150 MB/s (that’s the limit of my old USB 3.0 enclosure) while copying a node.
As I said, I’m worried it won’t ever finish, so I considered it wiser to just keep it offline and let the poor drive finish copying.
Correct me if I’m wrong, but I think this conflates 2 different things.
If the node responds with “I don’t have the file”, or with “here it is” but the data is corrupt, it counts as a failed audit right away and counts against the audit score, which will eventually disqualify the node if it drops too low.
If it times out or returns some other error, it is retried 3 more times and then counts as a failed audit against the suspension score, which will eventually suspend the node if it drops too low. Your node can recover from suspension if the issue is solved.
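To make the distinction concrete, here is a toy model of the two cases as described in this post. It is not the satellite's actual code: the `Outcome` names, `MAX_ATTEMPTS`, and `classify_audit` are all invented for illustration, and the real implementation may differ in details.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    MISSING = "missing"          # node answers "I don't have the file"
    CORRUPT = "corrupt"          # node returns data that fails verification
    TIMEOUT = "timeout"
    OTHER_ERROR = "other_error"


MAX_ATTEMPTS = 4  # assumption: first try plus the 3 retries described above


def classify_audit(attempts: list[Outcome]) -> str:
    """Return which score a sequence of attempt outcomes would count against."""
    for n, outcome in enumerate(attempts[:MAX_ATTEMPTS], start=1):
        if outcome is Outcome.OK:
            return "passed"
        if outcome in (Outcome.MISSING, Outcome.CORRUPT):
            return "failed: audit score"       # counts immediately, no retry
        if n == MAX_ATTEMPTS:
            return "failed: suspension score"  # timeouts/errors on every attempt
    return "pending retry"


print(classify_audit([Outcome.CORRUPT]))
print(classify_audit([Outcome.TIMEOUT] * 4))
print(classify_audit([Outcome.TIMEOUT, Outcome.OK]))
```

The point of the model is that a missing or corrupt piece never reaches the retry path, while a timeout only counts once every attempt has failed.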
Exactly. But not from the first timeout, only after four attempts.
This is why I always say that suspension in the current implementation can’t be used instead of containment or disqualification. It’s a third option.
I am absolutely 100% sure you’re misinterpreting that. He was simply saying that what was claimed isn’t true. I’m pretty sure neither of you is speaking your native language (neither am I). It’s best to assume the best intentions in those scenarios, especially here, where people’s intention is merely to help each other out.