Had an issue earlier like the HDD being active 100% all the time. Many people helped me understand a lot of things.
But like I feared, my main node HDD is failing. It received lots of I/O errors in the Event Viewer, and this HDD is utterly slow. I figured I have to migrate to a new HDD. Unfortunately, the copying process stuck midway and ran idle of couple of days. Without the full data being transferred I cannot replace this with a new HDD.
Node info
This is my Main node
Over 600GB filled [of 14 TB]
3 Months old Node [Approx]
Copying stuck somewhere between 300-400 GB
And the disk is at 100% active time
What are all my solutions? Without full copying I cannot start the node from the new disk, right? Im pretty sure the disk is failing. Since it is my main node, I cannot lose it. Any help would be greatly appreciated. Thanks
PS: Node is running in background. I dont know how long is safe to stop the node when copying.
You have several options. The best one is to use a clone software like Linux dd or GPart.
The simplest and more dangerous:
Disable scan on startup and reduce an allocation way below the usage in the config (perhaps you would also need to reduce the monitored amount too, if you have usage less than 500GB), make sure that you did not have duplicated options
No, it will not be deleted. At least immediately. It will not accept a new one, the existing one would be deleted if it’s deleted by the customers. The reduction of the allocation doesn’t invoke a data deletion anyway. So, your dashboard will show the excess used data as “Overused”. But it can stuck here for months or longer, until the customer would not delete it.
When I change the config, and restarts node, it will take time until the overused data is deleted. Does that mean, I have to wait till that happens to migrate to a new HDD? My main concern is the copying files stuck somewhere midway.
Around 600 GB is occupied in the present disk [that is failing]. New disk has 10 TB in total. So storage space isnt the issue. My question, after I change the config and restart the node, I still have copy the entire 600 GB storage to the new HDD? Since you are saying overused data will take time to delete.
I did follow the migrate docs. Mirroring cmd works great until it stuck midway. So, less storage means greater chance of copying fully. Or else if there is a way to reset the node and start afresh, I shall [With a new HDD]
Im Sorry, did I miss something? I was on the view that I need the entire data from my storage node [Full HDD] to be copied to the new HDD [The replacement]. Is that true? or do I just need the identity and storage node folder in the old HDD?
If the drive is actively failing, please stop using it, you are only damaging it further.
The only program that can reliably clone data out of a damaged disk is ddrescue. It slows down over damaged areas, auto skips ahead, requests sectors backwards (ie not to hit a damaged sector through the disk’s own read-ahead cache), disables caching and goes directly to the drive, and a whole lot of other tricks to get the most data it can out of a drive.
If you use dd for example, that will hang the system forever (=up to the disk drive subsystem timeout on the operating system) if a read error is encountered. This assumes ofc that the drive does not timeout on errors (most non-raid designed drives do not). The result would be that several sectors are shown as errors, since the drive is stuck trying to read a sector over and over again. ddrescue would skip ahead of that sector and start coming back to it backwards to try and read it.
You need to get everything out of the old hdd into the new hdd. The most important parts are the identity files, since that is well the identity of the node. If you lose a file or two due to corruption (ie damaged/unreadable sectors) that’s workable and probably will not be an issue to the storagenode, but losing the identity files kills the node.
Of course. The data and its identity the essential of the node, any of them is useless without other. You must have both, otherwise - the disqualification and deleting everything (include storagenode folder) and start from a scratch, as it was at the beginning: generate a new identity, sign it with a new authorization token and start with a clean storage.
You can also look inside the node’s blobs folder and compare the size of the sub folders on both the new and old HDD to see which folders have been completely copied. Each satellite has its own folder inside the blobs folder to store the data. For example, If 3 sub folders copied completely but not the 4th, you might be able to continue the node on 3 satellites, but will be disqualified on the 4th satellite that has the missing data.
Perhaps it wouldn’t be so useful, because data can be removed from the original location during migration, if the node was online.
The more reliable way is to run rsync (Linux) or robocopy (Windows) or rclone in a dry run mode, any of these commands would compare checksums, sizes and dates in the source and the destination and would print what it wants to be copied.
However, it’s a very intensive operation and for the dying disk it could be the last nail in its coffin.
Following the suggestion, I stopped the node. I make sure both the dying and new HDD are in USB 3.0 ports and now Im ropycopy ing the data [Around 650GB]. Hopefully it will finish in less than 32 hours. Else, most probably the node will be suspended, then I will create a new one. Disk active time is steady at 81% so far. If it maintains so, high chance I will finish copying. Now, waiting for the result.
After about 40 hours, copying was still on. And stuck in between a few times. Sometimes, it stuck while copying some files, and it goes on for hours. So, I deleted that particular file in between and the copying continues skipping that file. [I knew I shouldn’t have done that]. And then, I started the node again with the new HDD. It works, but online time is 57% for US1. The online time slowly increases above 70 in 3 days. Now, the node has been disqualified! I cant say, I expected this.
Now, my question is the node is till reading and writing data. Does this means, it is disqualified in a particular satellite? Should I keep this node or create a new one?
Sorry to say, but that would mean your node was offline for days thirty days ago. And you can only increase at maximum 3% a day.
You probably haven’t been disqualified because of online time. Because you can be offline for about 30 days before that happens. But below 60% online time you’re being suspended. So, are you really disqualified or just suspended?
If so, you will see a message on the dashboard. If you do not see a message about the disqualification, then the node is working. Just no ingress as @JWvdV said, until it would be greater than 60% of online score. You need to keep your node online for the next 30 days to fully recover the online score.
on all satellites or only on some? Or do you mean “suspended”? Because the last one is reversible, unlike a disqualification.