HDD Failing - Couldn't Migrate Data. Suggestions?

I had an issue earlier where the HDD was at 100% active time all the time. Many people helped me understand a lot of things.

But as I feared, my main node's HDD is failing. I'm seeing lots of I/O errors in the Event Viewer, and the HDD is utterly slow. I figured I have to migrate to a new HDD. Unfortunately, the copying process got stuck midway and sat idle for a couple of days. Without the full data being transferred, I cannot replace this disk with a new one.

Node info

This is my main node
Over 600 GB filled [of 14 TB]
Node is about 3 months old [approx.]
Copying got stuck somewhere between 300-400 GB
And the disk is at 100% active time

What are my options? Without a full copy I cannot start the node from the new disk, right? I'm pretty sure the disk is failing. Since it is my main node, I cannot afford to lose it. Any help would be greatly appreciated. Thanks.

PS: The node is running in the background. I don't know how long it is safe to stop the node while copying.

You have several options. The best one is to use cloning software like dd on Linux, or GPart.
The simplest and most dangerous:

  1. Disable the scan on startup and reduce the allocation to well below the usage in the config (you may also need to reduce the monitored minimum, if your usage is less than 500 GB). Make sure you do not have duplicated options:
storage.allocated-disk-space: 100GB
storage2.monitor.minimum-disk-space: 0B
storage2.piece-scan-on-startup: false
  2. Save the config and restart the node via the Services applet or as an admin in PowerShell:
Restart-Service storagenode
  3. Continue copying with a robocopy command (see How to migrate the Windows GUI node from one physical location to another? - Storj Docs).
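For reference, a minimal robocopy invocation along those lines might look like the sketch below; the drive letters and folder layout are assumptions, so adjust them to your own setup (the linked docs have the exact command):

```shell
# Sketch: mirror the storage folder from the failing disk (D:) to the new
# one (E:) from an elevated PowerShell. /MIR mirrors the whole tree;
# /R:1 /W:1 keep retries short so unreadable files don't stall for hours.
robocopy "D:\Storj\Storage" "E:\Storj\Storage" /MIR /R:1 /W:1
```

Re-running the same command after an interruption will resume and only copy what still differs.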

If I reduce the allocated disk space, the rest of the data will be deleted, right? Then I might have less data to copy?

No, it will not be deleted, at least not immediately. The node will stop accepting new data; existing data is only deleted when the customers delete it. Reducing the allocation does not trigger any data deletion. Your dashboard will show the excess used space as "Overused", and it can stay that way for months or longer, until the customers delete that data.

When I change the config and restart the node, it will take time until the overused data is deleted. Does that mean I have to wait for that to happen before migrating to a new HDD? My main concern is the copying getting stuck midway.

No. You can migrate while your node doesn't accept new ingress.
Or do you mean that the destination disk doesn't have enough free space?

No, not if you use the migration guide.

Around 600 GB is occupied on the present disk [the one that is failing]. The new disk is 10 TB in total, so storage space isn't the issue. My question: after I change the config and restart the node, do I still have to copy the entire 600 GB to the new HDD? Since you are saying the overused data will take time to be deleted.

I did follow the migration docs. The mirroring command works great until it gets stuck midway. So less storage means a greater chance of copying fully. Or, if there is a way to reset the node and start afresh [with a new HDD], I shall do that.

I'm sorry, did I miss something? I was under the impression that I need the entire data from my storage node [the full HDD] to be copied to the new HDD [the replacement]. Is that true? Or do I just need the identity and the storagenode folder from the old HDD?

If the drive is actively failing, please stop using it, you are only damaging it further.

The only program that can reliably clone data out of a damaged disk is ddrescue. It slows down over damaged areas, automatically skips ahead, requests sectors backwards (i.e., so as not to hit a damaged sector through the disk's own read-ahead cache), disables caching and goes directly to the drive, and uses a whole lot of other tricks to get the most data it can out of a drive.

If you use dd, for example, it will hang the system (up to the disk subsystem timeout of the operating system) whenever a read error is encountered. This assumes, of course, that the drive does not time out on errors (most non-RAID-designed drives do not). The result would be that several sectors are reported as errors, since the drive is stuck trying to read a sector over and over again. ddrescue would skip ahead past that sector and later come back to it backwards to try to read it.
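A typical two-pass ddrescue run might look like the sketch below; /dev/sdX and /dev/sdY are placeholders for the failing and the new drive (verify them with lsblk before running, since this overwrites the destination):

```shell
# First pass: grab everything easily readable, skipping bad areas (-n),
# logging progress to rescue.map so the run can be resumed later.
sudo ddrescue -f -n /dev/sdX /dev/sdY rescue.map
# Second pass: go back for the bad areas with direct disc access (-d)
# and up to 3 retries per sector (-r3), reusing the same map file.
sudo ddrescue -f -d -r3 /dev/sdX /dev/sdY rescue.map
```

The map file is what lets ddrescue remember which regions succeeded, so interrupting and resuming is safe.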

You need to get everything off the old HDD onto the new HDD. The most important parts are the identity files, since that is, well, the identity of the node. If you lose a file or two due to corruption (i.e., damaged/unreadable sectors), that's workable and probably will not be an issue for the storagenode, but losing the identity files kills the node.


Of course. The data and the identity are the essentials of the node; either one is useless without the other. You must have both, otherwise it means disqualification, deleting everything (including the storagenode folder), and starting from scratch, as at the beginning: generate a new identity, sign it with a new authorization token, and start with clean storage.

You can also look inside the node's blobs folder and compare the sizes of the subfolders on both the new and the old HDD to see which folders have been completely copied. Each satellite has its own folder inside the blobs folder to store its data. For example, if 3 subfolders copied completely but not the 4th, you might be able to continue the node on 3 satellites, but you will be disqualified on the 4th satellite, the one with the missing data.
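A quick way to do that comparison from a shell is sketched below; the two blobs paths in the usage example are hypothetical, so point them at your actual storage folders:

```shell
# Print the size of each per-satellite subfolder in the old blobs folder
# next to the size of its counterpart in the new copy (or "missing").
compare_blobs() {
  old_blobs=$1
  new_blobs=$2
  for d in "$old_blobs"/*/; do
    name=$(basename "$d")
    old_kb=$(du -sk "$d" | cut -f1)
    if [ -d "$new_blobs/$name" ]; then
      new_kb="$(du -sk "$new_blobs/$name" | cut -f1)K"
    else
      new_kb=missing
    fi
    echo "$name old=${old_kb}K new=${new_kb}"
  done
}

# Example (hypothetical mount points):
# compare_blobs /mnt/oldnode/storage/blobs /mnt/newnode/storage/blobs
```

Matching sizes are only a rough signal, but a subfolder that is much smaller (or missing) on the new disk clearly did not finish copying.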

Perhaps it wouldn't be so useful, because data can be removed from the original location during the migration if the node was online.
The more reliable way is to run rsync (Linux), robocopy (Windows), or rclone in a dry-run mode; any of these commands would compare checksums, sizes, and dates between the source and the destination and would print what still needs to be copied.
However, it's a very intensive operation, and for a dying disk it could be the last nail in its coffin.
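As a sketch of that dry-run idea on Linux (the paths are hypothetical, and the trailing slashes matter to rsync):

```shell
# -a archive mode, -n dry run (nothing is copied), -i itemize differences,
# -c compare by checksum (thorough, but heavy reads on a dying disk).
rsync -anic /mnt/oldnode/storage/ /mnt/newnode/storage/
```

Dropping -c makes it compare only sizes and timestamps, which is far gentler on a failing drive.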

The best solution is suggested by @Mitsos :

Following the suggestion, I stopped the node. I made sure both the dying and the new HDD are on USB 3.0 ports, and now I'm robocopy-ing the data [around 650 GB]. Hopefully it will finish in less than 32 hours; otherwise the node will most probably be suspended, and then I will create a new one. Disk active time is steady at 81% so far. If it stays that way, there's a high chance the copy will finish. Now, waiting for the result.


After about 40 hours, the copying was still going, and it got stuck a few times. Sometimes it got stuck while copying a particular file, and that went on for hours; so I deleted that file and the copying continued, skipping it. [I knew I shouldn't have done that.] Then I started the node again with the new HDD. It works, but the online score was 57% for US1. The online score slowly increased above 70% over 3 days. Now the node has been disqualified! I can't say I expected this.

Now, my question: the node is still reading and writing data. Does this mean it is disqualified only on a particular satellite? Should I keep this node or create a new one?

A pity, you lost your node.

Sorry to say, but that would mean your node was offline for days within the last thirty days. And the online score can only increase by at most about 3% a day.

You probably haven't been disqualified because of online time, because you can be offline for about 30 days before that happens. But below a 60% online score you get suspended. So, are you really disqualified, or just suspended?


Why? Suspension is a reversible thing. Keeping the node online will automatically unsuspend it within a few weeks.


If so, you will see a message on the dashboard. If you do not see a message about disqualification, then the node is working; it just gets no ingress, as @JWvdV said, until the online score is greater than 60%. You need to keep your node online for the next 30 days to fully recover the online score.

On all satellites or only on some? Or do you mean "suspended"? Because the latter is reversible, unlike a disqualification.

Your node has been disqualified on “node id”

By that image, I think it got disqualified on the US satellite. So the node can still run?

Yes, it can. But you can remove the US1 data using the --force flag and this procedure to remove it from your node and the dashboard:

Just provide the satellite's NodeID, the --force flag, and all other required parameters for that command.
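As a sketch (treat the exact syntax as an assumption and check the linked procedure; the satellite NodeID is left as a placeholder, and the config dir shown is the default Windows GUI location):

```shell
# Run from the storagenode binary's folder in an elevated PowerShell.
.\storagenode.exe forget-satellite <satellite-NodeID> --force --config-dir "C:\Program Files\Storj\Storage Node\"
```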

Will the earnings be the same after removing the satellite's data?

Hard to say; we do not know, it depends on your node.
