Raid 5 failed. 3TB node dead. Recovered partially and going through GE thanks to forum

fikros · May 18, 2020, 6:02pm

Not much to say here: I had a drive fail during the pandemic lockdown, but the RAID, although degraded, kept working.

Today I got around to replacing the bad drive. During the rebuild, another drive failed.

RIP 3TB of data. RIP escrow.

I had 100% reputation on all satellites.

It has been a good run!

kevink · May 18, 2020, 6:14pm

Sorry to hear that. Why not start over with the remaining drives?

Odmin · May 18, 2020, 6:19pm

Don’t panic, let’s fight to survive!

Try turning off your machine and disconnecting the drive that failed last. Attach it to another machine and run a scan with repair using Hard Disk Sentinel or a similar program. Then connect it back to the machine with the RAID5 and try to bring your RAID online. Is it mdraid?

fikros · May 18, 2020, 6:31pm

Oh boy. I actually recovered the RAID (I don’t know for how long). The second failed drive managed to spin up, and I got access to the data. Unfortunately, most of the .db files are corrupted. I don’t know how much of the data was affected. Most certainly I could not do a graceful exit. My node was online for about 1 year, and the held escrow is about $200.

I have already spent most of my day trying to recover the dbs, but to no avail.

I might have to call it quits and remake each HDD into its own node, in order to minimise the risk, if I’m going to go any further.

Odmin · May 18, 2020, 6:34pm

This “boy” has another proposition for you.
If you already have your RAID5 up and running but can’t recover your DBs… just delete your DBs (it’s not a joke) and try to run your node.

kevink · May 18, 2020, 6:35pm

AFAIK the DBs aren’t even important. Only the data is. I think you should still be able to do a graceful exit if you only have the files and recreate empty db files.

fikros · May 18, 2020, 6:37pm

I tried that, but the node started with 3TB free space :(. It seems that deleting the dbs resets the storage, even when the storage is full…

SGC · May 18, 2020, 6:38pm

yeah the db can be rebuilt, it’s about getting your data out… however you might have 50% corrupt data, depending on how bad it is… i believe most cases of lost data in raid5 will result in 50% corruption, though it might depend on the number of drives… i forget… but so long as you can read the data and try to restore it then you should be pretty okay

you should consider going to ZFS if you don’t mind linux and lots of new troubles… but it is remarkably good at keeping the data good…

SGC starts scrubbing his tank… for the 6th time since friday… xD

Odmin · May 18, 2020, 6:40pm

That’s normal. Just keep it running for at least a few hours and the used space will be recalculated.

fikros · May 18, 2020, 6:41pm

I will give it a go. No huge expectations, but a genuine thank you for your help!

SGC · May 18, 2020, 6:42pm

fight the corruption xD if you get it running and your drives are really old, you might be able to make it into a raid6 by adding an extra one… might be worthwhile as the node grows
ofc that becomes a larger cost benefit calculation, but long term its never the worst choice… lol

Odmin · May 18, 2020, 6:44pm

You are welcome!
If the node works fine, you will at least have a chance to do a graceful exit and then recreate this RAID.

SGC · May 18, 2020, 6:47pm

but why would one want to do a graceful exit if he can get a few more months in… that is worth more than the graceful exit would be…

besides, graceful exit requirements i think are pretty high compared to just crashing and burning a few months later… i don’t see the cost / benefit, aside from maybe if one didn’t have a choice and if it was over quick… which graceful exit isn’t either… in fact it seems one can barely leave because of a bug with the new satellites… heh

Storgeez · May 18, 2020, 6:47pm

Seagate? Congratulations on having a minimum of 20 characters to post.

fikros · May 18, 2020, 6:52pm

Ok. Just deleted all db files and restarted the docker container.

Some context: I am running a vCenter-virtualised RAID 5 (5 x 900GB drives), 3.6TB total.
The Storj machine is an Ubuntu with a 3.2TB hdd mapped.

The Docker image is configured with a 2.9TB limit.
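As a rough check on those numbers, here is a minimal sketch of the usual parity arithmetic, assuming equal-sized drives and ignoring filesystem and virtualisation overhead. The function name is hypothetical and only for illustration.

```python
def raid_usable_gb(drives: int, drive_gb: float, parity_drives: int) -> float:
    """Usable capacity when `parity_drives` drives' worth of space holds parity.

    parity_drives=1 models RAID 5 (survives one drive failure),
    parity_drives=2 models RAID 6 (survives two drive failures).
    """
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) * drive_gb


# The array described above: 5 x 900GB in RAID 5.
raid5_gb = raid_usable_gb(5, 900, parity_drives=1)
# The same array with one extra drive, converted to RAID 6.
raid6_gb = raid_usable_gb(6, 900, parity_drives=2)

print(f"RAID 5, 5 drives: {raid5_gb / 1000:.1f} TB usable")
print(f"RAID 6, 6 drives: {raid6_gb / 1000:.1f} TB usable")
```

With one parity drive, a second failure during a rebuild is enough to take the array down, which is what happened here.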

I deleted all .db and shm / wal files (including revocations), practically cleaning up the storage directory (except the blob, garbage, temp and trash dirs).

I’ll keep you posted.
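For anyone repeating this step, a minimal sketch of a more cautious variant: it moves the database files aside instead of deleting them, so they can be restored later. This is a swap for the deletion described above, not what was actually run. The function name is hypothetical, the file patterns assume the usual SQLite naming (`name.db`, `name.db-shm`, `name.db-wal`), and the node is assumed to be stopped first.

```python
import shutil
from pathlib import Path

# Assumed SQLite naming: main database plus its shared-memory and WAL files.
DB_PATTERNS = ("*.db", "*.db-shm", "*.db-wal")


def set_aside_db_files(storage_dir: Path, backup_dir: Path) -> list:
    """Move top-level database files out of storage_dir into backup_dir.

    Subdirectories (blobs, garbage, temp, trash) are never touched, because
    glob() is not recursive here and directories are skipped explicitly.
    Returns the names of the files that were moved.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in DB_PATTERNS:
        for path in sorted(storage_dir.glob(pattern)):
            if path.is_file():
                shutil.move(str(path), str(backup_dir / path.name))
                moved.append(path.name)
    return moved
```

Usage would look like `set_aside_db_files(Path("/mnt/storj/storage"), Path("/mnt/storj/db-backup"))`, where both paths are placeholders for your own layout.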

kevink · May 18, 2020, 6:53pm

If you have $200 in held amount, a few additional months won’t cover that. And if he doesn’t do a GE now, his node might get DQed later when too many pieces are missing.
So with a GE now, he at least knows for sure if he needs to start a new node right away.
And GE on stefan-benten would probably be enough, the other satellites don’t have much held amount.

fikros · May 18, 2020, 6:55pm

@kevink has it right. I don’t trust my disks anymore, and if I could push the data back into the system it would be great. Later, after I fix my hardware, I’ll try again.

SGC · May 18, 2020, 7:02pm

i might be wrong here but from what i remember looking at the earning calculator, the payout for a successful graceful exit was about equal to 2 or 3 months of node operation with 100% payout

so since his node is a year he is on full payout and on graceful exit he doesn’t get paid for downloads and he has much stricter demands on his data integrity and downtime allowance… and it will most likely take months to complete anyways…

so far as i can see there aren’t any existing reasons that make a graceful exit worth even attempting… CRASH AND BURN BABY

fikros · May 18, 2020, 7:11pm

Domain Name                        Node ID                                              Percent Complete  Successful  Completion Receipt
asia-east-1.tardigrade.io:7777     121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6  0.00%             N           N/A
saltlake.tardigrade.io:7777        1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE   0.00%             N           N/A
us-central-1.tardigrade.io:7777    12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S  0.00%             N           N/A
europe-west-1.tardigrade.io:7777   12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs  0.00%             N           N/A
europe-north-1.tardigrade.io:7777  12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB  0.00%             N           N/A
satellite.stefan-benten.de:7777    118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW   0.00%             N           N/A

GE started. Fingers crossed

kevink · May 18, 2020, 7:24pm

You can easily calculate that your assumption is completely wrong… 3TB stored is $4.50/month, plus maybe 10% egress (0.3TB) at $20/TB, which is $6, for a total of $10.50/month.

I think you can estimate how many months it would take to earn back more than the $200 held amount.
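To make that estimate concrete, here is a minimal sketch of the arithmetic. The rates are inferred from the figures in the post ($4.50 for 3TB stored implies $1.50 per TB per month, and egress is taken at $20 per TB). The 10% egress share is the poster's rough guess, and the function names are hypothetical.

```python
import math


def monthly_earnings(stored_tb: float, storage_rate: float,
                     egress_fraction: float, egress_rate: float) -> float:
    """Storage payout plus egress payout for one month, in dollars."""
    storage = stored_tb * storage_rate
    egress = stored_tb * egress_fraction * egress_rate
    return storage + egress


def months_to_earn(target: float, per_month: float) -> int:
    """Whole months needed before cumulative earnings reach `target`."""
    return math.ceil(target / per_month)


# Assumed inputs: 3TB stored, $1.50/TB storage, 10% egress at $20/TB.
per_month = monthly_earnings(3, 1.50, 0.10, 20)
months = months_to_earn(200, per_month)

print(f"Estimated earnings: ${per_month:.2f}/month")
print(f"Months to match a $200 held amount: {months}")
```

Under these assumptions it takes well over a year of continued operation to match the held amount, which is the point being made against the "2 or 3 months" estimate.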