Got a disaster on 03SEP (unrecoverable data loss)

OlivierGaland · September 5, 2022, 5:32am

Hello,

I’ve been running a Storj node on a QNAP since November 2021. It was running fine and held about 3 TB of data.

Unfortunately, on 3 September I had a disaster on my NAS and lost the complete filesystem holding the Storj data (basically a long power outage while I was sleeping; the NAS was shut down improperly because the battery ran out before auto-shutdown). After that the NAS was stuck in a boot loop. Once I finally got it into degraded mode and stopped this horrible loop, I noticed that my RAID volume was corrupt beyond repair … game over.

I’m still working on recovering what little data I can from the NAS. For Storj I have migrated to a computer with a single disk for now (and I may stick with this, possibly with a backup job, as I thought this kind of issue never happened on RAID: I was wrong).

Of course I had to recreate the node from scratch (keeping only my identity), and I’m afraid I will get lots of failed audits over the next weeks as I’ve lost 3 TB of data …

What is going to happen to me regarding suspension? Will I be definitively banned from the network? Is there any solution?

Best regards,
Olivier, France.

peter_linder · September 5, 2022, 6:05am

If you have lost all the data, then you need to restart with a new identity. The old identity will be disqualified because so much data was lost, so adding new data right now on the “same” node won’t work anyways.

I feel for you. I lost a node a few days ago because I resized a partition wrong by mistake. This time I lost my identity files because they were overwritten. 500G of node data was intact, but no identity file so I couldn’t start the node… sad.

It feels like I lost a good pair of gloves, I mean probably no big deal but still… hmmmf.

OlivierGaland · September 5, 2022, 5:17pm

Thanks for your feedback.

As expected, I got disqualified a few hours later. I created a new identity to restart a node from scratch.

I used the same email address; I hope this is not a problem. By the way, I’m also wondering whether running several nodes instead of one big one is a better approach in case this disaster happens again.

Many thanks, Olivier

peter_linder · September 5, 2022, 6:34pm

Several nodes would work better if they can fail independently of each other, i.e. if they are installed on different HDDs.

This will also provide a little bit of protection from the usual big failure reason, the sysadmin.
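
One way to check the "different HDDs" condition is to compare the device ID behind each node's storage directory. This is only a minimal sketch, assuming one storage directory per node; the function name and the paths in the usage note are hypothetical, and none of it comes from the Storj tooling. Device IDs distinguish filesystems rather than physical disks, so two partitions on the same drive would still have to be checked by hand.

```python
import os
from collections import defaultdict


def shared_devices(paths, device_of=lambda p: os.stat(p).st_dev):
    """Group storage paths by device ID and return the groups with 2+ paths.

    `device_of` is injectable so the grouping can be tried without real disks.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[device_of(path)].append(path)
    # Keep only devices that back two or more node directories.
    return {dev: sorted(ps) for dev, ps in groups.items() if len(ps) > 1}
```

Calling `shared_devices(["/mnt/node1", "/mnt/node2"])` (hypothetical mount points) returns an empty dict when each directory sits on its own filesystem; any non-empty result lists nodes that would go down together.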

dragonhogan · October 21, 2022, 3:24am

I second this sentiment, which is also the recommendation from Storj anyways…

I just lost my “node #3” out of the 6 that I had running. Node 3 was running on its own Raspberry Pi 4, with a dedicated 4TB WD Red CMR drive that had close to 4 years of power-on time and was repurposed from a Synology NAS. After noticing the other day that the node had gone offline, I pulled the drive because it wouldn’t mount per the fstab config. It also wouldn’t mount manually, either on that RPi or on another machine, so I ran fsck on the disk, and essentially every single file in the 1.5TB of used space was corrupted. Fsck threw every single file from the HDD into the “lost+found” folder. After spending a day reassigning access to that folder to user “pi” so I could view the folder structure (in the GUI), I realized that it hadn’t just moved the folders/files, it had reorganized them in such a way that I’m sure the node is unrecoverable. Real bummer too, since that node just hit its 2-year birthday this month.

Regardless, the silver lining of it all is that, since the data was on a dedicated disk, I only lost about 1.5TB of the 18.5TB I had stored…so I still have 17TB spread across the other 5 nodes.

I’m assuming that the issue is really with the drive, since I saw some errors when I had it connected to my Unraid server and ran a SMART test on it. Originally, I thought that I might be able to recover the data (including identity and DBs) to the Unraid array and then just restart the node from there, since I had recently tested that with Node #4 to get it off a separate dedicated RPi 4 and a really old, really slow 1TB external HDD. But at this point I think the drive is just not worth the effort. I might just throw it back into my Synology NAS, which I only use for backups now, and wait for it to fully die off.

Toyoo · October 21, 2022, 7:13am

I am sorry for your loss.

If the failure affected the directory structure but not the files themselves, the node might have been recoverable by reading the file headers. File headers store enough information to rebuild the directory structure. I wrote a prototype script to do so.

I understand that this information is useless to you at this point, but maybe it will enable recovery for someone in the future.
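
The idea can be sketched roughly as follows. This is not the prototype script mentioned above, and the header layout is not described in this thread, so the parsing step is left as a hypothetical callback (`path_from_header`) that the caller would have to supply; `HEADER_SIZE` is likewise an assumed placeholder. The sketch only plans the moves and leaves every file where fsck put it.

```python
import os

HEADER_SIZE = 512  # assumed placeholder; not taken from this thread


def plan_restore(lost_found, target_root, path_from_header):
    """Return (source, destination) pairs for files whose header can be parsed.

    `path_from_header(header_bytes)` is a hypothetical callback that returns a
    relative path, or None when the header is not recognised.
    """
    root = os.path.abspath(target_root)
    plan = []
    for dirpath, _dirs, files in os.walk(lost_found):
        for name in sorted(files):
            src = os.path.join(dirpath, name)
            with open(src, "rb") as f:
                header = f.read(HEADER_SIZE)
            rel = path_from_header(header)
            if rel is None:
                continue  # unrecognised file: leave it alone
            dest = os.path.abspath(os.path.join(root, rel))
            # Skip destinations that would land outside target_root.
            if os.path.commonpath([root, dest]) != root:
                continue
            plan.append((src, dest))
    return plan
```

Reviewing the returned plan before moving anything seems the safer choice here, since a wrong header parser could otherwise scatter the files a second time.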