So I got a new Seagate 8 TB HDD from the Exos line.
Set it up and copied the node data over, which took 3 days. Luckily everything went OK at that stage.
About 3 days later everything was still fine, so I decided to remove the old HDD. In the process I shut down the machine and took it out. Once the system booted back up, the node wouldn't start. Then I realized the Storj volume wasn't showing.
After opening diskpart I tried to initialize the HDD. (This isn't a complete disaster just yet, because under normal circumstances you can fully recover the entire drive: initialize it, recreate the partition, which then shows up as RAW, and convert it back to NTFS. I've done that in the past and it worked like a charm.)
This time there was a big issue: a cyclic redundancy check error. After searching the internet, all the solutions I found were for HDDs that are initialized, none for uninitialized ones, so after 3 hours I'm giving up.
The questions are:
Should I run the old HDD even though the node was running for 3 days on the new one? I think it gained a few GB over that time, so I would most likely drop some audit score; my guess is 1-2%.
Should I keep trying to recover the data? Any recommendations are more than welcome.
Anyway, I'm going to RMA the HDD because it has less than 150 hours of total run time. But it would be nice to recover the data first.
I would try another cable, and maybe put the old disk back and try to spin the system up in its former state if possible… sometimes weird stuff happens, for whatever reason.
I doubt the old node copy is viable, but maybe…
The real question is why the drive would just die… but then again, new disks have a higher chance of just dying than worn-in disks; some companies even run disks for a period in less critical tasks to wear them in.
I would assume the disk is fine. Is it possible you switched out the OS and had some sort of encryption running? Is the disk's SMART really giving CRC errors? That would indicate something is fundamentally wrong with the drive. There are self-tests that can be run through SMART which should be able to tell you whether it's viable.
The best advice I can give is not to rush it; deleting or damaging the data is easy to do,
and you do have a couple of weeks before suspension and about a month to figure this out before the node gets DQ'd for inactivity.
There are also data recovery programs, such as those made by EaseUS I think they're called, which can often recover data from a damaged drive; of course that means you need a place to put the data.
I would certainly try to make the new disk work… but if all else fails, nothing is lost by trying the old copy… just remember that can get the identity DQ'd, which would make the data on the new drive worthless as well.
This advice comes too late now, but I always recommend running an extended SMART test on a new HDD (a rough sketch of what that can look like is below). It's no guarantee that the drive won't fail in the first few days/weeks, but it catches quite a few issues before they can cause damage to your data.
For now I agree with @SGC: take your time and try to recover first. Running the old copy of the node might work, but that should be a last resort. Better to have some downtime while you recover the full data than to lose the node entirely because you're failing too many audits.
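For reference, here's roughly what that can look like with smartmontools. This is just a sketch: it assumes smartctl is installed and on the PATH, and the device path is only an example (it differs per system, especially on Windows).

```python
import subprocess

# Sketch only: start an extended (long) SMART self-test and read the results
# with smartmontools. Assumes smartctl is installed; "/dev/sda" is just an
# example device path and will differ per system.
DEVICE = "/dev/sda"

# Kick off the extended self-test; it runs on the drive itself and can take
# many hours on an 8 TB disk.
subprocess.run(["smartctl", "-t", "long", DEVICE])

# Once it has finished: overall health verdict, then the full attribute table
# and the self-test log.
subprocess.run(["smartctl", "-H", DEVICE])
subprocess.run(["smartctl", "-a", DEVICE])
```

smartmontools also has a Windows build, and most GUI SMART tools show the same underlying data.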
Ah, I think it is very bad.
I tried to read the SMART data. I know how it normally looks, but this time it wouldn't load. It says the status is OK but doesn't read any other data.
I'm not an expert in this, so if you have a trusted tool, let me know.
I tried, and even called up a friend who does this at his company. In the process the HDD stopped responding completely, so no more chances; it's an RMA from here on out.
As a result I have started the node with the 3-day-old data.
As expected, the audit score went down, but it seems to average 96.61%, with the lowest satellite at 92.8%. All satellites seem to be improving though, so fingers crossed and lots of luck.
The best way to help it survive is to allow it space to grow if possible… as the ratio of lost data will shrink against the new data… of course this greatly depends on how large the node is…
but still, we are seeing 15 to 20 GB of ingress a day… so let's say maybe 500 GB in a month…
if it's fully vetted, which of course it would be given how much data it already holds.
So at 500 GB a month, if it's say a 4 TB node… in 8 months you will have doubled the data on it and thus halved the chance of it dying to random audits (rough math sketched below).
Or until we get a new audit structure, one that won't get a node DQ'd just through random bad luck.
But hey, at least the current one is a bit more relaxed about when it will DQ… though it is rather random.
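To put rough numbers on that, here's a small sketch. The 4 TB node size and 500 GB/month of ingress are just the illustrative figures from above, not measurements.

```python
# Rough projection: how the share of lost data shrinks as the node keeps growing.
node_tb = 4.0               # assumed current node size (TB)
ingress_tb_per_month = 0.5  # ~15-20 GB/day of ingress, as estimated above

for month in (0, 2, 4, 6, 8, 12):
    size = node_tb + month * ingress_tb_per_month
    # whatever fraction of pieces is missing today scales down by node_tb/size
    print(f"month {month:2d}: {size:4.1f} TB stored, "
          f"lost-data ratio down to {node_tb / size:.0%} of today's")
```

At month 8 the node has doubled to 8 TB, so the lost-data ratio, and with it the chance of a random audit hitting a missing piece, is down to half of what it is today.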
Yes, I fully agree, a better audit strategy would be appreciated.
As for the node, it's about 1.9 TB in size, currently limited by the HDD it's on. I think over the 3 days it grew about 40 GB, because initially on the first day I didn't allow it to grow, since I hadn't changed the config.yaml file,
but the other 2 days it was full on.
I'm hoping the lowest it will go is 80%.
This is a question for everyone:
when the audit scores drop, is the node less likely to receive data?
Do you have any idea what percentage of data you might be missing? If not, how much data is stored on the node and how long was the node receiving data that is now lost?
Most likely your node will survive long term with the current implementation, but you'd better get used to those scores jumping all over the place, as there currently isn't any consistency in those numbers.
But currently you need to lose more than 10% to run a realistic risk of disqualification (yeah, that’s not great).
I've suggested changes before that seemed to get some traction, and as a goal we seemed to agree that in the future around 2% would be a better maximum to allow. But you may not even get to that point.
Sorry, I didn't fully understand this bit, could you elaborate?
Well, it should be simple math: the storage node most likely ingressed around 35-55 GB during those days, and its size is about 1.9 TB, so the percentage lost is at worst around 3%.
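As a quick sanity check on that math (the 35-55 GB ingress and the 1.9 TB node size are the figures from this thread):

```python
# Back-of-the-envelope check: what fraction of the node's data is missing.
node_gb = 1900                 # node size, ~1.9 TB
for lost_gb in (35, 40, 55):   # estimated ingress during the 3 lost days
    print(f"{lost_gb} GB lost of {node_gb} GB -> {lost_gb / node_gb:.1%} missing")
```

That lands between roughly 1.8% and 2.9% missing, so "at worst around 3%" checks out.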
Check the link in my post. I think the info there might even be too elaborate. But I did extensive simulations a while back to determine what loss percentage would be survivable in order to recommend changes to the formula. With the current audit mechanism, 10% loss is basically survivable.
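For anyone curious what such a simulation can look like, here's a minimal sketch. It uses a beta-reputation style score (alpha/beta updated on each audit), but the lambda, weight, DQ threshold and audit count below are illustrative assumptions, not the exact values the satellites use.

```python
import random

LAMBDA = 0.95        # forgetting factor (assumed, for illustration)
WEIGHT = 1.0         # weight of each new audit (assumed)
DQ_THRESHOLD = 0.6   # disqualify if the score drops below this (assumed)

def survives(loss_fraction, audits=10000, seed=None):
    """True if a node missing `loss_fraction` of its pieces survives `audits`
    random audits without its score ever dropping below the DQ threshold."""
    rng = random.Random(seed)
    alpha, beta = 1.0, 0.0
    for _ in range(audits):
        if rng.random() >= loss_fraction:   # audited piece is still there
            alpha, beta = LAMBDA * alpha + WEIGHT, LAMBDA * beta
        else:                               # audited piece is missing
            alpha, beta = LAMBDA * alpha, LAMBDA * beta + WEIGHT
        if alpha / (alpha + beta) < DQ_THRESHOLD:
            return False
    return True

for loss in (0.02, 0.03, 0.05, 0.10):
    wins = sum(survives(loss, seed=i) for i in range(200))
    print(f"{loss:.0%} of data lost -> survived {wins}/200 simulated runs")
```

With parameters in that ballpark, even a node missing around 10% of its data only rarely strings together enough failed audits to cross the threshold, which lines up with the 10% figure mentioned above.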
Yeah, that sounds about right, probably a little less. Just keep it online, your node will likely be fine. It may get tricky if changes get implemented along the lines of my suggestion, but hopefully by that time you’ve gained some more data on that node and the missing data percentage will have dropped.
Nothing, other than updating your node score. Audit isn’t meant to determine health of individual pieces, just the health of the node as a whole. There is plenty of redundancy to deal with any small amount of missing pieces.