Boot drive died; full system rebuild

jmetaj · May 1, 2020, 6:23pm

My system’s boot drive (SSD) completely died last night. Everything was erroring or shutting down due to “no space on device.” This morning I had to rebuild Ubuntu (now 20.04) onto another drive. Fortunately, all Storj data was on a separate HDD, so I started the docker container and it seems to be recovering well.

Do I need to do anything in terms of reporting this situation? The Storj node was offline ~18 hours, so its availability will likely be impacted. I believe I should keep the 6TB of data and let the satellites determine what’s useful. Is that right? How long must a node be offline for it to make sense to blow away the data and start over?

Floxit · May 1, 2020, 7:23pm

R.I.P. Just out of curiosity, which SSD was it, and how long did it last? If you buy another SSD, I strongly recommend the Samsung Pro series (or any enterprise-grade drive, which I guess will be more reliable). They come with at least a 5-year warranty and last much longer (my 850 Pro, used as a system drive, is now close to 5 years old and SMART still reports 85% of its expected life remaining).

About your node: the max downtime before disqualification will be 5 hours a month, but currently it’s still not implemented, so hopefully you can recover without penalty.
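To put those numbers in perspective, here is a rough sketch of the arithmetic. It assumes a 30-day month, which is my assumption and not an official Storj definition, and the name `uptime_percent` is made up for the example.

```python
# Assumption: a "month" is 30 days; this is not an official Storj definition.
HOURS_PER_MONTH = 30 * 24  # 720 hours


def uptime_percent(offline_hours, period_hours=HOURS_PER_MONTH):
    """Return uptime over the period as a percentage."""
    return 100.0 * (1 - offline_hours / period_hours)


# The planned 5-hour allowance vs. the ~18 hours from the original post.
print(round(uptime_percent(5), 2))   # 99.31
print(round(uptime_percent(18), 2))  # 97.5
```

So an 18-hour outage would be well past a 5-hour monthly allowance if that limit were enforced.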

During the alpha, a lot of nodes ran into trouble, so Storj is aware of this kind of problem, and the network is still in development. Unless your node was pretty new and you don’t care about it, if you had good uptime previously, you should keep your node.

SGC · May 1, 2020, 8:23pm

TL;DR @jmetaj don’t worry, you are fine… just fix it… and make sure there is no corrupt data

i think over 5 hours of downtime means suspension, but like you said @Floxit … it’s either not implemented, not working, or the limit is set significantly higher… i have also seen it written as 6 hours somewhere…
however, even if you end up suspended, you will just have to endure a 6 week punishment (free downloads), and you can be offline for a good while… which also makes sense… imagine if the storj dataset got corrupted one day… they would ofc be very interested in getting the data back, so the data has, or can have, great value to them… if you can restore your node, you should be able to continue to run it… even if it will take some time for the network to start trusting it again…

and ofc that trust requires your data to be okay on a 99% or better basis… dunno what the exact metric is for that… but generally corrupt data in a distributed data storage system is often worse than no data at all… so i doubt that anything less than 99% will ever do… maybe even 99.9% or better data integrity.

offline time… tho not recommended, will happen, and temporary issues should not disqualify a well performing node.

in enterprise it has been a common standard for decades to run the OS on mirrored drives or similar, first because it gives you great read and io performance, and second because the data basically becomes nearly indestructible if a failed drive is replaced within reasonable time.

it’s an easy solution… but really an SSD shouldn’t just up and die, no matter how bad it is…
maybe if you abused it to death… i just cannot imagine how that could happen aside from overheating… they really don’t like overheating or holding too much data… it needs room to breathe

most ssd’s are built like storage arrays inside and should have the ability to disconnect badly performing parts of the array and continue without them… or simply isolate single blocks that act up…

so either you abused the hell out of it… which you would most likely know… i had an ssd i cooked at like 80 degrees Celsius, and tho it did create a few bad sectors / blocks or whatever exactly it is xD it worked fine when i put it into something else… i had put it in a crazy hot laptop because i figured it was a disk cooling issue… it turned out to be a laptop-heating-the-disk issue …

alas i digress, your other option is simply that it was junk when you bought it, or simply an unlucky draw of the silicon lottery, but my money is on the former… cheap drive, and you get what you pay for…

Samsung Evo Pro 970 1TB is the top of the line… but it’s also a pretty costly drive depending on your use case… at the very least… try to get a proper nvme based drive, if you have that option within reach… and look at the latency, because you will most likely never be limited by its bandwidth… and bandwidth can be doubled, just like io… by adding more devices… latency you are stuck with… but maybe that’s just me… xD

maybe i should get around to moving my own OS onto multiple drives… xD

another good option can be to do replication onto a slower spare drive… that way even if the system crashes, you have a slightly older copy of what you were running, and you end up running slower instead of crashing hard… it’s the cheap option, which most people can get up and running from whatever they can scavenge.
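a minimal sketch of that idea, assuming plain file-level replication is enough… the function name `replicate` and both paths are made up for illustration, and this only copies files, it does not produce a bootable clone of an OS drive:

```python
import shutil
from pathlib import Path


def replicate(source: Path, mirror: Path) -> int:
    """Copy files that are missing or older in the mirror; return how many were copied."""
    copied = 0
    for src in source.rglob("*"):
        if not src.is_file():
            continue  # directories are created on demand below
        dst = mirror / src.relative_to(source)
        # Skip files whose mirrored copy is at least as new as the source.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied += 1
    return copied


if __name__ == "__main__":
    # Hypothetical paths; point these at your own source and spare drive.
    print(replicate(Path("/path/to/source"), Path("/path/to/spare")))
```

run it from a scheduled job and the spare drive stays a slightly older copy of the source… files deleted from the source are not removed from the mirror in this sketch.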

good luck with it.