Machine dead, replaced, nodes offline for 3 days... graceful exit or keep up?

Floxit · June 17, 2020, 1:18am

Hello,

The device that ran my nodes crashed, and it looks like it won’t be recoverable for a long time (or ever). Something terrible seems to have happened to the good device I used for Storj nodes, network monitoring and personal use (streaming, gaming, …). After 5 years of 24/7 service, the device suddenly shut down and rebooted with no blue screen, then looped through the Windows 10 upgrades, crashing each time it was about to show the session. Finally, after a power off and power on, it stopped showing the BIOS (power LED flashing and restarting continuously), and the next day there was no electrical activity at all. My guess is the motherboard or the CPU, because temperatures were getting worse, but I’m not sure. I contacted my small German manufacturer to see if he could still diagnose it even though it’s out of warranty. But that’s it. I was prepared for SSD or power failure issues with a UPS, but not this kind of thing, which is new to me.

So, obviously, the nodes were offline for 3 days before I could reconnect and reinstall them on other mini PCs. The other problem is that these machines also have stability problems, so I don’t really trust them for node uptime (Docker already crashed the node with something like a hardware issue, “bad response”). I’m now looking at dedicated Odroid HC2 boards, which I trust more, so that I can stay on a Windows 10 environment and connect easily to my drives, where I also keep my own data.

It hit my morale hard: it happened just as I was preparing to order two more hard drives. Managing the machine and accessing the drives locally were pretty easy, but this machine is now too expensive to replace only for nodes. My other mini PCs are less stable but more powerful; I already dedicated them to computing tasks that bypass the power limit, so they often run hot and sometimes crash, and they were not meant for something that needs high uptime like the nodes.

Well, I know some hidden values are already tracking offline time ahead of the new disqualification model, but at this time offline disqualification is not implemented, and I don’t want to leave the network. But my oldest node has a held amount of $225. My two 12 TB disks are almost full. I have no idea how things will go once disqualification is implemented, but theoretically I can try to recover my uptime with these machines, where I stopped all other computing activity to be sure they stay “cool” and run without interruption, and I’m thinking of ordering two Odroid H2. Should I continue to run the nodes and ignore my 3 days of downtime for now (they were online for months before that), or should I do a graceful exit? Will Storj warn us before they implement offline disqualification, to avoid a brutal disqualification in the next update?

Well, again, I feel really bad about this, because I wanted to be a good SNO. I feel quite unlucky with all my current systems, and it makes me think of the people who wanted a “maintenance” request, to be able to replace devices when this kind of thing happens. But I read that offline disqualification will now behave differently, and a dead CPU or motherboard is generally the last, ultimate crash that happens to an SNO. I know the network can recover without me, but since I have been part of the community for quite a long time now, I apologize for this accident, and I’m hoping for sunnier days with Odroid.

nerdatwork · June 17, 2020, 2:30am

If your HDD(s) don’t smell like bacon then try connecting them to your

You already know about downtime DQ but this will tell you if your HDD(s) are alive or not. I would recommend not stressing these mini pcs for a while till you get your node back up.

Yes, you must do that.

You already are and I can prove it. This is a bad thing that happened to you, and as the phrase goes, “Bad things happen to good people”, which proves you are a good SNO.

These are unlucky times but you got an awesome community to help you get through this.

You have a few dozen unsolicited coming your way.

Do update this thread on how you got your HDD(s) connected back again. This is not the end, the happy part is when you get your node back up.

Floxit · June 17, 2020, 2:41am

Thank you for the kind words!

All the drives (hard drives and SSDs) are healthy, and they’re already running on the mini PCs, one each. These are actually cheap Chinese fanless aluminium “industrial” cases with an i7-8565U from the Topton company, bought on Alibaba last year. They’re nice for the price, but you can probably cook an egg on one, and it reaches its thermal limits faster than a cooking hob. They’re still very powerful and good value, so I like them, but they’re not great for the little nodes because of the random crashes when they heat up, unless you use the power limits in the BIOS, and that was not the purpose of these machines for me haha. Also, the HDDs are not powered by the PCs but by a standalone power supply in the HDD aluminium case. So at least I lost no data, and I’m quite relieved, because I forgot to back up some personal data (my music scores!). Lesson learned: I put that in the cloud.

By the way, Odroid will release the HC2 B+ revision next month, so it’s a bit like an invitation to me.

BrightSilence · June 17, 2020, 4:37am

That’s really annoying. Sorry to hear it. I just wanted to say that the issue you are describing could very well be a PSU issue. It’s actually very rare that a CPU or motherboard is the problem, but PSUs, especially the cheaper no-brand ones, flake out all the time. If you have a chance to test it, I definitely recommend trying a different PSU.

As for the nodes, just get them back online. I think you’ll be fine. Keep in mind it might take a few hours to get back to normal traffic as satellites take some time to notice your node is back online. But after that they should behave like normal. It would be a shame to do a graceful exit and lose all that data you’ve collected. When they do introduce downtime disqualification, it’s likely they will first introduce the downtime suspension and give nodes time to recover from that before they introduce disqualification. So it’s highly unlikely your nodes will be disqualified for this after the fact.

ACarneiro · June 17, 2020, 9:13am

I must say “cheap” and “Chinese” often go together and they don’t inspire a lot of confidence of quality or longevity.
Maybe consider getting something a bit more expensive which may last you longer.

Other than that, I’m sorry to hear of your ordeal but hopefully you’ve got all your data intact and this will just be a blip

Beddhist · June 17, 2020, 1:58pm

Interesting. Is the CPU on the back side of the main board? No heat sink…

Since you know it’s running hot you can run it with the cover off and put a fan on it. Perhaps there are some settings in the BIOS to underclock?

BrightSilence · June 17, 2020, 3:07pm

From the looks of the case it actually uses that as a heat sink. So better leave it on. But a fan on the case would definitely help.

Floxit · June 18, 2020, 12:19am

About the price: I managed to get four of them for USD 262.00 each (without RAM and disk, because I prefer to choose them myself). For such a mobile processor, it’s awesome value, even if you leave the aluminium case aside. Even though the processor is now a bit cheaper, it’s still an awesome price. And it actually works well if you use Intel’s standard 15 W TDP (the instability I sometimes see is because I removed the power limit so that only the thermal limit applies, but the aluminium case is not really designed for that). Topton explained to me that it’s stress-tested for 24 to 28h, so with the standard 15 W TDP. Since I stopped the computing tasks, the machine stays cool and now works pretty well. But I didn’t buy it for that, it’s a bit O.P. for nodes alone, and there is also the trust issue, so I’m pretty sure the Odroid HC2 with its quad-core low-power Celeron will be a beast (and Storj has paid me quite enough to think about them).

Yes, the CPU is stuck to the aluminium case, which works like a heavy heatsink. But the Chinese manufacturers actually save on cost by always using the same kind of aluminium case, the same kind of motherboard and the same BIOS (the good thing is that every BIOS option is open, so you can do anything, even break it, and then you can easily remove the CMOS battery when that happens hahaha). As a result it is not optimized, and it doesn’t use heat pipes. If I were able to make custom heat pipes myself, I’m sure it would spread the heat through the whole case far more effectively.

I actually use a USB fan in the room where it was installed, and crashes were reasonably rare (one or two a month, so potentially only a few lost tasks, which doesn’t affect the cloud or reputation in that network), but not in my bedroom, where the hard drives are installed. Silence was the reason I looked for a fanless product. The device actually works well if you don’t use it for intense computing workloads, so it is fine for something like the nodes, but I took the risk myself of unleashing the power by removing all the power limits in the BIOS and keeping only the thermal limits (like a classic desktop CPU), because it was not a problem if the system crashed sometimes.

So yes, don’t blame the devices or the manufacturer; it’s actually my own settings. If you use it with the standard 15 W TDP, and the 25 W TDP limit for burst periods, the device is able to cool itself enough. And that was the best value I could find so far, with decent quality.

But the devices actually work pretty nicely most of the time, because I use them for occasional computing, not permanent load (not mining): trading algorithms (if you know MetaTrader) for the MQL5 Cloud. So my idea was to build “computing nodes” with controlled power consumption. The i7-8565U (and the latest 10th generation 100xxu, which heats less), even though they are mobile parts, are actually pretty powerful when you’re able to absorb the heat. Later I could move the motherboard into a better aluminium case, potentially with better thermal paste too, but I didn’t find anything compatible as a reasonable replacement. I would enjoy putting them in a heavy aluminium bar alternating with copper, but… maybe when the warranty ends (they still give me 3 years of warranty, which is not bad).

I actually searched Alibaba for a long time for a device like this with the same hardware but a better aluminium case, but sadly I never found one, unless you buy the components separately. You’ll find better systems, but they’re way more expensive (more than double or triple the price here), and the cases are not especially better.

I discovered that this kind of case is sold by Alaska, which also sells better cases, but it’s quite hard to find these motherboards with this processor without going through an Intel NUC. So yes, the Intel NUC is great, but it’s way more expensive, and it has a fan.

In the past, I wanted to build my own entirely fanless gaming machine, but after years, the kickstarted project sadly failed because of logistics problems. They had also included a spot to mount quiet fans directly (which could run only when the computer is under heavy load).

Floxit · June 18, 2020, 12:46am

The machine actually works well if you don’t change the original settings, with the 15 W TDP, which is the power limit for the U-series, and you can also try the 25 W TDP-up, which could be okay. But I removed it and ran some tests using only the classic thermal limit, and it was “okay” for the tasks I was running, with a small fan at the back of the case and some occasional crashes, probably due to overheating haha. There is actually not much room to put a fan inside, and the CPU is on the other side; it’s designed to be fanless, and silence was also the reason I wanted them.

Floxit · June 18, 2020, 12:59am

I already gave the answer in my reply to Bright, but I wanted to clarify: the Topton product is actually really decent and stable if you use the standard TDP. I just didn’t dedicate these devices to nodes, and the rare crashes (one or two a month in 24/7 usage) were not affecting me. And even though the price was cheap, it’s actually pretty good value.

I actually paid way more for the German mini PC I used for 5 years, but when you look inside, you often find the same components, originally made in China (though the Cirrus7 fanless case is made of several aluminium frames and heat pipes, so it performs better). But I also delidded the CPU one year ago and took the risk of applying liquid metal, so it’s possible it’s also my fault, even if I doubt it would only show up after a year. The hardest thing to find in China is a supplier you can trust, but Topton stress-tests its mini PCs for 24/48h with the original TDP, and they work for businesses. It also comes with a 3-year warranty. So far, it’s pretty good value for “normal use” (not like me, because I actually diverted them into “mini distributed-computing rigs”).

I don’t want to blame the Chinese manufacturer, and I’d suggest them to anyone, because I’m pretty sure they can actually be used for Storj nodes. Thanks to them, I was able to quadruple my resource capacity for the computing cloud project, even if I’m still far from reaching ROI. Time will tell how long they last, but of course I don’t “overclock”; the thermal limit and Windows itself do the job of keeping the temperature in a safe range, though I could still make some tweaks to limit the power slightly and avoid some crashes.

As a bonus, some pictures from the manufacturer that I asked for. I appreciated the small virtual visit!

To conclude, for the many products I have bought over the last few years to make a lot of projects real, I want to say thanks to China (and also Korea for Odroid haha).