SNO Flight Manual

SGC · March 24, 2020, 4:24pm

Been searching for answers on various questions, thus far to little avail. i find many of the answers on the forum, but sadly even the most basic of details can take nearly an hour or more to find.

isn’t there any public resource for customizing nodes?

i find this when searching for a wiki

but seems unrelated to SNO

Show me the manual…

or this is the topic where we start making the Storage Node Operator’s Flight Manual

deathlessdd · March 24, 2020, 4:27pm

Explain how you want to customize your node? There’s not much you can change. You need to be in compliance.

BrightSilence · March 24, 2020, 5:04pm

Just making sure you’re aware of this page
https://documentation.storj.io/

If that’s not what you’re looking for, can you be more specific?

SGC · March 24, 2020, 7:53pm

well one current, in my mind very basic example is that i want to change my config.yaml
which i’m pretty sure i got figured out, but then i start pondering stuff, like if i’m supposed to shut down the node while editing it, or if i can edit it live, and whether the changes are then applied or a restart is required.

or if i want to set up logrotate on a monthly or weekly basis, i’m sure all the information is around… but everyone searching and discovering it on their own just seems like a great waste of time, when it could be slowly compiled into a manual.

from what i have seen in the storj.io documentation it’s mostly in relation to basic setup… which don’t get me wrong makes perfect sense, this is an entirely new program/ecosystem.

generally i want to keep my storagenode as close to default as possible, i’ve found over the years of working with computers that it’s really nice when stuff is really modular.

however i digress.

i wanted to change that in my config.yaml to do what brightsilence suggested in the link above. it seems straightforward and i suppose it is, but i read about people on the forums who changed the wrong config.yaml and broke their node, so i like to go through the details of what i’m doing, so i’m sure i don’t break something.

When i looked into the storj documentation they talked about running docker commands, i think it was to get into the container config.yaml, but that may have been some old issues… and i cannot seem to refind it even tho i was looking at it like yesterday.

but after searching on config.yaml i found:

6.1 Optional - How to manually edit the configuration parameters

which gave me the exact location i needed to change the config.yaml at
tho i could only find that when searching in
https://support.storj.io/hc/en-us

then i found, that i do need to shutdown my node before changing the config.yaml from

still have no clue why i should use the -t 300 command when shutting down the node…
did shut it down a few times without that, didn’t seem to matter much…

but i’m sure it will make more sense when i get around checking the docker documentation.
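
For what it's worth, the `-t 300` part is a Docker option rather than a Storj one: `docker stop -t 300` asks the container to stop and waits up to 300 seconds before killing it (the default is 10 seconds), so the node gets time to finish what it is doing. A minimal sketch of a careful edit routine follows. The `backup_config` helper and the config path are illustrative assumptions, not something taken from the Storj documentation.

```shell
#!/bin/bash
# Hypothetical helper: keep a timestamped copy of config.yaml before editing it.
# Prints the path of the backup on success.
backup_config() {
    local config="$1"
    local backup
    backup="${config}.$(date +%Y-%m-%d_%H%M%S).bak"
    cp -p -- "$config" "$backup" && printf '%s\n' "$backup"
}

# Typical use (the path is a placeholder; adjust to your own storage location):
#   docker stop -t 300 storagenode      # allow up to 300 s for a clean shutdown
#   backup_config /path/to/storage-dir/config.yaml
#   nano /path/to/storage-dir/config.yaml
#   docker start storagenode
```

Keeping the backup next to the original also covers the "replicate your configuration" worry: the previous version is one `cp` away.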

it just seems like all this very basic stuff should be in a very easy to approach manual.
i’m sure that when i get the hang of all this, it will be second nature… but for now trying to figure out how to do stuff seems arcane at best.

on the upside i did figure out the other day that i can run a fullscreen putty linux terminal with a storagenode docker live log feed, by using:
docker logs storagenode --tail 20 --follow

took me a google search to figure out, that ctrl+c was the only thing that could stop it and then another google search to find out that i had to ctrl + rclick to get out of the putty fullscreen…

I know that’s just because i’m very green in linux, but still… i think a lot of people would like a better manual for all kinds of stuff, storagenode log screensaver beats cmatrix anyday of the week imo.

maybe i’m just used to most stuff being very simply explained and executed, having been an advanced windows user for decades.

maybe open source just doesn’t translate well into simple documentation… i duno…

also nice to know that you can actually replicate your configuration, if everything crashes and burns… xD, instead of trying to remember what you did and why, or having to do a weeks or month long research project / build again.

I’m sure many people will join in an attempt to help create a great manual for SNOs

heunland · March 24, 2020, 8:22pm

Thank you for your input regarding documentation. I suggest you read the FAQ section of the documentation. For example

is answered here
and

is at least partially addressed here

I see you already found some of the questions answered in the FAQ, as well as discovered that we do have a help desk at https://support.storj.io where you can search for answers to your questions. We are constantly working to improve our documentation and are always happy to hear any suggestions to improve (you can post your comments on the KB articles directly.)

So in summary, I am not sure why you are suggesting that having the most frequently asked questions in the documentation, together with an extensive knowledge base in the helpdesk, is not up to par.

On the other hand, details of how to use third party software such as Putty are beyond the scope of Storj support. Each tool has its own help section on its website or in the software.

SGC · March 24, 2020, 8:45pm

Well i didn’t want to disrupt my ability to use docker logs commands, nor stop the various scripts people made from working by default.
so i plan to set up a cron job running this docker command with a timestamp.

docker logs storagenode >& /zPool/storj_etc/storj_logs/2020-03-24_storagenode.log

was kinda meaning to look at logrotate, but ended up with that, not sure what logrotate can do tho… lol
but should be fine for what i’m trying to do.
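
The date in the filename above is hard-coded, so a cron job running it as written would overwrite the same file every day. Here is a small sketch of the same dump with the date filled in at run time. The `dated_log_path` name is made up for illustration; the folder is the one from the command above.

```shell
#!/bin/bash
# Hypothetical helper: build a log path such as <dir>/2020-03-24_storagenode.log
# using today's date.
dated_log_path() {
    local dir="$1"
    printf '%s/%s_storagenode.log\n' "$dir" "$(date +%Y-%m-%d)"
}

# Example use from a script called by cron:
#   docker logs storagenode > "$(dated_log_path /zPool/storj_etc/storj_logs)" 2>&1
```

Note that `docker logs` without `--since` prints everything the container has logged so far, so each file would be a full dump rather than one day's worth.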

also i cannot forfeit the docker logs storagenode --follow command… that’s one of the best things i’ve discovered since moving to linux, aside from working hardware passthrough on the hypervisor… which was just a mindblowing difference compared to hyperv

Toyoo · March 25, 2020, 8:31pm

As a beginner SNO, but a seasoned Linux user, I do agree that there is very little documentation on how things work on a technical level. For example from my own experience the documentation should explicitly state that the storage node measures disk space by querying /app/config, or why it is not supported to have /app/config be placed on an SMB share.

The documentation could also present popular practices like setting up an uptime notification, the successrate.sh script, etc.
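
As a rough illustration of the uptime check idea, here is a sketch that only tests whether a TCP port accepts connections from the machine it runs on. It relies on bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command. The function name and the notification step are placeholders, and an external monitor is still needed to see the node the way the rest of the network does.

```shell
#!/bin/bash
# Hypothetical helper: succeed if HOST:PORT accepts a TCP connection within 3 s.
port_is_open() {
    local host="$1" port="$2"
    # The inner shell receives host and port as $0 and $1.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example use (28967 is the node port used elsewhere in this thread):
#   port_is_open 127.0.0.1 28967 || echo "storagenode port is not answering"
```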

It is perfectly understandable though that this is the current state, as the software is still young. The only way to change the current state is to put some actual work into documentation. So… Is there a preferred way to contribute to the documentation site?

SGC · April 30, 2020, 9:13pm

Cron command for live appending logs, with date stamp and without redirecting logs from the storagenode container.

Wanted to keep this simple, which i didn’t, but now it works kinda great…
This will keep a live log saved with a 1 minute delay, without having to redirect your docker log files, and thus you can run all scripts by default or run stuff like docker logs storagenode --follow

how to make it work is explained in detail below, and everything is basically copy paste ready.
Enjoy… and i hope someone else finds this useful.

#
# Add this to your crontab, using crontab -e
# It runs a one-line script, which maybe could have been done from cron alone, but this is the best i got thus far.
#
# * * * * * tells cron to run the storagenode_append_log.sh script every minute
# >> appends the output of the script to the log file
# $(date +\%Y-\%m-\%d) adds the current date to the log filename (% has to be escaped as \% inside a crontab)
# 2>&1 sends stderr to the same file as stdout (docker logs replays the container's stderr on stderr, so keep it)
#
* * * * * /storagenodes/storj/scripts/storagenode_append_log.sh >> /storagenodes/storj/logs/storagenode_$(date +\%Y-\%m-\%d).log 2>&1





### Bash script, because cron cannot do variables without this getting messy
#
# filename:
# storagenode_append_log.sh

# Location (basically wherever you want,
# tho do keep in mind to change the location / folder in the crontab -e also)
# i set the location to a scripts folder inside the storj folder.
# seems mostly empty anyways... so... where was i

# Location:
# /storagenodes/storj/scripts/storagenode_append_log.sh

# Other related information
# use chmod +x /storagenodes/storj/scripts/storagenode_append_log.sh
# this makes the script below executable

## Bash script starts below this point

#!/bin/bash
# Bash script for appending to storagenode log
# If anyone knows how to make this into a crontab -e commandline only, don't hold back...
# Prints the log lines from the minute that ended one minute ago (GNU date syntax).

docker logs --since "$(date -d "2 minutes ago" +"%Y-%m-%dT%H:%M")" --until "$(date -d "1 minute ago" +"%Y-%m-%dT%H:%M")" storagenode
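
The two nested `date` calls are easier to check when the window is computed in one place. The sketch below is one way to do that, not something the author tested. `log_window` is a made-up name, and it takes the current time as an epoch value so the result can be verified by hand. GNU `date` is assumed.

```shell
#!/bin/bash
# Hypothetical helper: print "<since> <until>" for the one-minute window that
# ended a minute before NOW (epoch seconds), in the timestamp format used above.
log_window() {
    local now="$1"
    local win_start win_end
    win_start="$(date -d "@$((now - 120))" +%Y-%m-%dT%H:%M)"
    win_end="$(date -d "@$((now - 60))" +%Y-%m-%dT%H:%M)"
    printf '%s %s\n' "$win_start" "$win_end"
}

# Example use inside the append script:
#   read -r win_start win_end < <(log_window "$(date +%s)")
#   docker logs --since "$win_start" --until "$win_end" storagenode
```

Because both timestamps are cut off at the minute, consecutive runs produce windows that line up end to start, which is what keeps lines from being logged twice.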

SGC · May 16, 2020, 3:55pm

A guide for adding a second node to new HDD on same Linux host
by Stuberman


I read through many of the posts that ask about issues with second nodes, but I wanted to verify each step and ensure it was clear to me and others that want step by step guidance. Please correct any mistakes or vagueness I may have.

Choose additional new Internet port (28968) and add to port forwarding rules on firewall/router

Use existing wallet address

Request new authentication token (using same email address)

Install new HDD on system

Create new fstab entry see https://documentation.storj.io/resources/faq/linux-static-mount

Create new identity - [ identity create storagenode2 ] (where ‘2’ represents a new unique node name)

Authorize new identity - [ identity authorize storagenode2 auth_token ]

Verify new identities
grep -c BEGIN ~/.local/share/storj/identity/storagenode2/ca.cert [expect response ‘2’]
grep -c BEGIN ~/.local/share/storj/identity/storagenode2/identity.cert [and expect response ‘3’]

No need to install Docker - it already exists

Run new storage node (include new port ranges, paths and parameters - see example below)

Stop the watchtower: docker stop watchtower

Remove the watchtower: docker rm watchtower

Run watchtower on all nodes [ docker run -d --restart=always --name watchtower -v /var/run/docker.sock:/var/run/docker.sock storjlabs/watchtower storagenode storagenode2 watchtower --stop-timeout 300s ]

Add new uptime port monitor for node 2 at https://uptimerobot.com/
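
The identity check in the steps above can be wrapped in a small function so it is harder to misread the counts. This is only a sketch: `check_identity` is a made-up name, and the expected counts (2 and 3) are the ones given in the guide.

```shell
#!/bin/bash
# Hypothetical helper: verify the certificate counts of a signed identity folder.
# Expects 2 BEGIN lines in ca.cert and 3 in identity.cert, as in the guide.
check_identity() {
    local dir="$1"
    local ca id
    ca="$(grep -c BEGIN "$dir/ca.cert")" || return 1
    id="$(grep -c BEGIN "$dir/identity.cert")" || return 1
    if [ "$ca" -eq 2 ] && [ "$id" -eq 3 ]; then
        echo "identity looks authorized"
    else
        echo "unexpected counts: ca.cert=$ca identity.cert=$id" >&2
        return 1
    fi
}

# Example use:
#   check_identity ~/.local/share/storj/identity/storagenode2
```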

To start the new node:

sudo docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28968:28967 \
    -p 14003:14002 \
    -e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
    -e EMAIL="user@example.com" \
    -e ADDRESS="domain.ddns.net:28968" \
    -e STORAGE="13TB" \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="<storage-dir>",destination=/app/config \
    --name storagenode2 storjlabs/storagenode:beta
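
If more nodes follow, the host-side ports can be derived from the node number instead of being picked by hand. The sketch assumes the numbering used in this guide (node 2 gets 28968 and 14003) simply continues. `node_ports` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: print "<node port> <dashboard port>" for node number N,
# counting up from 28967 and 14002, the ports of the first node.
node_ports() {
    local n="$1"
    printf '%d %d\n' "$((28967 + n - 1))" "$((14002 + n - 1))"
}

# Example use:
#   read -r node_port dash_port < <(node_ports 2)
#   echo "-p ${node_port}:28967 -p ${dash_port}:14002"
```

The container-side ports stay at 28967 and 14002 for every node; only the host side changes.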

Port Forwarding on my firewall

More detailed instructions are in the standard Storj Installation Steps which this guide is based upon.

Thanks for helping me out!

H/T @kevink

SGC · May 22, 2020, 2:14pm

How to move DB’s to SSD on Docker


Before you begin, please make sure that your SSD has good endurance (MLC is preferred). I personally recommend using an SSD mirror.

Look into the official documentation and make sure that you are using the --mount type=bind parameter in your docker run string

Prepare a folder on the mounted SSD outside of <storage-dir> from the official documentation (that is your folder with pieces).

Add a new mount string to your docker run string:

Now we have:

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967 \
    -p 127.0.0.1:14002:14002 \
    -e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
    -e EMAIL="user@example.com" \
    -e ADDRESS="domain.ddns.net:28967" \
    -e STORAGE="2TB" \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="<storage-dir>",destination=/app/config \
    --name storagenode storjlabs/storagenode:beta

should be:
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967 \
    -p 127.0.0.1:14002:14002 \
    -e WALLET="0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
    -e EMAIL="user@example.com" \
    -e ADDRESS="domain.ddns.net:28967" \
    -e STORAGE="2TB" \
    --mount type=bind,source="<identity-dir>",destination=/app/identity \
    --mount type=bind,source="<storage-dir>",destination=/app/config \
    --mount type=bind,source="<database-dir>",destination=/app/dbs \
    --name storagenode storjlabs/storagenode:beta

Add/change a parameter in your config.yaml:
# directory to store databases. if empty, uses data path
# storage2.database-dir: ""
storage2.database-dir: "dbs"

Stop and remove your storagenode container:
docker stop storagenode -t 300 && docker rm storagenode

Copy all databases from “<storage-dir>/storage” to the new location “<database-dir>”. Do not move them! (If something goes wrong, we can just start again with the databases in the old location instead of having the storagenode recreate them.)

Start your new docker run .... string

Make sure that in <database-dir> you see files ending in .db-shm and .db-wal

Summary

If you can see .db-shm and .db-wal files in the new location “<database-dir>”, you can now delete the database files from the old location “<storage-dir>/storage”.
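
The copy and check steps could be scripted roughly as follows. This is a sketch under assumptions: the function names are made up, the databases are taken to be the `*.db` files directly inside the `storage` folder, and the node must already be stopped and removed as described above.

```shell
#!/bin/bash
# Hypothetical helper: copy (not move) every *.db file from SRC to DST and
# confirm that each copy is byte-identical to its original.
copy_databases() {
    local src="$1" dst="$2" f
    mkdir -p -- "$dst" || return 1
    for f in "$src"/*.db; do
        [ -e "$f" ] || { echo "no .db files in $src" >&2; return 1; }
        cp -p -- "$f" "$dst/" || return 1
        cmp -s -- "$f" "$dst/$(basename -- "$f")" || return 1
    done
}

# Hypothetical helper: succeed if DIR holds both .db-shm and .db-wal files,
# which is the sign the guide uses for "the node is using this location".
has_live_databases() {
    local dir="$1"
    compgen -G "$dir/*.db-shm" >/dev/null && compgen -G "$dir/*.db-wal" >/dev/null
}

# Example use (placeholders as in the guide):
#   copy_databases "<storage-dir>/storage" "<database-dir>"
#   ...start the node with the new docker run string, then:
#   has_live_databases "<database-dir>" && echo "safe to delete the old copies"
```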

frances · May 25, 2020, 8:38pm

this thread is going to my bookmarks tab. You’ve compiled very useful knowledge in here that I’m sure I will revisit in a few weeks (I need to go down the Linux road yet). maybe it is a good idea to just put here the useful things that anybody learns. For some it will maybe be obvious stuff, or already discussed in other places. But… well, we are in a redundancy friendly platform

SGC · July 15, 2020, 7:41pm

High Availability Considerations for SNO’s

I’m sure all SNO’s strive for their storagenodes to have the best uptime possible, which can be quite a challenge working on a limited budget with limited hardware. these are some of the considerations, implementations and compromises i’ve made in the setup of my server.

Electrical supply, i know my supply is highly stable to the point where my only considerations are getting some mains filters and fuses to ensure against overvoltage and the likes.
stable power is critical for reliable operation of most electronics.

Another thing i have considered is a small UPS type solution for safe shutdowns of the server in case of power outages, but since i’m using a Copy on Write filesystem, i should see minimal effect from power losses, so i’ll leave that to be better explained by those actually requiring and/or utilizing those power stabilization technologies.

And don’t forget grounding, i never really grounded my consumer grade computers and never really had much issue with that, but it can be a critical point of trouble which will rear its head in the weirdest of ways, especially when dealing with long copper based network cables, multiple power supplies and various devices hooked up right and left, at best it’s errors, at worst something will give up the magical blue pixie smoke… alas moving on

Power supplies, i’ve dealt with a lot of bad power supplies, ranging from the cheap ones which would act up at any sign of electrical noise… i’m also told that a large portion of harddrives fail due to bad power supplies. your power supply or PSU is the last line of defense for your computer, it is the component that keeps everything alive and running smoothly, and having a problem here or being cheap rarely pays off in the long run. some servers do have multiple PSU’s because they can wear out, especially if they are overloaded. loading any PSU beyond 75% of its capacity shouldn’t be recommended, it’s rarely worth it… tho i have had expensive power supplies that just wouldn’t die…
running at 125% of rated wattage, burning them to nearly a crisp, running with broken fans and what not… it’s quite impressive what well designed gear can take.

Many HA servers will have dual PSU’s allowing for zero downtime (DT) hotswap, which is nice, but a second PSU also increases the cost of the gear and the wattage consumed. i’m not aware how bad the added wattage use actually is, but i’m confident that it’s not zero… if i was to hazard a guess, maybe 25-50 watts extra power draw for a dual PSU solution. and if you got a quality PSU running at 50% or 60% capacity, then PSU failure seems to be very rare in my experience.

Networking
A few ground rules… if you are utilizing a UPS solution or other mains power filters, relays, fuses and such to protect your gear from electricity, then this is yet another critical place to pay attention.

When you are attaching long copper cables to anything you will have voltage differences… basically there is more energy in one place than another and it will crawl across metal, wires even materials that you might consider non conducting like paper, air and plastic will often allow the flow of electricity.

ofc like most know, some metals like copper conduct the best, which is why we use them for moving electricity, but this means any electricity… meaning if the wind blows over your house and your network cable, phone wire or such goes far away, you can in mere seconds have voltage differentials of thousands of volts. it might just give you a static jolt when it pierces your skin, and tho your NIC (Network Interface Card) is usually pretty well made to deal with such things… stuff like a lightning strike means all bets are off, it will fry your UPS protected gear, no matter the surge protectors and filters you have on it…

the network is your ingress point for not only data, but also a secondary path to killing your server…

the only true way to mitigate such an issue is wireless, fiber optic and such non conductive solutions.

i know many will frown upon me saying wireless… but wifi isn’t always bad… it really depends on how much radio noise or distance you have (lets just call it radio, this is getting long enough as it is… ), what kind of walls are in the way… so in some cases wifi may be a cheap solution while in others it will be totally useless… fiber is the preferred professional solution, if possible ofc… there are ofc a plethora of different variations of these technologies, but it always boils down to those two…

i must admit i ended up pulling a TP cable for running 1Gbit… almost without thinking ahead… even got cable good enough that i can do 10Gbit and most likely 40Gbit,

alas in 2020 hindsight i should have set up a fiber connection and then had two switches with fiber uplinks in either end, would have been the sensible solution… most likely also cheaper with fiber, because 10gbit + “ethernet” (as most call it, but that really isn’t its name, i’ll call it twisted pair because a better definition escapes me atm…) is simply ridiculously priced, so you are most likely better off avoiding it and doing some fiber uplinks, maybe use multiple 1gbit connections… most switching gear can handle many many gbit, so the cheap way is to hook up 4x 1gbit connections into the switch and then uplink that away over 10gbit fiber… not the best solution… but it works if you got the network gear for it.
ofc it won’t cut down on your power bill i bet… but it’s an easy patch and really how often do people need more than 4gbit bandwidth.
this also allows you some failover in case a NIC goes bad, but again pretty rare on quality gear…
we will touch more on network failover when we get to the actual configuration of the HA server itself.

The HA Server Hardware.
Servers are generally built for specific purposes, much like your regular consumer computer. there will be all kinds, just like cars, many purpose built but still kinda the same,

Most servers will have some integrated HA solutions that Enterprise and Prosumer customers have demanded over the decades since the rise of the internet… and others may be designed with HA in mind…

generally HA boils down to having redundancy and quality, which in the long run means that the system will end up a practically useless zombie system that only collectors will want, if the system is of proper quality in the first place… and isn’t replaced for other reasons before that time…

I’ll be describing the HA built into my 2U rack mounted server, in general terms… first off let’s stick to going through this systematically and work our way from the outside in…
Cooling, most of a server’s components will have passive cooling radiators, or fins if you like… thus basically maintenance free. the air flow is then provided by a setup of 3 to 5+ powerful fans that are easily replaced, because fans have a tendency to wear out… especially when being run near their max recommended speeds… with 3-5 fans you can afford to lose a few without the system ending up with no cooling, thus providing a stable cooling platform.

RAM, usually servers utilize ECC RAM, which enables them to cheaply and quickly perform scrubbing of the memory. this feature corrects bit flips and similar corruptions of memory data which can happen from time to time. this also allows the server to basically disregard an entire memory block or RAM module should it go bad, not really that familiar with that feature, but i suppose there must be a spare of whatever the thing is… yet another redundancy feature to allow for increased HA.

on top of this then comes whatever other options you set in your bios configuration, this is also where the scrubbing features are turned on and off. scrubbing takes some work to do and thus for high performance servers people might turn it off. in the case of SNO’s the general recommendation is Patrol scrubbing, which periodically performs a scrub. the other option is demand scrubbing, which will scrub any accessed data when the data is requested, thus giving an even larger performance penalty. this might be relevant for some workloads, but is in the most practical sense irrelevant for SNO’s, so leave that disabled.

On top of all this there is the RAM spare function, this is basically the RAID 5 of RAM. i assume the ECC features also play a part in this, but that’s not really relevant for us to know… meaning i don’t think you can run this feature with regular RAM. the spare function basically makes it so that the ram modules located on 1 channel are a spare… using parity math to be able to recalculate lost data… or that’s how i think it works… sadly this means that one will always sacrifice 1/3 of the installed memory capacity for this function… but if i choose to run this i should be able to hot pull / unplug 4 of my 12 RAM modules while the system is running and the system wouldn’t care… aside from starting to howl with alarms and some decrease in overall performance…

personally i think this is overkill for a SNO, but if your system is located in a difficult to reach location, or you travel a lot, it might be a good choice… costly feature tho. ofc it also provides yet another layer of redundancy against data corruption, but i never had much trouble with RAM so… disabled it is for me.
might enable it if i left for a month tho…

Networking HA Segment 2
Most servers have at the very least 2 NIC’s, often located on two different chips or chipsets to provide yet another failover option for the correctly configured server. in my case i got 4 NICs 2 on their own chip and one on its own chipset… in more modern servers one would most likely also have optical, but my server is simply that old… lol

One can aggregate two connections into one, with options like microsofts multiplexor functions or aggregate in linux, these options are pretty “easily” configured these days, but when it comes to load balancing and such not always optimal, kinda depends a bit on the luck of the draw, haven’t been very impressed with microsoft’s multiplexor or whatever its named… personally i would recommend a feature that keeps an eye on the connection and if the connection is down, then it will try to utilize the NIC located on the other chip or chipset… in case the chip / chipset is damaged or otherwise inoperable…

but in theory the aggregated load balancing NIC’s are preferred because this will double your bandwidth in and out of the server while giving you the same redundancy, when it works correctly… in some cases when poorly configured, one NIC going down will then affect the connection of the other, which basically just means that instead of halving your potential for NIC downtime, you essentially doubled your odds of something going wrong… so be sure it works like it’s supposed to when using these loadbalancing multiple NIC solutions.
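
On Linux, the failover behaviour described here corresponds to the bonding driver's `active-backup` mode, while modes such as `balance-rr` and `802.3ad` are the load-balancing ones. The sketch below only prints the `ip` commands for a failover bond so they can be reviewed before being run as root. The interface names and the `print_bond_commands` helper are assumptions, and most distributions would make such a setup permanent through their own network configuration files instead.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the iproute2 commands that would create
# an active-backup bond BOND from the given NICs, checking link state every
# 100 ms (miimon 100).
print_bond_commands() {
    local bond="$1" nic
    shift
    echo "ip link add $bond type bond mode active-backup miimon 100"
    for nic in "$@"; do
        echo "ip link set $nic down"
        echo "ip link set $nic master $bond"
    done
    echo "ip link set $bond up"
}

# Example use (interface names are placeholders):
#   print_bond_commands bond0 eth0 eth1
```

Pulling one cable while a ping is running is a simple way to confirm the failover actually works before relying on it.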

Using a local DNS name for the server will make sure that no matter the IP the server has, the data will be routed to the correct IP for your storagenode. ofc this is also highly dependent on your access to your router/dhcp server and/or its support for routing to DNS names.
can be rather useful when dealing with loadbalancing and failover since in such cases ip addresses might be a bit in flux.

CPUs
The CPU in my experience rarely goes bad, kinda like RAM, modern chip technology is pretty reliable, ofc i will assume that nobody gets the first generation tech, because new stuff will always have gremlins, so look out for that… if you are thinking in HA terms or simply like your tech to work…

corporations have a tendency to utilize end users as beta testers, even tho this is sort of bad for business, it can be difficult to simulate tests on a wide enough scale to avoid this…
so assume the 1st gen of new tech will be buggy and avoid it if possible.

“Many” servers will come with 2 CPUs and tho this is mostly for added performance, it can also provide additional redundancy for your HA system. granted this will not always be the case, so be mindful of that, and if this is a consideration for you, then make sure each CPU connects individually to the rest of the system. also this doesn’t make either CPU redundant, they will work in harmony with each other, sharing RAM and data between each other. however in the case of one of them critically failing, a hard reboot should bring the system back online in most cases… tho this does require some things like NUMA and maybe special RAM interleave configurations, so that the system has permission to assign some RAM to individual cores, thus basically creating a virtual computer within the system itself, ignoring all the other components it’s not talking to, which in that case could essentially be disregarded, and thus if corrupted or down, they wouldn’t keep the rest of the system down…
which leads us to the next logical step of HA

BIOS Options and Configuration.
I will here be going over some of the things we have already gone through, but for future searching and partially as a lookup of good procedure, i will briefly go over them again.

The Watchdog, any HA minded Server will come with a HW watchdog. the dog is fed by something, in my case the OS, but it’s very easy to adapt it to be fed when something is running… if the dog isn’t fed it will perform an action… such as a hard reset of the system, turning it off and back on again… or something in that regard…

there are usually a few of these options… not sure what is best… i would prefer a hard reset, so that power isn’t cut to the system… but in some cases this might be wanted… but if we imagine the system running into some sort of issue that won’t be solved… then with a power cycle it will sit there turning itself on and off… maybe every 2-5 minutes… so at least 12 times in an hour, so 288 times in 24 hours…

so let’s imagine you’re unable to get to it in less than 36 hours… it will have spun the hdd’s up well over 400 times, while on a hard reset… they will just have been spinning idle, which would not be that bad… the tiny motors in a hdd suck down 10 times the power on spin up… and this is a critical point of failure, but i suppose this really belongs in the section relating to the hdd.
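
Working the numbers through: the sketch below is just the arithmetic from the paragraphs above, with `power_cycles` as a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: number of power cycles in HOURS hours when the watchdog
# fires every INTERVAL minutes.
power_cycles() {
    local hours="$1" interval="$2"
    echo $(( hours * 60 / interval ))
}

power_cycles 24 5   # prints 288
power_cycles 36 5   # prints 432
power_cycles 36 2   # prints 1080
```

So over 36 hours the drives would see somewhere between roughly 430 and 1080 spin-ups, depending on where in the 2 to 5 minute range the cycle lands.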

but wanted to give you an example to why i prefer a hard reset to a power cycle.
as it could quite possibly break your machine for no good reason at all aside from a bad configuration, while on hard reset it will at worst have down time…

The one major disadvantage with the hard reset is that some things that a full power down could solve, would not be solved…

However i digress…
Watchdog will basically reboot your system if any issue arises where the BIOS (which is essentially a computer system within your computer system) loses contact with your operating system or otherwise, and thus resets the system… this would also work in case of CPU failures or issues like those touched upon previously in the CPU section.

There are also configuration options to split your QPI (QuickPath Interconnect) up from 20bit across all paths to using each channel individually at 5bit, this would in such a case protect against lanes failing. i haven’t set this up myself, but might test it out in the future… but let’s be honest… it’s basically a bus… what’s a bus… well it’s a big thing you put smaller things into or onto…

basically in this case it’s basically wires… which i would assume is inside the circuitry of the motherboard… so yeah… if you have your computer running on the northpole and can only come repair it every 6months… maybe i would use this…

but it’s yet another HA option that intel in this case provides… not very familiar with AMD’s stuff here… i’m sure it’s good… kinda… i like that they sort of just said… hey let’s just print more cheap dies and slap them onto big chips so we can get ridiculously many cores… and intel was like… YOU DID WHAT!!!

But AMD is still kinda the underdog, in many other aspects, so for HA i would stick with Intel or better.
50k $ powerpc chips like NASA usually uses…

wait what was i talking about again…

BIOS right…

compiling this has become a bit of a project, so lots of distractions, long pauses, multiple days and a bit of research going on behind the scenes, but mostly this is off the top of my head, and relies on my understanding of these topics… and i would ofc like this to be as accurate as possible, so if you think you can empirically prove something wrong, then i’ll be willing to look into it further.
like most people i know that one has to be more wrong, to be right… xD

Back to BIOS
this should really be a part of the watchdog thing, but then i have to start copy pasting…

Restore on AC Power Loss or so it’s called in my BIOS, might have many different names, i forget.

Usually for this i like to use Last Power State… nah, Last State it’s called in this case at least…
this makes it so that if you turn off the server, it won’t spring back to life if you have a power outage, disconnect and reconnect a power cable or what not.

which is kinda nice… and it will also spring back to life after a power outage if it was turned on when the power went out… so that’s pretty nice… not a big fan of the others… i’m running the default (on) right now… but i actually thought that would keep turning the server back on if it was turned off… but from how the name of the bios option sounds, i kinda doubt it now… maybe i should have read that when i set it… lol i’ll give that a test soon, that’s for sure… i’ve had this issue with the server just shutting down randomly, turned out it was some sort of power conservation feature that i had turned on… but to try and remedy it i figured i would try an option other than Last State, which has been my preferred BIOS setup for maybe a decade now… if i run through a bios i will set it to that, because why not… it’s the most sensible option…

Boot devices…

I would recommend setting your Primary Boot drive and your Secondary Boot drive and whatever else redundant Boots you might have… but personally i will stick with Primary and Secondary.
these days i split them so one is on the HBA and one is on the onboard mobo SATA/SAS controller.

that will cost some extra bandwidth on the bus, but if either controller fails then the secondary will pick up during reboot. the main reason to disable all the others is that if you add a drive during your uptime, the boot sequence can get displaced and the system might be unable to boot if it crashes.
thus i would recommend booting primarily off the onboard motherboard controller and then from an HBA based drive in case the other one fails.

Personally i like to have a boot drive not located on the HBA, basically anything directly on the motherboard. i also don’t like booting from RAID arrays, in case i have to find and correct issues with the RAID; that way the RAID can fail without the OS being affected. running the OS on a mirrored array can be a very good idea though, as it gives you a few more options for added redundancy, especially if you set up the mirror across different controllers.
(note that a mirror is technically RAID1, but it isn’t striped or parity raid… the system / controller just copies / mirrors / clones the data onto both drives, so either drive will work fine on its own… however one thing to keep in mind here is that a bad drive in a mirror can greatly decrease performance…)

and remember to enable NUMA and Patrol scrubbing (if you have ECC memory)

I’m sure there is lots of more detailed advice for this; these are just the things i have learned to account for, so those are my recommendations for HA considerations.

Storage.
I’ve chosen to go with ZFS for my storagenode / server, and i would recommend it for anyone that doesn’t mind stuff being quite technical. for those of you that want to keep it easier to manage i would go with a Raid6; sadly you will need 5 hdds, and more like 8, else you are most likely better off just running multiple nodes, but that doesn’t really make them very HA. you could do a raid5 with 3 hdds, but raid5 is quite flawed, and really your safe choice is either mirrors or raid6, and then you need to be sure your raid controller has either a battery or flash memory to protect against write holes from power outages…

something ZFS solves by being CoW, Copy on Write… basically it has some pointers, and it doesn’t overwrite data… it copies or adds the new data and then finishes off by correcting the pointer… thus if you lose power in the middle of something… the pointer isn’t updated and still points to the old data… and thus you lose the write that was in progress… but you didn’t corrupt your data… which is the future of any file system, everything less is simply archaic by now.
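the pointer trick can be shown with a toy in-memory model. this is only a sketch of the general copy-on-write idea, not how ZFS is actually implemented; the `CowStore` class and its `crash_before_commit` flag are invented for the example.

```python
class CowStore:
    """Toy copy-on-write store: existing blocks are never overwritten in place."""

    def __init__(self, initial: bytes):
        self.blocks = {0: initial}  # block id -> data
        self.root = 0               # the "pointer" to the live block
        self._next_id = 1

    def read(self) -> bytes:
        return self.blocks[self.root]

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        # Step 1: put the new data in a fresh block; the old block is untouched.
        new_id = self._next_id
        self._next_id += 1
        self.blocks[new_id] = data
        # Simulated power loss: the pointer is never flipped.
        if crash_before_commit:
            return
        # Step 2: one pointer update makes the new data live.
        self.root = new_id


store = CowStore(b"old contents")
store.write(b"half-finished update", crash_before_commit=True)
after_crash = store.read()    # the interrupted write is lost, old data intact
store.write(b"new contents")
after_commit = store.read()   # a completed write is visible
print(after_crash, after_commit)
```

the point of the design is that there is never a moment where the live pointer refers to half-written data: a write is either fully committed or not visible at all.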

after long study of raid i would say a raidz2 with 8 drives x2 is the array i would recommend for a storagenode pool, and anything less… well sorry, it might not cut it, or it certainly isn’t HA
ofc we live in a world of compromises and i am currently running a pool of raidz1 x2
x2 for the double hdd IO and raidz1 for some redundancy… but raidz1 is kinda dangerous, so i wouldn’t recommend it… but let’s leave it there… raidz2 with 8 drives should be much more redundant than any well monitored system needs to be… ofc i’m only two months into using zfs… and a few years into really using raid… so not really my place to tell you raidz1 is safe… even if i kinda think it should be…
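a small sketch of why the parity level matters in an "x2" layout: the pool stripes across the raidz vdevs, so it only survives as long as every single vdev survives. the function name is made up for illustration, and it ignores everything except the count of failed drives per vdev.

```python
def pool_survives(parity: int, failed_per_vdev: list[int]) -> bool:
    """A pool striped over raidz vdevs survives only if every vdev does.

    parity is 1 for raidz1, 2 for raidz2; failed_per_vdev lists the number
    of dead drives in each vdev.
    """
    return all(failed <= parity for failed in failed_per_vdev)


# raidz2 x2: two dead drives in the same vdev are still survivable
print(pool_survives(2, [2, 0]))
# raidz1 x2: two dead drives in the same vdev take the whole pool down
print(pool_survives(1, [2, 0]))
# raidz1 x2: one dead drive in each vdev is still survivable
print(pool_survives(1, [1, 1]))
```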

but raidz2 is quite safe… so let’s call that HA for a storagenode, this also buys you some time to replace a drive… even tho you really shouldn’t count on that… a broken drive should trigger a global hot spare resilvering.

anything less is asking for trouble, and if you don’t replace a failed drive you are asking for your array to fail… plain and simple…
he said without a hot spare for his raidz1… xD

Alarms and Monitoring. - the death of hardware…
This is turning out to be a bit more extensive than i first assumed…

seems like we have finally come to the conclusion of hardware considerations,
however a proper HA hardware setup is only half the battle. if even that…
this is a war on downtime and really the primary causes of downtime are the things
we didn’t account for. we also have the more external environment of the server to deal with,
as it can contribute just as much to your overall system downtime.

however inside the system we have many different components that need to run like clockwork,
and to be sure they do, we will need to have some monitoring of these components.
we want to monitor CPU temps, HDD temps, latency, fan speeds, hdd smart, raid array status.

these values we will log for future reference so we can troubleshoot, and tho this logging isn’t strictly
required it can be very useful for attempting to predict something like disk failures when noticing their
temps seem to be increasing outside the usual temperatures or into non recommended levels.
that way we gain the ability to predict potential problems ahead of time, such as dust filters being clogged and system gaining in temperature.
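one simple way to use the logged values is to compare the newest reading against the average of the history. this is just a sketch: the function name, the 5 °C margin and the sample numbers are assumptions picked for the example, not recommended values.

```python
from statistics import mean


def temp_is_drifting(history_c: list[float], latest_c: float,
                     margin_c: float = 5.0) -> bool:
    """Flag a reading that sits well above the logged average.

    margin_c is an arbitrary illustrative tolerance, not a vendor figure.
    """
    if not history_c:
        return False  # nothing logged yet, so no baseline to compare against
    return latest_c > mean(history_c) + margin_c


logged = [34.0, 35.0, 36.0, 35.0]      # made-up past HDD temperatures
print(temp_is_drifting(logged, 37.0))  # close to the usual range
print(temp_is_drifting(logged, 42.0))  # well above it, e.g. a clogged dust filter
```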

ofc it’s impossible to keep track of all this which is where alarms come in… when we have determined acceptable tolerances, we setup alarms which should be either emailed or sent by SMS preferably from a remote system concurrently tracking internet downtime and the likes.
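a minimal sketch of the "tolerances, then alarms" idea. the sensor names and limits below are placeholders made up for the example; real limits should come from your own hardware specs, and the actual email / SMS sending is left out.

```python
# Hypothetical tolerances: (lowest acceptable, highest acceptable), None = no limit.
LIMITS = {
    "cpu_temp_c": (None, 80),
    "hdd_temp_c": (None, 45),
    "fan_rpm": (500, None),
}


def check_readings(readings: dict, limits: dict = LIMITS) -> list[str]:
    """Return one alarm message per reading that is outside its tolerance."""
    alarms = []
    for name, value in readings.items():
        low, high = limits.get(name, (None, None))
        if low is not None and value < low:
            alarms.append(f"{name}={value} below {low}")
        if high is not None and value > high:
            alarms.append(f"{name}={value} above {high}")
    return alarms


# Made-up sample: CPU is fine, one disk is hot, one fan has stopped.
sample = {"cpu_temp_c": 65, "hdd_temp_c": 51, "fan_rpm": 0}
alarms = check_readings(sample)
print(alarms)
```

the returned messages would then be handed to whatever sends the email or SMS, ideally from a separate machine so a dead server can still be reported.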

tho these features can ofc mostly be integrated into the system, with downtime then tracked by some other service… we want notifications of unwanted behavior, unscheduled reboots, and such, but we don’t want so many notifications that we end up ignoring them. alarms are only worth anything if we listen to them.
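one common way to keep the notification count down is a cooldown per alarm, so the same problem doesn’t get reported every few minutes. a sketch, with the class name and the one hour cooldown chosen only for illustration:

```python
class AlertThrottle:
    """Let an alarm through at most once per cooldown period, per alarm key."""

    def __init__(self, cooldown_s: float):
        self.cooldown_s = cooldown_s
        self._last_sent = {}  # alarm key -> time the last notification went out

    def should_send(self, key: str, now_s: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now_s - last < self.cooldown_s:
            return False  # still inside the cooldown window, stay quiet
        self._last_sent[key] = now_s
        return True


throttle = AlertThrottle(cooldown_s=3600)
# The same alarm firing at these times (seconds) only gets through twice.
sent = [throttle.should_send("hdd_temp", t) for t in (0, 600, 1200, 3600, 4000)]
print(sent)
```

timestamps are passed in explicitly so the behavior is easy to check; in a real script `now_s` would come from something like `time.time()`.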

also it can be easy to actually make the system so redundant and HA that it will just keep spinning without us having to do anything, but when the redundancy is worn down it will eventually die hard, if we don’t have proper procedures in place.

HA and automatic updates…
i know storj promotes automatic updates and the system seems to run fairly smoothly, but it’s difficult to argue against the fact that systems rarely get unstable on their own… it is most often related to updates or updates failing

Segments being added in the future…
snapshots / bootloader
OS

(revision 1.0 - sorry this is still a bit of a mess, i’ll attempt to make it a bit easier to get the gist of it while scrolling through it)

stuberman · July 16, 2020, 12:57pm

I like the depth and thought going into your notes, there are good ideas and comments in the post.

I prefer to take a risk based approach since one size does not fit all and as you noted your situation does not have a power problem.

I think the first thing any SNO should consider is surge suppression, since it can easily cause damage to get a spike (lightning, transformer problems, etc) and is very cheap. A cheap surge suppressor is an easy, even if not perfect fix. What I did was use a ‘whole house’ surge suppressor which runs around $100 but most people would need an electrician to install it.

The second most important consideration, in my mind due to the very high uptime requirements for Storj, is a continuous duty HDD since this is the heart of Storj and you are demanding 24x7 continuous use. My preference is the Western Digital HC5xx series data center drives (UltraStar/Gold/HGST larger than 8 TB)

The third most important consideration, for me, is a UPS (which can replace a cheap surge suppressor). Decent quality starts around $50 for a good Line Interactive model that conditions the power and provides a way to gracefully shut down a system during a power outage. This would help protect the Storj databases from corruption (along with other measures such as HDD settings, for example disabling write caching).

kevink · July 16, 2020, 1:08pm

When I recently (re)invested into STORJ, I was thinking about which kind of HDD I want to buy. Going for a high quality data center drive with 16TB could easily get to 400-600€ per HDD depending on the model. Of course I’d have 5 years warranty but if it goes bad, my node would be lost too and I’d have to start again.
So I decided to buy cheap WD Mybook external drives and shucked them, using the HDDs as internal drives. They are white label Ultrastar HDDs so they are kind of server grade HDDs. They do have only 2 years warranty iirc.
However, they only cost me 150€ per 8TB HDD so I bought 3 for the same price as one 16TB HDD and run it in a raidz1 and have 16TB available too.
So this way I hope it’ll be more reliable than a single 16TB drive for the same price. (power consumption is of course tripled but that’s a minor factor imho).
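As a quick check of the numbers above, here is a rough sketch of the capacity and cost arithmetic. It uses the simple "drives minus parity" rule and ignores ZFS metadata and padding overhead as well as the TB vs TiB difference, so real usable space will be somewhat lower. The function name is made up for the example.

```python
def raidz_usable_tb(drives: int, parity: int, drive_tb: float) -> float:
    """Approximate usable space of one raidz vdev: data drives times drive size."""
    if drives <= parity:
        raise ValueError("need more drives than parity drives")
    return (drives - parity) * drive_tb


usable = raidz_usable_tb(drives=3, parity=1, drive_tb=8)  # 3 x 8TB in raidz1
total_cost_eur = 3 * 150                                  # price from the post
cost_per_tb = total_cost_eur / usable
print(usable, total_cost_eur, cost_per_tb)
```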

BrightSilence · July 16, 2020, 1:12pm

Buying those external HDD’s has been a great trick to get cheap drives. But it’s always a gamble to see what you get. And these days that gamble includes a very good chance of getting an SMR drive. Which I don’t even blame them for as USB drives aren’t meant for 24/7 use anyway. It’s actually a usage scenario that makes a lot of sense for SMR. So buyer beware if you try this trick these days.

kevink · July 16, 2020, 1:13pm

not with WD Mybooks. Those are known to contain Ultrastar HE drives and WD doesn’t sell any SMR drive with sizes >=8TB.

But generally I agree and would advise against buying external drives unless you know exactly what model they contain.

BrightSilence · July 16, 2020, 1:15pm

That’s good to know. A good google search before purchase would help with that as well. I think SMR in USB HDDs is going to become very common practice soon though.

Edit: Btw, ultrastar or data center drives are no guarantee for CMR. There’s these as well: https://www.westerndigital.com/products/data-center-drives/ultrastar-dc-hc600-series-hdd

Those are also larger than 8TB. So your statement isn’t entirely true. Just do your research beforehand.

SGC · July 16, 2020, 2:07pm

saw that there is now a 100TB ssd in what looked to be 2.5", ofc the price tag is absurd… but prices will without a doubt drop, i think in a year or two ssd’s might start to be relevant contenders for affordable storage… sure for now the price gap is like x5… but x5 isn’t that bad… doesn’t take a ton of additional factors to create good reasons to use ssd’s for storage, also i like the soft failure of ssd’s… they don’t really have the "one mechanical component failed and took the rest of the drive down with it" problem… or at least it becomes much less likely for ssd’s

@stuberman also on the topic of surge protection, then it might help… but then again… if you have quality hardware and haven’t had issues… might be less of an issue… ofc as the price of the setup keeps going up… minor stuff like that becomes a worthwhile insurance…

ended up watching this one the other day…
he does some very interesting videos on random tech one doesn’t really think about… which i kinda like.
this is a neat little setup, even if very ghetto

also learned recently that the glowing red on/off button on extension cords actually shows if the surge protection circuit is bad… it will start to flicker… and one has to replace surge protection circuits every few years because they will wear out… ofc power conditioners are an option… but really then i’d rather just put money into solar and, like his ghetto setup, use a charge controller to charge the battery bank, which an inverter then uses to power the system…

that would essentially do away with almost all possible issues… i would almost say it could survive a lightning strike… but few things can… and rarely stuff with wires… so imo surge protection is okay… but don’t count on it always protecting you… but i suppose 10% odds are better than 100% odds of failure for the same event

stuberman · July 16, 2020, 4:23pm

I enjoyed his video until I got to the end where he showed that he also had a UPS running… why didn’t he just add the big battery to the UPS to provide extended up time? He really did not need to buy a $300 inverter and a trickle charger when the UPS does both.

twl · July 16, 2020, 4:40pm

not yet, that is

It’s not unimaginable they could change that within the next production batch, and you could end up with whatever HDD model (probably one you don’t want) when buying an external drive.