Downtime of more than 5 hours, after years of working

But it sometimes does. For example, the last few freezes of my server were caused by defective RAM, problems with the HDD backplane, or a combination of both (I have replaced a few RAM sticks and so far no problems; I have not replaced the backplane yet).

Or, some servers I know sometimes froze (sometimes = once every 2-6 months). Disabling C-states solved that.

I have another server that sometimes reboots (sometimes = once every few hours or once every few months), but in a way that makes the BMC reboot as well. It worked OK before and I was not able to reproduce the problem, even by making the server run really hot and poking at various components (in case there was a bad connection somewhere). Because the server reboots and does not freeze, I guess I could use it for something that does not mind a reboot.

that’s interesting… i’ve been playing with C-states for a bit. with my new ssd it was recommended to turn them off because they would introduce latency xD but i barely see any power savings from having them on… been testing for 2 months or more, toggling them on and off for weeks at a time while monitoring wattage draw… it’s been like a 10w difference out of 270 or so watts… seems pretty useless…

but yeah
computers are a house of cards… even the smallest of things can affect them… cars driving past in the street, people walking by, loud sounds… so many things cause vibrations, so having a controlled environment is pretty important.

how would ram make the computer freeze… i mean the ECC feature should mean an entire RAM module would have to go bad at once… and that should just cause a reboot
kinda been wanting to set my server to run in spare-module mode for my ram… but a 1/3 loss of capacity is something i just cannot accept with my present ram limitations.

my server seems stable as a rock… i’m very impressed by Tyan, the mobo manufacturer; it was also the brand linus from LTT ended up using for his many-gamers-on-one-cpu setup, because the other brands he tried weren’t really server mobo manufacturers and thus were subpar…
and i’ve been so mean to it… i mean it’s like full of dead bugs and critters and much of the board has corrosion from poor environmentals

intermittent errors are the worst… so much nicer when something is plainly at fault, so one can troubleshoot it and know when it’s fixed… chasing intermittent errors is a hellscape of suffering and dead ends.
also why i really like that the tyan mobo basically has two of everything… every chip and connection is (or can be) split up into multiple paths, so the server can run without either one and thus stay online during the troubleshooting, which makes it something one can partly do from software, and thus so much easier.

my ram tho… has this issue where part of my ram loses connection and 8gb of ram just vanishes… doesn’t seem to happen while it’s running… only when i accidentally touch the ram modules… but it’s always 8gb even tho i got 4gb modules in sets of 3… not sure what that’s about… but it’s easily fixed and doesn’t really hurt the server… and it can ofc just boot without them… so meh… haven’t bothered trying to figure out what’s wrong… most likely more corrosion in the ram slots… or dead bugs… of which there are a few, but trying to control the environment much better now… :smiley:

No idea and it may not be related. The server froze, I rebooted it and it worked for a while, then froze again. This time it did not detect the hard drives, so I removed all of them, booted the OS and then inserted them one by one. All were detected except one. I inserted that drive into another slot and it was detected.
Then the server rebooted and this time BIOS displayed “uncorrectable ECC error” in one of the small RAM sticks. There are three of them, I replaced all three. Since then the server has not frozen or rebooted.

the ECC RAM has built-in chip-level redundancy… but i suppose one chip may be bad and then a second one may be acting up… read a study on RAM durability… it seems new RAM is more prone to problems, and then as the sticks age and the bad modules get replaced they eventually reach a state where errors are very unlikely… survival of the fittest i suppose…
but yeah even RAM has a limited lifespan i guess… and since i run zfs on everything i suppose the l2arc may give me an extra level of redundancy… not sure. i mean if ram data is corrupt it would most likely just get it from the l2arc

I don’t think so. If RAM data gets corrupt zfs may not notice it. This is why it is important to use ECC RAM with ZFS.

i think it also checksums data in the arc… but yeah ECC is important, what’s a checksum worth if it can itself have errors… but really the in-depth details of exactly how zfs does this stuff are beyond me… i was just saying i think if it failed a checksum on data in memory it would simply get it from the l2arc if possible… ofc that’s not always an option, as some data in memory may be so “live” that storing it to l2arc is basically impossible… but in many cases i believe it might well retrieve data, like if say a full ram block drops out… could it recover… well ofc the system may stall… but that’s also a matter of ram configuration…

if all the ram is interleaved for performance, then an error in one module will be an error in the data across the entire interleave set… i forget what it’s called… it’s the setting for how many channels the ram utilizes… so that it may shotgun the data across all the modules instead of sending one data block to each module…

been thinking of reducing mine… but when i tested it, the drop in server performance was severely noticeable, so i left them all interleaved.

old ram, but i got 12 modules, so that helps a lot with ramping up to a reasonable speed.
kevink is running zfs without ecc; he lost a file, or it got corrupted while stored in memory before it was committed to disk… thus far i haven’t seen anything like that, but i’m ofc on ECC

i know that if i disrupt my pool the l2arc will basically keep the server / OS running and give me much more time to restore the pool… ofc we are talking a few minutes here, and ofc one cannot read from the pool while it’s offline… but the l2arc and a slog basically mean that if i pull more drives than the redundancy allows and then replace them afterwards… the system mostly just continues without a hitch… complains a lot tho…

and it doesn’t seem to lose a byte… not yet anyway… only done it like 5-6 times, for fun or because i made a mistake about which drives i was removing, mainly because zfs doesn’t have that led-blinking ability to tell me which drive to remove like normal raid cards usually have… tsk tsk
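(on Linux that gap can often be papered over with the ledmon package, which drives slot LEDs on SES-2/SGPIO backplanes; a sketch, assuming the enclosure supports it and /dev/sdb is the suspect drive:)

```shell
# blink the locate LED of the slot holding /dev/sdb
# (ledctl is from the ledmon package; needs a backplane with SES-2/SGPIO LEDs)
ledctl locate=/dev/sdb
# turn the LED off again once the drive has been swapped
ledctl locate_off=/dev/sdb
```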

my faith in zfs is pretty great at this point lol

No.

Our results for memory corruptions indicate cases where bad data is returned to the user, operations silently fail, and the whole system crashes. Our probability analysis shows that one single bit flip has small but non-negligible chances to cause failures such as reading/writing corrupt data and system crashing.

Other filesystems are the same in this regard - corrupt memory will result in corrupt data.

that’s interesting… tho they do recommend end-to-end checksums, and the paper is like 10 years old… it doesn’t say exactly how old it is, but the latest reference is from 2009
could very well have been improved upon since, but i dunno… don’t really want to argue one way or the other… i dunno and i got no clue either way…

i’m sure it’s fine, or we would all be running our servers with a spare-RAM setup, so that there is a spare module per… what’s a group of ram modules called… channel, whatever… i got 3 modules in each of mine, 4 of them, 2 per cpu; if i run a spare setup, then 1 of the 3 is redundant in each… channel
and the ecc gives chip-level and bit- or byte-level redundancy.
and ofc the periodic scrub would also catch stuff like that and purge it or re-fetch the data from storage.
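(the scrub mentioned is a one-liner; a sketch, `tank` is a placeholder pool name:)

```shell
# walk every allocated block in the pool, verify checksums,
# and repair anything broken from the pool's redundancy
zpool scrub tank
# watch progress and any READ/WRITE/CKSUM error counters
zpool status tank
```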

the odds of the OS being affected, out of all the data going through the ram, are minimal… ofc if we want 100% data integrity it’s important…

but still, it took kevink like 2-3 months to even lose a single file without ecc…
and just because the data is damaged in ram doesn’t mean it’s damaged on storage… nor that it will be… it’s only relevant until it goes onto storage… which for me is less than 1 ms now, for the same reason… :smiley: then the ram data is basically redundant… because anything in arc and l2arc can usually be dropped without affecting operability

but yeah without a doubt ram is a very integral part of the server and thus problems there will spread like wildfire.

It would make things slower and ECC RAM already has, well, ECC, so why do it twice?

It’s 5 seconds by default on zfs. SLOG does not count, as it is not read during normal operation (only after a crash).
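(on ZFS on Linux that 5-second flush interval is the `zfs_txg_timeout` module parameter; a sketch of checking and changing it, assuming the in-kernel module path:)

```shell
# how often dirty data is committed from RAM to the pool, in seconds (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# change it at runtime (as root); to persist across reboots, put
# "options zfs zfs_txg_timeout=10" in /etc/modprobe.d/zfs.conf
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
```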

the slog will also be used if there is a failed checksum.

The 5 sec is only the max, which granted my setup will hit all the time, because i run sync=always to limit hdd random io and pool fragmentation.
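(sync=always is a per-dataset property; a sketch, the dataset name is a placeholder:)

```shell
# force every write through the ZIL (and thus the SLOG, if the pool has one)
zfs set sync=always tank/data
# verify the setting took effect
zfs get sync tank/data
```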

when i do an iostat -w 3600 i will rarely see high write wait times… granted there will always be some of those minutes-long waits if i’m doing a scrub or such… but the vast majority of the write waits will be in the ms range… tho i’m sure that counts the slog… not sure how to check it without seeing the slog writes.

ofc my new slog / l2arc pcie device hasn’t been active all that long, so my numbers are still a bit unknown, but it’s a vast improvement over the two older sata ssds i had trying to do slog… one of them would end up with latency of 125ms… now it looks to be in the sub-ms range 24/7

zfs is very advanced, i cannot say exactly how it all works… i doubt either of us truly understands it, ofc that doesn’t mean you aren’t right… it sounds reasonable, but i doubt it can really be simplified into 3 words :smiley: but maybe

end-to-end checksums would involve the data being checked and verified at each step of the way, even through RAM… ofc the real issue with that is, if it can detect an error while the data only exists in ram… then it has no way to fix it, it can only report it…

and some writes can at times wait minutes before being written from memory to the pool, and in such cases of large queues the chance of bit errors in memory goes up, which is why i like everything to go directly to my slog with sync always; then if the system discovers a checksum issue at some point it can go back and find a backup, essentially… even if it only covers like a 5-sec window… :smiley: it’s still essentially a backup of the data.

just like the l2arc will limit writes and reads on the pool and thus reduce latency and disk wear; additionally, if there ever are issues with the pool, in most cases lots of the stuff it needs will be in the l2arc, even if it takes weeks or months to fully warm it up…
and again, if some data is discovered to be corrupt, even if the hdd redundancy cannot fix it… then in some cases it could well be sitting around in the l2arc.

and ofc it doesn’t hurt that it makes everything faster… :smiley:
i would be very interested to know if ram is still the primary failure point for zfs… because that’s an easily mitigated issue… and if that is the case i might consider it…

SLOG is not read in normal operation.

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdp               4.73         0.02       261.45      15836  242461144
sdr               4.73         0.02       261.45      15836  242461144

These are the SLOG devices in my server. In 10 days it read 15MB and wrote 240GB…

As zfs does not do checksums for data stored in RAM, if the data gets corrupted, it will be written corrupted without checking the SLOG. SLOG is only read after an unclean shutdown.

Hi,

today i decided to do a new clean OS install on my server (Linux Mint 20) and a clean new docker and storj install.

I realised that the Linux installation was slow as hell. No error was shown, but i changed the OS SSD and then it seems to be at normal speed.

Don’t know what exactly the error is, but it seems to depend on the OS SSD and the mainboard (or cable).

I think i’ll change the mainboard and RAM and use a brand-new SSD. The old one was working for 3.5 years and there is no error shown, but it was very slow. It was running ext4. Some SSD issue?

Hoping this fixes my issue.

I’m working with BTRFS for my storage node. Now i am thinking of using btrfs for my OS SSD too, with ECC RAM.

Same thing happened to me. Had my node up and running for 11 months and suddenly this motherfuc*** auto-update tried to update it during the night and crashed my server + HDD while I was sleeping.
Woke up to realize I got disqualified. STORJ will lag behind its competitors because of situations like these.

some ssds can become very slow if filled beyond the 60% mark or even less; i believe using 40% of total capacity for the partition(s) is about the optimal choice for performance in most cases, ofc the types of ssd technology can be vastly different and thus YMMV

also never fill an ssd beyond 80%, since it often requires room to rearrange data; tho in many cases today this over-provisioning is done from the manufacturer’s side, so often it won’t be required, but then the 20% free space will simply ensure much improved performance…

the performance drops have to do with the fact that many modern drives are QLC but will initially store data with the cells working as SLC, which requires 4 times the raw flash to hold the data; later the drive has to read it back and rewrite it as QLC… thus you end up with roughly 5x the space touched in total… so if you write 100gb to a 500GB QLC drive, the SLC cache essentially uses 400GB to absorb it initially, and then another 100GB is needed to store it as QLC; thus copying 100GB straight onto an empty 500GB QLC SSD is basically the maximum it can manage at full speed…

ofc it’s rare such a drive will be empty… so let’s say it’s 40% full, meaning 200GB spent and 300GB left of QLC capacity; with the cells working in SLC mode, that’s only about 60GB that can be moved onto the drive at max performance…
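the arithmetic above can be sketched like this (illustrative numbers, using the rough 5x factor from the post, not figures from any datasheet):

```shell
# rough SLC-cache burst estimate for a QLC drive, using the ~5x factor
# from the post above (4x raw flash to hold the burst in SLC mode, plus
# 1x for the final QLC copy)
capacity_gb=500              # total drive capacity
used_gb=200                  # data already stored (40% full)
free_gb=$((capacity_gb - used_gb))
burst_gb=$((free_gb / 5))    # max data writable at full SLC speed
echo "free: ${free_gb} GB, full-speed burst: ~${burst_gb} GB"
```

with the numbers from the post this prints `free: 300 GB, full-speed burst: ~60 GB`.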

so yeah… long story short
the new paradigm of SSDs comes with its own pros and cons, which is very foreign to users still thinking of storage the way they used to with HDDs and RAM

i can neither confirm nor deny that argument :smiley:
but yeah i get that it is rarely read…
what command did you use to get that data tho… and why would you want to run mirrors for a redundant device that’s almost never used, and which copy-on-write almost makes even more redundant… seems so wasteful… sadly i cannot compare, because my slog and l2arc are on the same device.

iostat

Because SLOG is important for reliability. If power failed AND the SLOG failed at the same time, data would be lost. That’s why it has to be a mirror.
L2ARC is just cache, so it does not need to be reliable.
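A sketch of that layout with zpool (the pool name `tank` and the partitions are placeholders):

```shell
# mirrored SLOG: survives one SSD dying during the power-loss window
zpool add tank log mirror /dev/sdf1 /dev/sdg1
# L2ARC holds no unique data, so the cache devices are added plain, no mirror
zpool add tank cache /dev/sdf2 /dev/sdg2
```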

You can get stats on a partition:
iostat sda1 (you can specify multiple drives/partitions in one command).

For example, this is how it looks on another, non-Storj server (partitions “1” are used as SLOG in a mirror configuration, partitions “2” are L2ARC, no mirror). Uptime is 143 days.

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf1              0.22         0.01        12.73     178096  157312884
sdf2             85.33      5061.43       364.82 62566115008 4509712662
sdg1              0.22         0.01        12.73     178096  157312884
sdg2             85.32      5059.92       363.98 62547398276 4499287234


my overview looks vastly different

 zpool iostat fioa2 fioa1
         capacity     operations     bandwidth
vdev   alloc   free   read  write   read  write
-----  -----  -----  -----  -----  -----  -----
fioa2  1.18M  5.50G      0     30      0   270K
fioa1   116G   784G      5      2  51.9K   382K
-----  -----  -----  -----  -----  -----  -----

it gives me that since-boot average output that’s so useless in many cases…
do you use any special parameters, or maybe another zfs version? i forget what OS you are on

uptime is 4 days; fioa2 is the slog partition and fioa1 is l2arc. the sizes are a bit off because i defined them wrong and decided to just leave it and adjust later… wanted to make the l2arc a clean 1tb xD
didn’t account for overhead i guess

in regard to mirrors, i really should get my OS set up to run on a mirror; i was kinda expecting a reinstall but didn’t end up doing one, and i would need two similar drives to run it from… so not sure what i’m going to do there… maybe some sort of backup the system can boot from if the primary os partition fails… but it would be so much easier with a mirror… ofc at least in regular raid, if one mirror drive is bad, both are limited to its speed… not sure how zfs reacts in that case…

anyways i would much rather mirror the os partition than the slog; my data isn’t really so critical that i feel i need a mirrored slog, but i guess if one doesn’t want to keep track of the system, the mirrored slog does ensure that at least one ssd will work… ofc my slog ssd already has ecc and a hybrid raid-5-type deal at the chip level, or something like that… and should be self-healing from pre-allocated space for bad sectors, so tho a mirror is better… i don’t want to bother with the hardware for it…

i suppose my OS mirror could instead just be that i move the OS to the PCIe SSD and then use a USB stick for the boot partition, and then i could put a similar boot partition on all my drives, so if the USB fails it will simply use those. i forget how zfs deals with that, but at least when proxmox sets it up, it seems to put the boot partition on all drives in the pool…

not really what people seem to recommend tho… but as always there is room for improvement.

Even using a remotely controlled smart plug is more than you can expect from the usual SNO, in case Storj wants to attract home users as well.

it’s not “zpool iostat”, just “iostat”

i guess i would need to install that then… all i get is “command not found”.

apt-get install sysstat