Doh.. Looks like hardware failure coming up!

penfold · November 19, 2021, 4:03pm

So, this is not my node running on the disk with known bad blocks.

This is my 6 year old WD Black 3TB - but it looks like it has started dying…
The disk with known bad blocks only has 20GB stored and hasn’t hit any yet.

I might be going back to SAS RAID1 mirrors, I think. lol

[51046.699633] blk_update_request: I/O error, dev sdd, sector 886706088 op 0x0:(READ) flags 0x80700 phys_seg 21 prio class 0
[51046.699715] ata4: EH complete
[51048.797183] ata4.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x0
[51048.797194] ata4.00: irq_stat 0x40000008
[51048.797202] ata4.00: failed command: READ FPDMA QUEUED
[51048.797217] ata4.00: cmd 60/08:c0:a8:0f:da/00:00:34:00:00/40 tag 24 ncq dma 4096 in
res 41/40:00:a8:0f:da/00:00:34:00:00/40 Emask 0x409 (media error)
[51048.797223] ata4.00: status: { DRDY ERR }
[51048.797228] ata4.00: error: { UNC }
[51048.799745] ata4.00: configured for UDMA/133
[51048.799785] sd 3:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[51048.799794] sd 3:0:0:0: [sdd] tag#24 Sense Key : Medium Error [current]
[51048.799800] sd 3:0:0:0: [sdd] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
[51048.799808] sd 3:0:0:0: [sdd] tag#24 CDB: Read(16) 88 00 00 00 00 00 34 da 0f a8 00 00 00 08 00 00
[51048.799815] blk_update_request: I/O error, dev sdd, sector 886706088 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[51048.799882] ata4: EH complete
[51050.789287] ata4.00: exception Emask 0x0 SAct 0x4000800 SErr 0x0 action 0x0
[51050.789298] ata4.00: irq_stat 0x40000008
[51050.789306] ata4.00: failed command: READ FPDMA QUEUED
[51050.789320] ata4.00: cmd 60/08:d0:a8:0f:da/00:00:34:00:00/40 tag 26 ncq dma 4096 in
res 41/40:00:a8:0f:da/00:00:34:00:00/40 Emask 0x409 (media error)
[51050.789327] ata4.00: status: { DRDY ERR }
[51050.789331] ata4.00: error: { UNC }
[51050.791515] ata4.00: configured for UDMA/133
[51050.791565] sd 3:0:0:0: [sdd] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[51050.791573] sd 3:0:0:0: [sdd] tag#26 Sense Key : Medium Error [current]
[51050.791579] sd 3:0:0:0: [sdd] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
[51050.791587] sd 3:0:0:0: [sdd] tag#26 CDB: Read(16) 88 00 00 00 00 00 34 da 0f a8 00 00 00 08 00 00
[51050.791594] blk_update_request: I/O error, dev sdd, sector 886706088 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
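
A minimal sketch, assuming the kernel log above has been saved as text, of how the failing sectors could be tallied and cross-checked against the SCSI command bytes. The names `failing_sectors`, `read16_lba` and `SAMPLE` are illustrative only. The decoding relies on the Read(16) layout, where bytes 2 to 9 of the CDB hold the big-endian LBA.

```python
import re

# Matches block-layer lines such as:
# "blk_update_request: I/O error, dev sdd, sector 886706088 op 0x0:(READ) ..."
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

# Two lines copied from the log above (hypothetical variable name).
SAMPLE = """\
[51046.699633] blk_update_request: I/O error, dev sdd, sector 886706088 op 0x0:(READ) flags 0x80700 phys_seg 21 prio class 0
[51048.799815] blk_update_request: I/O error, dev sdd, sector 886706088 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
"""


def failing_sectors(log_text):
    """Return {device: {sector: error_count}} for every block-layer I/O error."""
    hits = {}
    for dev, sector in IO_ERROR.findall(log_text):
        per_dev = hits.setdefault(dev, {})
        per_dev[int(sector)] = per_dev.get(int(sector), 0) + 1
    return hits


def read16_lba(cdb_hex):
    """Decode the LBA from a Read(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    return int.from_bytes(cdb[2:10], "big")


print(failing_sectors(SAMPLE))
print(read16_lba("88 00 00 00 00 00 34 da 0f a8 00 00 00 08 00 00"))
```

Both the block-layer message and the CDB point at the same sector, 886706088, so the repeated errors here are retries of one unreadable spot rather than many different bad sectors.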

Alexey · November 19, 2021, 4:07pm

Perhaps you need to save/recover your data, I guess.

penfold · November 19, 2021, 4:12pm

Nah, it’s not worth the effort really on an income basis. Certainly the held-back amount doesn’t count for anything. I already have another node on a SAS RAID1 going through vetting on the other internet connection. I’ll leave this one online until the end of the month, then take it off and configure the new vetting node to use this gateway. RAID1 is just so much easier - replace the drive - done.

Alexey · November 19, 2021, 5:04pm

You can move it to another drive and continue, so you do not have to start from scratch. If you haven’t lost more than 10%, it could survive.
By the way, plain RAID1 might not be enough. It’s probably better to have it with checksums at least.
But I would not recommend using btrfs, unless it’s on Synology.

SGC · November 19, 2021, 9:35pm

a checksum file system like zfs can help give you a warning when stuff is going bad and help keep data integrity, tho i would recommend raid, raid 1 is immensely wasteful… ofc its also often not practical to use large disk arrays for something like a storagenode.

new nodes are best run on single disks and if they survive then move them to more reliable storage solutions.

zfs will halt the entire file system when errors occur. often i find errors are transient, disks are rarely just fully bad. most often they just act up from time to time, reseating / reconnecting them or rebooting the system often makes them go back into normal operation for months without issues.

so it can be really difficult to say when a disk is truly bad or not, but it comes often without warning, and often the disks that are just weird from time to time are not the first to die… no they ofc soldier on causing more pain and anguish along the way.

i will never use a non checksum file system again, thats for sure… not if i can avoid it.
checksums allow your file system to detect errors early on and thus avoid or mitigate issues, and if the shit hits the fan with like a bad cable or connection, they, or at least zfs, halt everything until the signal is restored.
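
a rough sketch of the detection idea, not of how zfs is actually implemented: store a checksum next to every block when it is written and compare it again on a later scrub. SHA-256 is only a stand-in here, and `write_blocks`, `scrub` and `BLOCK_SIZE` are made-up names for the illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the illustration


def write_blocks(data):
    """Split data into blocks and keep a SHA-256 digest alongside each block."""
    blocks = [bytearray(data[i:i + BLOCK_SIZE])
              for i in range(0, len(data), BLOCK_SIZE)]
    sums = [hashlib.sha256(block).digest() for block in blocks]
    return blocks, sums


def scrub(blocks, sums):
    """Return the indexes of blocks whose content no longer matches its digest."""
    return [i for i, (block, expected) in enumerate(zip(blocks, sums))
            if hashlib.sha256(block).digest() != expected]


blocks, sums = write_blocks(b"x" * (3 * BLOCK_SIZE))
blocks[1][100] ^= 0x01       # flip one bit, as a flaky disk or cable might
print(scrub(blocks, sums))   # [1]
```

a file system without checksums would hand back block 1 as if nothing had happened, since the drive itself reported no error.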

try to save the node and see how it goes, starting over is a pain…

Pentium100 · November 19, 2021, 9:38pm

zfs is also good in that it is very easy to add another disk to it to go from single disk to mirror.

penfold · November 19, 2021, 11:57pm

Unfortunately the timing isn’t great for me to work on this. I’m right in the middle of a migration from AD to Azure for a client that is being more tight than they should be on the licensing to use.
We know for sure GPOs and ACLs are going to break.

It means doing a lot of testing on my side - and this comes a week after doing a cleanup after a colleague stuffed up O365 licensing changes. They migrated a client to using a license that wasn’t compatible with a shared Office install on Terminal Services RDP.

Our Linux projects work far smoother. lol I was hoping to spend some time working on Rancher this weekend but apparently not… lol

Alexey · November 20, 2021, 8:43am

Ha. I did exactly that: I wanted to try it, and I did. Their RKE is amazing for on-prem, together with Longhorn.
I have also tried a Nomad cluster with Ceph and Consul. It works too, and it uses fewer resources for system services compared with k8s. The nice thing is that it can run regular binaries without containers!
But k3s could be an answer too, if you use only containers. I need to try it more deeply with Rancher Desktop.
The next step would be to convince our developers to make a native Storj CSI plugin, then I can use Storj for persistent volumes instead of Longhorn (k8s) or Ceph (Nomad).

Alexey · November 20, 2021, 9:12am

zfs makes sense only in RAID configurations; otherwise it’s 6 times slower than ext4 (simple single-disk zfs versus ext4 on the same disk).
Unfortunately, in my tests with default options for raidz and LVM parity volumes, the results were not good. On the same hardware and disks, raidz (3 disks) was 5 (!) times slower than LVM parity (the same 3 disks and 3 columns), both for rsync and for continuous 1G and 2G files.
I ran the test 3 times, cleaning the environment before each run; the results differed by less than 1%.

penfold · November 20, 2021, 9:16am

Ceph I’d like to try out as well but there is just so much to do and not enough time. My mate back in Australia is hassling me to do some Openstack stuff for him as well.

Then there are the other computing projects I’d like to get back into as well. Like getting Debian to run on some of my old Sun boxes. I tried booting Debian on my V120 but it failed miserably. OpenBSD and Solaris work fine so it isn’t the hardware.

No wonder the wife keeps catching me browsing Avito for the next server. lol
Probably the next one I will buy will be for Golem. So it will have more than 20 threads for sure.

I really, really want to get some of my Alphas here from AU in the next 6 months or so.
And I’d like to have an Itanium box to play with again as well. I used to support an HP-UX (yuck!) cluster running on SAN storage in one of my old jobs. I much prefer OpenVMS for that.

Alexey · November 20, 2021, 9:21am

I tried OpenStack a few years ago, but now it doesn’t look as amazing as it did at that time.
I would say simplicity is the key. Bootstrapping OpenStack on-prem was not an easy task, especially for the object storage, and I still had some issues after everything was working.

I would say that Debian has become worse in recent years. I was forced to remove it from our clusters (another project, unrelated to Storj) and replace it with Ubuntu 20.04 to make them stable again.

penfold · November 20, 2021, 9:25am

That matches what I read online about it. But it will have to wait for a while until I get the next server back into use. When I bought it the motherboard was dead - and a replacement board from ebay was also faulty. 3rd time will hopefully work. lol

penfold · November 20, 2021, 9:49am

I’ve never had much luck with pure Debian. I also use Ubuntu for my servers (and also sometimes Linux Mint).

In general I’d also add that Linux on non-x86/x64 architectures has always disappointed. I remember comparing Red Hat Linux (before RHEL) with OSF/1, aka Digital Unix, aka Tru64, and DEC Unix just blew Linux out of the water.

I ran NetBSD on MIPS DECstation 5000 machines for many years.
[image: DEC5000-260]
This was my 5000/260 with two 17" Sony monitors and 448MB RAM - which might not seem too impressive - except it was made in the 1990s! I have seen one of these machines with 12 000 shell accounts. Originally it ran Ultrix 4.5.

Alexey · November 20, 2021, 10:00am

I would not highlight this feature as zfs-only.
You can do that on any modern FS/LVM that supports RAID.
I would say that btrfs works much faster than zfs and even plain ext4, but its RAID is a mess, and btrfs is still not production-ready. I would strongly recommend avoiding it for the next 5-10 years (judging by their velocity in fixing major bugs).
I managed to lose data in the test environment on single-disk btrfs during an expand, which never happened with either zfs or LVM. I made a test that emulates the need to move data in place from NTFS on a single disk to btrfs/zfs/ext4 (LVM), for these three competitors:

btrfs
zfs
LVM + ext4

I was finally able to finish the test with btrfs in the second round, and did not lose data this time. But as we say, “we found the spoons, but the sediment remained”: something went missing while someone was your guest, the lost items were found later (they were elsewhere), and your guest was not to blame as you thought at the time, yet the bad feeling stays.
In this context, I had to add more disk space (another small empty partition) to complete the move without losing data. But from now on I will remember that btrfs is extremely fragile and not ready for use.

But the facts are: the fastest software RAID1 is LVM, then btrfs, then zfs.
However, checksums and automatic on-the-fly fixing of errors (but not automatic repair of the disk) are a nice feature.
With an emulated hole in one disk, LVM lost data (I was not lucky, and it decided to take exactly the broken disk as the master), while zfs and btrfs succeeded.
The fix worked pretty well on zfs and btrfs.
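
The difference in that test can be sketched roughly as follows. This is a toy model, not how LVM, zfs or btrfs are implemented: a plain mirror trusts whichever copy it treats as master, while a checksummed mirror can tell which copy is good and rewrite the bad one. All function names are hypothetical and SHA-256 is only a stand-in checksum.

```python
import hashlib


def digest(block):
    """Checksum of one block (SHA-256 as a stand-in)."""
    return hashlib.sha256(block).digest()


def read_plain_mirror(copies, master=0):
    """No checksums: the copy chosen as master is trusted blindly."""
    return bytes(copies[master])


def read_checksummed_mirror(copies, expected):
    """Return the first copy that matches the checksum and repair the others."""
    good = next((c for c in copies if digest(c) == expected), None)
    if good is None:
        raise IOError("all copies are corrupt")
    for c in copies:
        if digest(c) != expected:
            c[:] = good          # self-heal the damaged copy in place
    return bytes(good)


original = b"storj piece data"
expected = digest(original)
# Disk 0 has an emulated hole (zeroes), disk 1 still holds the real data.
copies = [bytearray(len(original)), bytearray(original)]

print(read_plain_mirror(copies, master=0) == original)           # False
print(read_checksummed_mirror(copies, expected) == original)     # True
print(bytes(copies[0]) == original)                              # True
```

With the damaged disk picked as master, the plain mirror returns the zeroes, which matches the data loss seen in the LVM run.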

A missing disk is a total failure for btrfs: it was unable to mount the mirror. I was forced to use a special flag to mount it in a degraded state. So if your root is on a btrfs mirror, you should be prepared to boot from a LiveCD to fix the issue.
zfs and LVM won this test.

SGC · November 20, 2021, 10:12am

running a checksum filesystem isn’t for the speed, it’s for the data integrity. zfs speeds up some tasks but raw writes and such will always be slower due to more iops being written to the drives… also there is a 3 or 4x minimum io amplification if xattr and atime are enabled, which they are by default.
on top of that zfs will dedicate a fast area of the disk for the ZIL, which means doubling everything written because it will be written twice, which is pretty stupid it seems… and the only way to avoid it is to have a SLOG device.

zfs really doesn’t do well in small setups.
also i stopped using rsync since my nodes are already on zfs, so when i move them i use zfs send | zfs recv
which runs about 10 to 12 times faster than rsync on zfs.
what i really like about zfs is that it’s basically indestructible when maintained, i’m 21 months in and i have yet to lose a single byte that i didn’t delete by accident…
not for lack of disk failures and issues tho, i’ve had thousands of recorded disk errors, where zfs and redundancy saved the day

Alexey · November 20, 2021, 10:16am

I know all of that. As a counterargument, I must say that btrfs has the same, but it’s at least 2-3 times faster than zfs at the same RAID level.
I also tested RAID10 on all three systems, as it’s the only viable RAID for production databases.
The ranking has not changed: LVM raid, then btrfs, then zfs.
BUT.
Since LVM raid has no checksum checks and btrfs is unstable, there is no other choice. Maybe only Ceph.

The same goes for LVM snapshots (with an extension like davidbartonau/lvm-thin-sendrcv on GitHub, which sends and receives incremental thin LVM snapshots of a live volume, transmitting only the difference between snapshots) and for btrfs snapshots. So, no win here.

I second that.
But still, it’s slow. You just need to take that into account and make changes accordingly.

SGC · November 20, 2021, 10:31am

ceph isn’t viable for most usages, it’s even slower than zfs. it does work well for a large network storage architecture tho… but after seeing some lectures from LHC and their issues with it… i’m fine with ZFS. and when you compare it to stuff like LVM raid, you totally miss what the point of ZFS is…

the whole design point of ZFS was to keep systems running indefinitely. ZFS is amazing, i’ve been so mean to it and i have yet to have an issue with it… but yeah it’s a tank… it won’t win any races, at least at first… my tiny little zfs setup uses 29 hdd’s and it’s actually still too small for zfs to really make good sense… but i’m getting there… lol
just need a couple more TB of SSD for caching only have 2TB thus far and a few more RAIDZ1’s

Pentium100 · November 20, 2021, 10:45am

I never used LVM as raid - it was always LVM on top of md-raid, ext4 on top of md-raid or zfs (or, in case of VMs, it’s usually ext4 on top of zfs).

LVM snapshots used to kill the write performance of the volume (at least on SATA SSDs). Maybe they have been optimized since the time I used them, but when I used them (CentOS 6), it was a problem.

Alexey · November 20, 2021, 10:54am

I know the weak points of LVM, like the lack of checksum checking. But I tested them more for home usage, not Enterprise, where everything is planned beforehand and you will not need a simple way to extend.
For home usage you do want to add disks as you grow, preferably at low cost. In this respect zfs is not so attractive for home usage.

The weak points of zfs are well known to me too: RAM-hungry, slow, and not extendable at low cost (you always need to add at least mirrors of equal disks; you cannot add only one disk or different-sized disks to the same raidz pool, unless it’s a simple volume. I know the latest version of zfs allows you to extend a raidz pool, but the stripe will not change, so your raidz will be as slow as before; adding disks will not increase speed as in the usual raid5, and there is no rebalance, because in the Enterprise you would never do so: you would create a new pool, because your load must run without downtime or significant speed impact). However, there is no alternative at the moment if you want good protection against bitrot or the RAID write hole (the latter is not solved in btrfs, by the way).

With the standard COW snapshots, sure. The speed becomes as slow as the usual zfs speed.
But you can use thin volumes: their snapshots are as fast as on zfs or btrfs and don’t consume space like the COW ones. Your main volume will grow as usual, consuming space for the differences, but without affecting the speed.

Pentium100 · November 20, 2021, 12:04pm

I remember testing zfs and lvm speed (on SSDs) and the results were pretty similar. The result in practice was even better - instead of having to create a snapshot, copy it and delete it (because of performance impact), I can just keep multiple snapshots and only copy the differences.

But that is on SSDs. On HDDs zfs would be slower, since it writes in a somewhat random fashion and there is no way to defragment it. Writes are faster than reads, when it’s usually the opposite.