Setup Bcache on Debian 11 amd64 with mdadm devices

Hi everyone!!
Today I am testing bcache on a Debian 11.5 virtual machine. I will update this post with the procedure to set up bcache on Debian 11.5 for running a Storj storagenode. If you are on Raspbian you will need to rebuild the operating system kernel with bcache support: see Enable bcache on RaspberryPi 4 4G node :heavy_check_mark:
I would also like to know whether anyone has already been using bcache (for a long time) for storagenode caching on SSD in writeback mode. If so, could you share your filesystem capacity, your cache capacity and your used-cache ratio?

I would like to set up bcache on top of mdadm devices to avoid data loss. I tested this configuration in a virtual machine and it works: make-bcache -C /dev/md1 -B /dev/md0 --discard --writeback (assuming /dev/md1 is an SSD RAID 1 and /dev/md0 is an HDD RAID 6).
Next I will look into making /dev/bcache0 extendable when adding more HDDs, so that the ext4 filesystem can be grown. (My storagenode holds a RAID 6 of HDDs with 20TB of capacity, and I plan to allocate maybe 40GB of cache with bcache in writeback mode.)
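For reference, the full sequence I used on the test VM looks roughly like this; it is only a sketch, and the /dev/sdX names and the mount point are examples to adapt to your own disks:

    # RAID 6 of HDDs as the backing device (example with 4 disks)
    mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    # RAID 1 of SSDs as the cache device
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
    # Create the bcache cache + backing devices in one call
    make-bcache -C /dev/md1 -B /dev/md0 --discard --writeback
    # Format and mount the resulting /dev/bcache0
    mkfs.ext4 /dev/bcache0
    mount /dev/bcache0 /mnt/storagenode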

If you use bcache, please answer my questions if possible; otherwise I will try to document bcache in this topic for use on amd64 (especially Debian 11.5). Thanks!


Hi,

TL;DR: It probably doesn’t make much sense to have such a huge cache partition.

I use Debian 10 with bcache on my Microserver Gen10 Plus.
I have one 8TB RAID1 partition and two 6+6TB unmirrored partitions on the same two HDDs (for historical reasons), for a total of 20TB of bcached space, of which around 16TB is stored data (Storj eating ≈75% of it).

My cache partition is 16GB (with writeback, sequential_cutoff=262144) and less than 5% of it is actually used (just checked; currently it’s 3%).
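For reference, the cache usage can be checked through sysfs, something like this (assuming the bcache device is bcache0; attribute names can vary slightly between kernel versions):

    # Current cache mode and sequential cutoff
    cat /sys/block/bcache0/bcache/cache_mode
    cat /sys/block/bcache0/bcache/sequential_cutoff
    # Dirty (not yet written back) data held in the cache
    cat /sys/block/bcache0/bcache/dirty_data
    # Percentage of the cache set that is still unused
    cat /sys/block/bcache0/bcache/cache/cache_available_percent
    # Hit/miss statistics since boot
    cat /sys/block/bcache0/bcache/stats_total/cache_hits
    cat /sys/block/bcache0/bcache/stats_total/cache_misses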


Greetings and thank you very much, xopok, for this nice feedback :slight_smile: Have a nice week! :slight_smile: Eioz.

I advise against using bcache in writeback mode. I’ve observed two events in which, due to a power failure or OS crash, the cache device became unrecoverable despite being placed on RAID1. There seems to be some ugly interaction between these two where, if the underlying device reports inconsistencies in some key data structures, bcache refuses to use the cache at all. I’ve lost one file system (had to recover from a backup) and almost lost another (thankfully it was btrfs configured with redundancy, so a scrub recovered all data; still, failing checksums were all over the place). This was years ago, before Storj existed. bcache is no longer developed, so I can only assume the problems still exist.

Write-through/write-around should be fine. Both will help with the file walker process taking ages, and with frequently downloaded pieces.
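If a device was already created in writeback mode, the mode can also be switched at runtime through sysfs, roughly like this (assuming the device is bcache0):

    # Switch an existing bcache device to a read-caching mode
    echo writethrough > /sys/block/bcache0/bcache/cache_mode
    # or
    echo writearound > /sys/block/bcache0/bcache/cache_mode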


16GB * 0.03 = 480MB. The size of btrfs metadata on my 2.5TB Storj node is 9.2GiB (18.4GiB if the metadata is duplicated).
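For reference, that figure comes from btrfs itself, roughly like this (replace the mount point with yours):

    # Show data/metadata/system allocation, including the DUP profile if metadata is duplicated
    btrfs filesystem df /mnt/storj
    # More detailed breakdown with newer btrfs-progs
    btrfs filesystem usage /mnt/storj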


Hi atomsymbol, hope you are well. It sounds like bcache is using only 480MB for caching. Maybe the btrfs metadata also includes the file table (directories and files), which for example would speed up recursive random lookups more than bcache I/O does, while other metadata is not cached properly. I read some benchmarks of btrfs versus bcache on hard disk drives (HDD), and they seem to perform at approximately the same level regarding caching. Could you confirm those points? I also saw a mount option, inode_cache, being used. Have fun using your btrfs filesystem :wink: :wink: :wink: and have fun on Storj; I saw your recent posts, thank you for helping the developers. Thanks, Eioz.

PS: for the moment I would still consider bcache carefully, even though it’s not developed anymore, but I don’t know; I will wait for some more advice, maybe also regarding btrfs!

I can advise not to use btrfs for a storagenode, see Topics tagged btrfs


I agree that plain btrfs without bcache isn’t suitable for Storj. But plain ext4 (without lvm) isn’t suitable for Storj either.

In my experience, the read bandwidth of copying Storj files from ext4 is approximately 10 MB/s:

  1. At 10 MB/s it would take 1.16 days to copy 1TB of Storj data
  2. The Storj node would have to be offline during the final run of rsync
  3. The size of the target device (where the files are being copied to) has to be at least the size of the source device. If it is plain ext4 (without lvm) it is impossible to add a 2TB HDD to a 4TB node to get a 6TB node.

With btrfs:

  1. The read-bandwidth of "btrfs device remove ..." depends on the read bandwidth of the HDD (from 80 to 240 MB/s)
  2. "btrfs device remove ..." can run while the Storj node is online
  3. I don’t have experience with lvm’s pvmove command (read bandwidth; source/target constraints of pvmove)
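For example, growing or shrinking a btrfs node happens online, roughly like this (device and mount point names are examples only):

    # Grow: add a new disk to the mounted filesystem
    btrfs device add /dev/sdg /mnt/storj
    # Shrink/evacuate: migrate all extents off a disk so it can be removed
    btrfs device remove /dev/sdf /mnt/storj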

I still have logs from some of the migrations I performed a few months ago; they were quite a bit faster than that, at ~30 to 40 MB/s, as measured by rsync --info=progress2.

And the final sync was a matter of minutes.

Maybe your storage has additional constraints that make operations slower?

As long as your extent size is decent (the default of 4 MB is ok), it’s pretty much raw sequential speed.

Would probably be suitable enough if this suggestion was implemented.


Thanks. I have added max_inline=1K to my btrfs mount options (the default is 2K).
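For reference, the option ends up in /etc/fstab roughly like this (the UUID and mount point are placeholders, and I spell the value out in bytes):

    # Storj btrfs volume with a smaller inline-extent limit
    UUID=<filesystem-uuid>  /mnt/storj  btrfs  defaults,max_inline=1024  0  0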


it’s working fine and fast. LVM is primarily a more flexible replacement for MBR partitioning; it doesn’t improve ext4 in any way and cannot replace it. LVM makes working with volumes more abstract, and it can also provide RAID options and some kind of COW, if needed.
I agree that it’s better to use it; it can help do online migrations more seamlessly.
I have experience with pvmove: the speed is almost the raw disk speed.
Just never use it for STORJ in stripe mode: if one disk dies, the whole node is dead.
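A typical online migration with pvmove looks roughly like this (only a sketch, the device and volume group names are examples):

    # Add the new disk to the volume group
    pvcreate /dev/sdh1
    vgextend vg_storj /dev/sdh1
    # Move all extents off the old disk while the node stays online, then retire it
    pvmove /dev/sdg1 /dev/sdh1
    vgreduce vg_storj /dev/sdg1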

Please never do it. If one disk dies, the whole node is dead.


Hi, with LVM you could use lvconvert -m 1 /dev/vg/lv /dev/sdg1 to add one mirror after running the pvcreate and vgextend commands. I will consider using this rsync command for the cold copy (after the storagenode container has been deleted with docker rm).
Rsync command: rsync -rtuqp --delete --inplace --rsh=/usr/bin/rsh --no-compress /SOURCE/ /DEST
Explained:

rtuqp : recursive, preserve times, update (skip files that are newer on the destination), quiet, preserve permissions
delete : delete files on the destination that no longer exist on the source
inplace : update files in place instead of writing to a temporary file during the transfer
rsh=/usr/bin/rsh : use rsh as the remote shell instead of ssh to avoid encryption overhead (has no effect on a purely local copy)
no-compress : don’t use compression

I hope I will not be suspended from Storj during the storagenode downtime for this maintenance!
I will not create LVM on /dev/bcache0 because, personally, I prefer to launch another docker container after filling my mdadm RAID6 backing device. For a 3TB copy without those options (the rsync command is above), the program has been running for more than 24 hours (I launched it as rsync -rtuvp --bwlimit=100000; for local copies I think you should avoid verbose, progress and bwlimit, and use quiet, inplace and rsh for the first copy instead). Thanks to all!
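The cold final sync should then look roughly like this (a sketch; the container name and paths are examples to adapt to your setup):

    # Stop and remove the running node so no files change during the final pass
    docker stop -t 300 storagenode
    docker rm storagenode
    # Final rsync pass, now with --delete so the copy exactly matches the source
    rsync -rtuqp --delete --inplace --rsh=/usr/bin/rsh --no-compress /SOURCE/ /DEST
    # Then re-create the container pointing at the new location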

One cool article on rsync: https://explainshell.com/explain/1/rsync

I will keep you informed :wink:
Have fun !
Eioz

How exactly did you compute that it is economically viable to use RAID1 for Storj?

If the payout paid to node operators per TB was approximately 3 times higher than it is today (today: $1.5/TBm), RAID1 would make sense from my viewpoint.

Monthly payout for Egress can be approximately 5 times larger than the payout per terabyte. Under such circumstances, RAID1 makes sense only if the node is at least 5TB.

What is the percentage of Storj nodes with at least 5TB of data, in respect to all Storj nodes?

According to Storj’s youtube video for 2022-Q2:

  • Payout to node operators: 0.4 million
  • 3rd party providers: 2 million
  • Salary and bonus: 1.5 million
  • Other transactions: 5.9 million

Node operator profit share: 0.4 / (0.4+2+1.5+5.9) = 4.08%

I don’t think @Alexey suggested that. How exactly did you arrive at this conclusion?

Hi, for me it is better to consider RAID5 or RAID6, which use striping with parity rather than mirroring, so you don’t lose as much capacity; simply buy more disks to build a RAID5 or RAID6. Today I have 8 disks for my Storj RAID6: 7 disks form the RAID6 plus 1 spare disk, so I lose 8TB to parity in this configuration plus 4TB sitting unused as the spare. But I don’t want to lose data, because if I lost data my node could have to start from zero again.
Let’s give thanks for the erasure-coding algorithm in place, which stores each piece with redundancy (at least 3 ways) across the network, but no, I don’t want to rely on that algorithm alone, as I am playing long-term with 20TB of potential data once my node is full and I don’t want to wait to refill the node in case of data loss. I think it will be nice for me to use bcache with mdadm without LVM; I will maybe format bcache0 with XFS :wink: Also take into consideration that, for the price paid to storage nodes, you hold only one of the (at least 3) copies of the whole data. Regards, Eioz.
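For illustration, an array like mine (7 active disks plus 1 hot spare) would be created roughly like this (only a sketch, the disk names are examples):

    # RAID 6 over 7 disks (two disks' worth of capacity go to parity) plus 1 hot spare
    mdadm --create /dev/md0 --level=6 --raid-devices=7 --spare-devices=1 /dev/sd[b-i]
    # Check the array layout, state and spare
    mdadm --detail /dev/md0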

FYI what you calculated is Node Operator share of disbursements from STORJ token reserves (i.e. Storj Labs expenses paid in STORJ), not profit share. Calculating profits would require taking into account how much Storj Labs receives from customers for services rendered.

You can get more info on the Token Flow reports (this is only about the STORJ token flow, not cashflow)

I don’t have data which would support the claim that it is more than 4%. If you have such data, then post it here please. Thanks.

I never said that. I am saying that RAID0 is a dangerous configuration where you can easily lose the entire node if even one disk dies, independently of your costs and profit.
I would like to suggest taking a look at RAID vs No RAID choice, to avoid repeating the whole cycle of why RAID costs more than one node per drive.


I was taking issue with using the word “profit” in this context. Storj Labs will give a general update in the upcoming Townhall meeting on Nov 2. However, this is not going to include data as you would expect from a shareholder meeting as Storj is not a publicly traded company.

Hi, as the subject of this topic is using bcache with mdadm devices, below is my method with HDDs and an SSD cache (read caching only, “writearound”, since the other modes are recommended with BBUs).

  1. Copy the whole data with rsync:
    rsync -rtuqp --delete --inplace --rsh=/usr/bin/rsh --no-compress /storage/hdd/STORJ/ /storage/hdd/STORJ_BACKUP/
  2. Create the bcache device with:
    make-bcache -C /dev/raid1_ssd/01_cache -B /dev/md0 (used LVM and mdadm here)
  3. Attach cache with cset UUID (if not already attached)
    bcache-super-show /dev/raid1_ssd/01_cache (search for cset UUID)
    echo XXXXXXXXXXXcsetUUID > /sys/block/bcache0/bcache/attach (get cache cset UUID from bcache-super-show)
  4. Tune bcache0 (these values are not kept after a reboot, except maybe “cache_mode”, so they can be re-applied from a script run as a service; see the sketch after these steps)
    echo writearound > /sys/block/bcache0/bcache/cache_mode
    echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
    echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us
  5. Create XFS filesystem on bcache0
    mkfs.xfs /dev/bcache0
    mount /dev/bcache0 /storage/hdd/STORJ/
  6. Copy again from the backup filesystem to the new filesystem
    rsync -rtuqp --delete --inplace --rsh=/usr/bin/rsh --no-compress /storage/hdd/STORJ_BACKUP/ /storage/hdd/STORJ/

I can tell you that bcache can use either LVM or mdadm devices for the cache or the backing device. I used LVM because I don’t have a free partition slot on my SSD drives, so I made an mdadm RAID 1 from the last partition /dev/sdx4 of each SSD (where /dev/sdx is an SSD) and created an LV with LVM on top of that SSD mdadm RAID 1. So I am currently using an LV (logical volume) on an mdadm PV as a 20GB writearound cache.
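Since the sysfs values from step 4 do not persist, a small boot script can re-apply them. This is only a sketch with example paths and the values from step 4; adapt it to your device:

    #!/bin/sh
    # /usr/local/sbin/bcache-tune.sh (example path): re-apply bcache tunables after boot
    [ -e /sys/block/bcache0/bcache ] || exit 0
    echo writearound > /sys/block/bcache0/bcache/cache_mode
    echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
    echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us

It can then be run at boot from a oneshot systemd unit (again just an example) and enabled with systemctl enable bcache-tune.service:

    # /etc/systemd/system/bcache-tune.service (example)
    [Unit]
    Description=Re-apply bcache tunables
    After=local-fs.target

    [Service]
    Type=oneshot
    ExecStart=/usr/local/sbin/bcache-tune.sh

    [Install]
    WantedBy=multi-user.target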

I will keep you informed regarding performance and cache filling.
The next step is to move the database directory to the RAID1 SSD logical volume, either through the storagenode container configuration or directly in the config.yaml file (after safely moving the DB directory and DB files).
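For the config.yaml route, I believe the relevant option is storage2.database-dir; this is only a sketch, the path is an example, and in a docker setup it must be a path visible inside the container (e.g. an extra bind mount of the SSD volume):

    # config.yaml: keep the node's SQLite databases on the SSD logical volume
    storage2.database-dir: "/mnt/ssd/storagenode-dbs"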
Eioz