Sounds interesting, looking forward to reading more.
The filewalkers are the big pain in larger nodes.
I don't think it's a Pi5 limitation, though. It's more an HDD IOPS limitation.
The lower available RAM is a drawback but by and large I don't see my Pi5 node being CPU-constrained.
EDIT: I wonder how much better SATA would be over USB on the Pi5.
If we're going to go down that path, then we're right back at using something like this
Too messy.
I'd like something like a SATA version of the Argon Eon, for example.
Something that looks nice and self-contained.
Trailing wires make my OCD twitch
With sync=disabled and the metadata cache, the IOPS consumption should be lower.
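If "metadata cache" here means an L2ARC restricted to metadata, a minimal sketch of those two settings would be (the pool/dataset name storj is just a placeholder):

zfs set sync=disabled storj
zfs set secondarycache=metadata storj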
But let's face it: my current system consumes 154W altogether. To run all 8 drives I might need a second Pi5. Even then it should cut my electricity cost in half. So even if the Pi5 has some slowdowns, it already has a head start thanks to the lower operating costs.
First results are looking good. The filewalker isn't as fast as I was hoping for, but manageable. A full run takes about 3 hours; for a full drive I would estimate 9-12 hours. While the filewalker is running, the success rate goes down to 95%. While the filewalker is not active, the success rate stays between 98 and 99%. That's awesome.
Up next: insert more drives and see if the success rate stays that high. It takes a long time to prepare the drives; I need to run chown on the entire drive first, and that takes forever.
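For reference, this is just a recursive ownership change over every file on the drive, something like the following (the user/group and mount path are placeholders for whatever the node runs as):

sudo chown -R storagenode:storagenode /mnt/storj01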
I migrated all 8 hard drives to the Pi5. Success rate and data inflow are still strong. I am surprised how well it works. I was expecting to be able to run maybe 4-6 drives on one Pi5, but now even 8 drives are running fine, at least for now. I will keep that running for some time and also hope to see some garbage collection action. The first run will be a bit more expensive because the cache isn't filled yet.
I usually use Geekworm hardware for Pis - very robust and convenient.
Are they all USB-connected?
The case is linked further up in this thread.
I was able to test out the boundaries a bit. The CPU usage doesn't depend so much on the number of drives. It does scale with the number of uploads I am getting. Looks like the maximum is about 200MBit/s. At that point the Pi5 is running at 90-100% CPU usage.
I missed almost all the garbage collection runs this weekend. I don't know yet how that would impact my nodes. I will wait for these results before writing up a guide.
In terms of power consumption, my old system used 150W total and the new setup needs just 100W. 35W of that are for the router, firewall, phone and smart home box → 65W for the Pi5 + 8 drives. This means I am reducing my electricity bill by 150€ per year.
I did set up the multinode dashboard and netdata on the Pi5. It doesn't look like it steals too much CPU time. I will try to set up my grafana dashboard as well, but with just 14 days of history to reduce resource consumption. Netdata seems to show some incorrect numbers, but I haven't searched the internet for a solution yet.
I am hitting the first problem. journalctl tells me that my l2arc gets reset on reboot, and it doesn't look like I can change that. The problem is simply that the Pi5 doesn't have enough memory for my l2arc. I am not sure how big I could make the l2arc; that is beyond my current capabilities.
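For anyone debugging the same thing, assuming OpenZFS 2.x (where persistent L2ARC and its header accounting are exposed), two read-only checks show whether rebuild-on-boot is enabled and how much RAM the L2ARC headers are currently eating:

cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
awk '$1 == "l2_hdr_size" {print $1, $3}' /proc/spl/kstat/zfs/arcstats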
To make this a positive outcome: I have learned a lot about the limitations of a storage node. I still find it amazing that a Pi5 can reach and maintain 200MBit/s. Even when garbage collection and the used-space filewalker are running at the same time, the upload rate is not impacted. I don't fully understand why. I see that the success rate goes down to 80%, but it doesn't impact the throughput as much as I thought. That indicates a possible bug with node selection to me. I remember @elek was saying that the success tracker might have too short a history. So I am going to jump on that train and push the corresponding code change so that we can repeat the load test with a longer success tracker history, and maybe my experimental setup will be a great test object.
I am not sure what I want to change next. I could order a second Pi5 and run 4 storage nodes per Pi5. I did a short test run and dropped all the l2arcs except for 2 nodes. It looks like ZFS can still cut the filewalker runtime in half, and there is an argument to be made that this is acceptable speed. On the other hand, I have to try the same experiment with ext4 first. I want to test how well a plain vanilla ext4 node would run and which options I have to improve performance. In particular, the experimental cache that was recently added to the storage node might get an ext4 node to similar performance. And there is an inode cache thing that I could also try out.
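Just as an example of an inode-cache knob for the ext4 experiment (it may not be the "inode cache thing" meant above): the kernel's vm.vfs_cache_pressure sysctl controls how aggressively inode/dentry entries are reclaimed, and lowering it from the default of 100 keeps more of them in RAM for the filewalker:

sudo sysctl vm.vfs_cache_pressure=50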
I didn't do the math for ZFS. I believe I could run 2 nodes just fine and benefit from the l2arc performance boost. There are also some Pi5 alternatives with 16 or 32GB of RAM. Out of the box ZFS would use half the memory, but that can be changed. So to max out the l2arc benefits I would try 4GB for the storage nodes and 12/28GB for ZFS. At that point the l2arc might support 6/14 storage nodes.
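Capping ARC at a fixed size instead of the default half of RAM is usually done via the zfs_arc_max module parameter; a sketch for a 4 GiB cap (value in bytes), which would roughly match the split described above on the 8GB board:

echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
# to make it permanent, in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=4294967296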
Would you be able to post the outputs of:
cat /proc/interrupts
cat /proc/meminfo
Really curious to see how things are hanging together, as different kernels/distros will have different options applied. There's a long history on the RPi4 of IRQs being non-optimal, plus CMA issues. I haven't had a chance yet to try this out on an RPi5 with the stuff you are testing, so if you had a chance to post, that would be useful.
storagenode@pi5storagenode:~ $ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
9: 0 0 0 0 GICv2 25 Level vgic
11: 0 0 0 0 GICv2 30 Level kvm guest ptimer
12: 0 0 0 0 GICv2 27 Level kvm guest vtimer
13: 121821939 149649080 148596923 147620533 GICv2 26 Level arch_timer
14: 44241 0 0 0 GICv2 65 Level 107c013880.mailbox
15: 5 0 0 0 GICv2 153 Level uart-pl011
21: 0 0 0 0 GICv2 119 Level DMA IRQ
22: 0 0 0 0 GICv2 120 Level DMA IRQ
23: 0 0 0 0 GICv2 121 Level DMA IRQ
24: 0 0 0 0 GICv2 122 Level DMA IRQ
33: 2966 0 0 0 GICv2 308 Level ttyS0
34: 0 0 0 0 GICv2 48 Level arm-pmu
35: 0 0 0 0 GICv2 49 Level arm-pmu
36: 0 0 0 0 GICv2 50 Level arm-pmu
37: 0 0 0 0 GICv2 51 Level arm-pmu
38: 0 0 0 0 GICv2 251 Level PCIe PME, aerdrv
39: 14 0 0 0 MIP-MSI 524288 Edge nvme0q0
40: 574328 0 0 0 MIP-MSI 524289 Edge nvme0q1
41: 0 703623 0 0 MIP-MSI 524290 Edge nvme0q2
42: 0 0 690764 0 MIP-MSI 524291 Edge nvme0q3
43: 0 0 0 657780 MIP-MSI 524292 Edge nvme0q4
44: 0 0 0 0 GICv2 261 Level PCIe PME, aerdrv
112: 370187684 0 0 0 rp1_irq_chip 6 Level eth0
137: 81104249 0 0 0 rp1_irq_chip 31 Edge xhci-hcd:usb1
142: 52967856 0 0 0 rp1_irq_chip 36 Edge xhci-hcd:usb3
146: 0 0 0 0 rp1_irq_chip 40 Level dw_axi_dmac_platform
167: 0 0 0 0 GICv2 305 Level mmc0
168: 2004112 0 0 0 GICv2 306 Level mmc1
169: 0 0 0 0 107d508500.gpio 20 Edge pwr_button
170: 0 0 0 0 intc@7d508380 1 Level 107d508200.i2c
171: 0 0 0 0 GICv2 150 Level 107d004000.spi
172: 0 0 0 0 intc@7d508380 2 Level 107d508280.i2c
173: 0 0 0 0 GICv2 281 Level v3d_core0
174: 0 0 0 0 GICv2 282 Level v3d_hub
175: 0 0 0 0 GICv2 104 Level pispbe
176: 0 0 0 0 GICv2 130 Level 1000800000.codec
177: 0 0 0 0 interrupt-controller@7c502000 2 Level 107c580000.hvs
178: 0 0 0 0 interrupt-controller@7c502000 9 Level 107c580000.hvs
179: 0 0 0 0 interrupt-controller@7c502000 16 Level 107c580000.hvs
180: 0 0 0 0 interrupt-controller@7d510600 7 Level vc4 hdmi hpd connected
181: 0 0 0 0 interrupt-controller@7d510600 8 Level vc4 hdmi hpd disconnected
182: 0 0 0 0 interrupt-controller@7d510600 2 Level vc4 hdmi cec rx
183: 0 0 0 0 interrupt-controller@7d510600 1 Level vc4 hdmi cec tx
184: 0 0 0 0 interrupt-controller@7d510600 14 Level vc4 hdmi hpd connected
185: 0 0 0 0 interrupt-controller@7d510600 15 Level vc4 hdmi hpd disconnected
186: 0 0 0 0 interrupt-controller@7d510600 12 Level vc4 hdmi cec rx
187: 0 0 0 0 interrupt-controller@7d510600 11 Level vc4 hdmi cec tx
188: 0 0 0 0 interrupt-controller@7c502000 1 Level 107c500000.mop
189: 0 0 0 0 interrupt-controller@7c502000 0 Level 107c501000.moplet
190: 0 0 0 0 GICv2 133 Level vc4 crtc
191: 0 0 0 0 GICv2 142 Level vc4 crtc
IPI0: 1297271 1395693 1406626 1416211 Rescheduling interrupts
IPI1: 59378151 97021326 96114234 95469328 Function call interrupts
IPI2: 0 0 0 0 CPU stop interrupts
IPI3: 0 0 0 0 CPU stop (for crash dump) interrupts
IPI4: 0 0 0 0 Timer broadcast interrupts
IPI5: 4094 4369 4271 4284 IRQ work interrupts
IPI6: 0 0 0 0 CPU wake-up interrupts
Err: 0
storagenode@pi5storagenode:~ $ cat /proc/meminfo
MemTotal: 8245648 kB
MemFree: 423712 kB
MemAvailable: 833440 kB
Buffers: 8112 kB
Cached: 481024 kB
SwapCached: 2336 kB
Active: 764064 kB
Inactive: 3586128 kB
Active(anon): 391840 kB
Inactive(anon): 3469776 kB
Active(file): 372224 kB
Inactive(file): 116352 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 204784 kB
SwapFree: 1056 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 3168 kB
Writeback: 0 kB
AnonPages: 3859216 kB
Mapped: 411184 kB
Shmem: 400 kB
KReclaimable: 56016 kB
Slab: 848336 kB
SReclaimable: 56016 kB
SUnreclaim: 792320 kB
KernelStack: 23936 kB
PageTables: 18752 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4327600 kB
Committed_AS: 8725616 kB
VmallocTotal: 68180246528 kB
VmallocUsed: 1060640 kB
VmallocChunk: 0 kB
Percpu: 1472 kB
CmaTotal: 327680 kB
CmaFree: 199328 kB
Thank you, that's really annoying lol… CmaTotal isn't what I was expecting… 256k or 512k, but CmaFree looks fixed, as it's not zero.
IRQ is still broken…
Can you run a uname -a?
If you fancy trying something to improve performance, we can try shifting the IRQ CPU handling…
The change won't survive a reboot, but you will get better usage of the cores for ZFS…
echo 8 > /proc/irq/142/smp_affinity
echo 4 > /proc/irq/137/smp_affinity
echo 2 > /proc/irq/112/smp_affinity
We move the xhci controllers to separate cores away from core 0, and then put eth0 on its own core, to spread the load.
It would be interesting to see if there's a change in the time for the walkers and such to run now…
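To confirm the masks took and to watch where the interrupt load lands afterwards, these read-only checks should do it:

cat /proc/irq/112/smp_affinity /proc/irq/137/smp_affinity /proc/irq/142/smp_affinity
watch -n 5 "grep -E 'eth0|xhci' /proc/interrupts"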
Hi @littleskunk - I don't know if you knew, but I create videos about my nodes from time to time, and I recently made this video of a 5-HDD SATA HAT for the Pi5 with PCIe:
Maybe that is of use?
storagenode@pi5storagenode:~ $ uname -a
Linux pi5storagenode 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux
Sounds interesting. How do I know if it gets better?
I appreciate you posting your RPi 5 adventures. 8 nodes may be optimistic… but you don't know until you try!
If I were to channel @IsThisOn for a moment: I bet if he wanted that many nodes on an 8GB RPi… he may:
- limit ARC to 2GB (the Pi will need RAM just to run 8 nodes)
- skip L2ARC entirely
- use a 2-NVMe hat with dual cheap 1TB NVMes
- chop each 1TB into 8×100GB partitions, then mirror them as metadata-only special devices, so each mirror could handle one HDD of up to 20TB (rough sketch below)
You'll never have enough RAM for ARC to be awesome, but anything filewalker-related should always be speedy, with no waiting for L2ARC to slowly fill!
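If someone went that route, the zpool side would look roughly like this (the pool name tank and the NVMe partition paths are placeholders, and you'd create the partitions first):

sudo zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
sudo zfs set special_small_blocks=0 tank   # 0 = metadata only, no small data blocks on the special vdev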
It turns out I can't fill the l2arc, because its usable size depends on the available RAM. Is a special metadata device different, or does it have the same limitation? I would expect the same limitation.
In ZFS, everything goes into ARC/memory first. As it fills… if you have L2ARC configured… the oldest/least-used entries slowly get moved to L2ARC and a placeholder pointer to them is left in ARC. So in a small sense those L2ARC pointers use up real RAM that could otherwise hold "real data" in ARC - but in general you still come out ahead, because having the evicted data in L2ARC is still faster than hitting the HDD.
But I'm surprised you can't have L2ARC at all? I don't see why not.
I understand what you're going for: a metadata-only L2ARC that would slowly populate over time… so it handles all filewalker IO. And having it persistent (which it sounds like isn't working?) would prevent you from having to slowly refill L2ARC from scratch after every reboot. Sounds like a good idea!
A metadata special-device is a bit different. It handles 100% of the metadata IO, all the time, with no cache to warm up, and it's not filled by data evicted from ARC. It has no RAM limitations. It also handles all metadata writes (something L2ARC doesn't do). And it can also (optionally) handle small files, so the HDD only deals with the larger stuff it's better at. But unlike L2ARC (which can disappear or fail at any time, no problem)… if you lose a metadata special-device, you've lost the filesystem. That's why, especially if it was handling 8 nodes for you, you'd want to at least mirror it. But just for testing you could use a single device like your Optane.
ARC is still doing its own thing on top, and may decide to hold some metadata in RAM: that's always still a win. But if it's not in ARC… all the metadata the filewalker touches would be on that special-device SSD. So always speedy, and never touching the HDD. But a potential point of failure.
It looks like IsThisOn got some of his fastest/most-consistent filewalker performance from his special-metadata config.