Sounds interesting, looking forward to reading more.
The filewalkers are the big pain in larger nodes.
I don't think it's a Pi5 limitation, though. It's more an HDD IOPS limitation.
The lower available RAM is a drawback but by and large I don't see my Pi5 node being CPU-constrained.
EDIT: I wonder how much better SATA would be over USB on the Pi5.
If we're going to go down that path, then we're right back at using something like this
Too messy.
I'd like something like a SATA version of the Argon Eon, for example.
Something that looks nice and self-contained.
Trailing wires make my OCD twitch
With sync=disabled and the metadata cache, the IOPS consumption should be lower.
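If "metadata cache" here means an L2ARC restricted to metadata, a minimal sketch of those two settings would be (the pool/dataset name storj is just a placeholder):

zfs set sync=disabled storj
zfs set secondarycache=metadata storj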
But let's face it: my current system consumes 154W altogether. To run all 8 drives I might need a second Pi5. Even then it should cut my electricity cost in half. So even if the Pi5 has some slowdowns, it already has a head start thanks to the lower operating costs.
First results are looking good. The filewalker isn't as fast as I was hoping for, but manageable. A full run takes about 3 hours; for a full drive I would estimate 9-12 hours. While the filewalker is running, the success rate goes down to 95%. While the filewalker is not active, the success rate stays between 98 and 99%. That's awesome.
Up next: insert more drives and see if the success rate stays that high. It takes a long time to prepare the drives; I need to run chown on the entire drive first, and that takes forever.
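For reference, this is just a recursive ownership change over every file on the drive, something like the following (the user/group and mount path are placeholders for whatever the node runs as):

sudo chown -R storagenode:storagenode /mnt/storj01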
I migrated all 8 hard drives to the Pi5. Success rate and data inflow are still strong. I am surprised how well it works. I was expecting to be able to run maybe 4-6 drives on one Pi5, but now even 8 drives are running fine, at least for now. I will keep that running for some time and also hope to see some garbage collection action. The first run will be a bit more expensive because the cache isn't filled yet.
I usually use Geekworm hardware for Pis - very robust and convenient.
Are they all USB-connected?
The case is linked further up in this thread.
I was able to test out the boundaries a bit. The CPU usage doesn't depend so much on the number of drives. It does scale with the number of uploads I am getting. Looks like the maximum is about 200MBit/s. At that point the Pi5 is running at 90-100% CPU usage.
I missed almost all the garbage collection runs this weekend. I don't know yet how that would impact my nodes. I will wait for these results before writing up a guide.
In terms of power consumption, my old system used 150W total and the new setup needs just 100W. 35W of that are for the router, firewall, phone and smart home box → 65W for the Pi5 + 8 drives. This means I am reducing my electricity bill by 150€ per year.
I did set up the multinode dashboard and netdata on the Pi5. It doesn't look like it steals too much CPU time. I will try to set up my grafana dashboard as well, but with just 14 days of history to reduce resource consumption. Netdata seems to show some incorrect numbers, but I haven't searched the internet for a solution yet.
I am hitting the first problem. journalctl tells me that my l2arc gets reset on reboot, and it doesn't look like I can change that. The problem is simply that the Pi5 doesn't have enough memory for my l2arc. I am not sure how big I could make the l2arc; that is beyond my current capabilities.
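For anyone debugging the same thing, assuming OpenZFS 2.x (where persistent L2ARC and its header accounting are exposed), two read-only checks show whether rebuild-on-boot is enabled and how much RAM the L2ARC headers are currently eating:

cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
awk '$1 == "l2_hdr_size" {print $1, $3}' /proc/spl/kstat/zfs/arcstats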
To make this a positive outcome: I have learned a lot about the limitations of a storage node. I still find it amazing that a Pi5 can reach and maintain 200MBit/s. Even when garbage collection and the used-space filewalker are running at the same time, the upload rate is not impacted. I don't fully understand why. I see that the success rate goes down to 80%, but it doesn't impact the throughput as much as I thought. That indicates a possible bug with node selection to me. I remember @elek was saying that the success tracker might have too short a history. So I am going to jump on that train and push the corresponding code change so that we can repeat the load test with a longer success tracker history, and maybe my experimental setup will be a great test object.
I am not sure what I want to change next. I could order a second Pi5 and run 4 storage nodes per Pi5. I did a short test run and dropped all the l2arcs except for 2 nodes. It looks like ZFS can still cut the filewalker runtime in half, and there is an argument to be made that this is acceptable speed. On the other hand, I have to try the same experiment with ext4 first. I want to test how well a plain vanilla ext4 node would run and which options I have to improve performance. In particular, the experimental cache that was recently added to the storage node might get an ext4 node to similar performance. And there is an inode cache thing that I could also try out.
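Just as an example of an inode-cache knob for the ext4 experiment (it may not be the "inode cache thing" meant above): the kernel's vm.vfs_cache_pressure sysctl controls how aggressively inode/dentry entries are reclaimed, and lowering it from the default of 100 keeps more of them in RAM for the filewalker:

sudo sysctl vm.vfs_cache_pressure=50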
I didn't do the math for ZFS. I believe I could run 2 nodes just fine and benefit from the l2arc performance boost. There are also some Pi5 alternatives with 16 or 32GB of RAM. Out of the box ZFS would use half the memory, but that can be changed. So to max out the l2arc benefits I would try 4GB for the storage nodes and 12/28GB for ZFS. At that point the l2arc might support 6/14 storage nodes.
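Capping ARC at a fixed size instead of the default half of RAM is usually done via the zfs_arc_max module parameter; a sketch for a 4 GiB cap (value in bytes), which would roughly match the split described above on the 8GB board:

echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
# to make it permanent, in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=4294967296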
Would you be able to post the outputs of:
cat /proc/interrupts
cat /proc/meminfo
Really curious to see how things are hanging together, as different kernels/distros will have different options applied. There's a long history on the RPi4 of IRQs being non-optimal, plus CMA issues. I haven't had a chance yet to try this out on an RPi5 with the stuff you are testing, so if you had a chance to post, that would be useful.
storagenode@pi5storagenode:~ $ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
9: 0 0 0 0 GICv2 25 Level vgic
11: 0 0 0 0 GICv2 30 Level kvm guest ptimer
12: 0 0 0 0 GICv2 27 Level kvm guest vtimer
13: 121821939 149649080 148596923 147620533 GICv2 26 Level arch_timer
14: 44241 0 0 0 GICv2 65 Level 107c013880.mailbox
15: 5 0 0 0 GICv2 153 Level uart-pl011
21: 0 0 0 0 GICv2 119 Level DMA IRQ
22: 0 0 0 0 GICv2 120 Level DMA IRQ
23: 0 0 0 0 GICv2 121 Level DMA IRQ
24: 0 0 0 0 GICv2 122 Level DMA IRQ
33: 2966 0 0 0 GICv2 308 Level ttyS0
34: 0 0 0 0 GICv2 48 Level arm-pmu
35: 0 0 0 0 GICv2 49 Level arm-pmu
36: 0 0 0 0 GICv2 50 Level arm-pmu
37: 0 0 0 0 GICv2 51 Level arm-pmu
38: 0 0 0 0 GICv2 251 Level PCIe PME, aerdrv
39: 14 0 0 0 MIP-MSI 524288 Edge nvme0q0
40: 574328 0 0 0 MIP-MSI 524289 Edge nvme0q1
41: 0 703623 0 0 MIP-MSI 524290 Edge nvme0q2
42: 0 0 690764 0 MIP-MSI 524291 Edge nvme0q3
43: 0 0 0 657780 MIP-MSI 524292 Edge nvme0q4
44: 0 0 0 0 GICv2 261 Level PCIe PME, aerdrv
112: 370187684 0 0 0 rp1_irq_chip 6 Level eth0
137: 81104249 0 0 0 rp1_irq_chip 31 Edge xhci-hcd:usb1
142: 52967856 0 0 0 rp1_irq_chip 36 Edge xhci-hcd:usb3
146: 0 0 0 0 rp1_irq_chip 40 Level dw_axi_dmac_platform
167: 0 0 0 0 GICv2 305 Level mmc0
168: 2004112 0 0 0 GICv2 306 Level mmc1
169: 0 0 0 0 107d508500.gpio 20 Edge pwr_button
170: 0 0 0 0 intc@7d508380 1 Level 107d508200.i2c
171: 0 0 0 0 GICv2 150 Level 107d004000.spi
172: 0 0 0 0 intc@7d508380 2 Level 107d508280.i2c
173: 0 0 0 0 GICv2 281 Level v3d_core0
174: 0 0 0 0 GICv2 282 Level v3d_hub
175: 0 0 0 0 GICv2 104 Level pispbe
176: 0 0 0 0 GICv2 130 Level 1000800000.codec
177: 0 0 0 0 interrupt-controller@7c502000 2 Level 107c580000.hvs
178: 0 0 0 0 interrupt-controller@7c502000 9 Level 107c580000.hvs
179: 0 0 0 0 interrupt-controller@7c502000 16 Level 107c580000.hvs
180: 0 0 0 0 interrupt-controller@7d510600 7 Level vc4 hdmi hpd connected
181: 0 0 0 0 interrupt-controller@7d510600 8 Level vc4 hdmi hpd disconnected
182: 0 0 0 0 interrupt-controller@7d510600 2 Level vc4 hdmi cec rx
183: 0 0 0 0 interrupt-controller@7d510600 1 Level vc4 hdmi cec tx
184: 0 0 0 0 interrupt-controller@7d510600 14 Level vc4 hdmi hpd connected
185: 0 0 0 0 interrupt-controller@7d510600 15 Level vc4 hdmi hpd disconnected
186: 0 0 0 0 interrupt-controller@7d510600 12 Level vc4 hdmi cec rx
187: 0 0 0 0 interrupt-controller@7d510600 11 Level vc4 hdmi cec tx
188: 0 0 0 0 interrupt-controller@7c502000 1 Level 107c500000.mop
189: 0 0 0 0 interrupt-controller@7c502000 0 Level 107c501000.moplet
190: 0 0 0 0 GICv2 133 Level vc4 crtc
191: 0 0 0 0 GICv2 142 Level vc4 crtc
IPI0: 1297271 1395693 1406626 1416211 Rescheduling interrupts
IPI1: 59378151 97021326 96114234 95469328 Function call interrupts
IPI2: 0 0 0 0 CPU stop interrupts
IPI3: 0 0 0 0 CPU stop (for crash dump) interrupts
IPI4: 0 0 0 0 Timer broadcast interrupts
IPI5: 4094 4369 4271 4284 IRQ work interrupts
IPI6: 0 0 0 0 CPU wake-up interrupts
Err: 0
storagenode@pi5storagenode:~ $ cat /proc/meminfo
MemTotal: 8245648 kB
MemFree: 423712 kB
MemAvailable: 833440 kB
Buffers: 8112 kB
Cached: 481024 kB
SwapCached: 2336 kB
Active: 764064 kB
Inactive: 3586128 kB
Active(anon): 391840 kB
Inactive(anon): 3469776 kB
Active(file): 372224 kB
Inactive(file): 116352 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 204784 kB
SwapFree: 1056 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 3168 kB
Writeback: 0 kB
AnonPages: 3859216 kB
Mapped: 411184 kB
Shmem: 400 kB
KReclaimable: 56016 kB
Slab: 848336 kB
SReclaimable: 56016 kB
SUnreclaim: 792320 kB
KernelStack: 23936 kB
PageTables: 18752 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4327600 kB
Committed_AS: 8725616 kB
VmallocTotal: 68180246528 kB
VmallocUsed: 1060640 kB
VmallocChunk: 0 kB
Percpu: 1472 kB
CmaTotal: 327680 kB
CmaFree: 199328 kB
Thank you, that's really annoying lol… CmaTotal isn't what I was expecting… 256k or 512k, but CmaFree looks fixed, as it's not zero.
IRQ is still broken…
Can you run a uname -a?
If you fancy trying something to improve performance, we can try shifting the IRQ CPU handling…
The change won't survive a reboot, but you will get better usage of the cores for ZFS…
echo 8 > /proc/irq/142/smp_affinity
echo 4 > /proc/irq/137/smp_affinity
echo 2 > /proc/irq/112/smp_affinity
We move the xhci controllers to separate cores away from core 0, and then put eth0 on its own core, to spread the load.
It would be interesting to see if there's a change in the time for the walkers and such to run now…
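To confirm the masks took and to watch where the interrupt load lands afterwards, these read-only checks should do it:

cat /proc/irq/112/smp_affinity /proc/irq/137/smp_affinity /proc/irq/142/smp_affinity
watch -n 5 "grep -E 'eth0|xhci' /proc/interrupts"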
Hi @littleskunk - I don't know if you knew, but I create videos about my nodes from time to time, and I recently made this video of a 5-HDD SATA HAT for the Pi5 with PCIe:
Maybe that is of use?
storagenode@pi5storagenode:~ $ uname -a
Linux pi5storagenode 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux
Sounds interesting. How do I know if it gets better?
I appreciate you posting your RPi 5 adventures. 8 nodes may be optimistic… but you don't know until you try!
If I were to channel @IsThisOn for a moment: I bet if he wanted that many nodes on an 8GB RPi… he may:
- limit ARC to 2GB (the Pi will need RAM just to run 8 nodes)
- skip L2ARC entirely
- use a 2-NVMe hat with dual cheap 1TB NVMes
- chop each 1TB into 8×100GB partitions, then mirror them as metadata-only special devices, so each mirror could handle one HDD of up to 20TB (rough sketch below)
You'll never have enough RAM for ARC to be awesome, but anything filewalker-related should always be speedy, with no waiting for L2ARC to slowly fill!
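If someone went that route, the zpool side would look roughly like this (the pool name tank and the NVMe partition paths are placeholders, and you'd create the partitions first):

sudo zpool add tank special mirror /dev/nvme0n1p1 /dev/nvme1n1p1
sudo zfs set special_small_blocks=0 tank   # 0 = metadata only, no small data blocks on the special vdev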
It turns out I can't fill the l2arc, because its usable size depends on the available RAM. Is a special metadata device different, or does it have the same limitation? I would expect the same limitation.
In ZFS, everything goes into ARC/memory first. As it fills… if you have L2ARC configured… the oldest/least-used entries slowly get moved to L2ARC and a placeholder pointer to them is left in ARC. So in a small sense those L2ARC pointers use up real RAM that could otherwise hold "real data" in ARC - but in general you still come out ahead, because having the evicted data in L2ARC is still faster than hitting the HDD.
But I'm surprised you can't have L2ARC at all? I don't see why not.
I understand what you're going for: a metadata-only L2ARC that would slowly populate over time… so it handles all filewalker IO. And having it persistent (which it sounds like isn't working?) would prevent you from having to slowly refill L2ARC from scratch after every reboot. Sounds like a good idea!
A metadata special-device is a bit different. It handles 100% of the metadata IO, all the time, with no cache to warm up, and it's not filled by data evicted from ARC. It has no RAM limitations. It also handles all metadata writes (something L2ARC doesn't do). And it can also (optionally) handle small files, so the HDD only deals with the larger stuff it's better at. But unlike L2ARC (which can disappear or fail at any time, no problem)… if you lose a metadata special-device, you've lost the filesystem. That's why, especially if it was handling 8 nodes for you, you'd want to at least mirror it. But just for testing you could use a single device like your Optane.
ARC is still doing its own thing on top, and may decide to hold some metadata in RAM: that's always still a win. But if it's not in ARC… all the metadata the filewalker touches would be on that special-device SSD. So always speedy, and never touching the HDD. But a potential point of failure.
It looks like IsThisOn got some of his fastest/most-consistent filewalker performance from his special-metadata config.