High I/O wait, controller issue?

Hello, I unfortunately made a mistake setting up Storj back then. I thought the more nodes, the faster it grows, so I set up a 12 TB drive with 12 nodes, all on one HDD, which was dumb. Now I have lots of small nodes with rather high I/O. To combat the high I/O wait, I started to transfer each node to its own drive. I have an old Sharkoon T9 case with 9x 5.25" bays, which I filled with 3x 4-slot hot-swappable bays. All 12 bays are connected to a PCIe SATA card with 20 SATA ports.

Now I’m having the issue that even though every node has its own drive, I still have high I/O wait, mostly 100% (even after 24 hours, when the filewalker should have finished its job). I made a little benchmark with hdparm, and it gave me a drive speed of around 200 MB/s, which is good. I know the SATA card is limited to some extent regarding throughput, but throughput isn’t the big factor for Storj; it’s rather the random reads and writes. Is it possible that my PCIe SATA card can handle good sequential throughput, but not a bunch of small random reads and writes? I’m using WD white-label drives, which are CMR and not SMR, so that shouldn’t be the issue.
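For reference, a sequential read test along these lines (device name is just an example) is what produces that kind of number; it says nothing about random I/O:

```bash
# Sequential buffered read test; ~200 MB/s is typical for these drives
sudo hdparm -t /dev/sdf
```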

I hope someone can help me out. Thanks in advance.

Which mainboard and which controller, and what slot is it connected to?

I’m using an MSI B450M MORTAR Titanium with an AMD Ryzen 7 5700G. The SATA card is plugged into the first PCIe slot. It’s a random Chinese one, since those are almost the only ones you can find on Amazon (https://amzn.eu/d/iv2MgGi).

If that SATA card is PCIe 3.0 x1 that’s still 1 GB/s, which sounds like more than enough for a dozen nodes in normal operation. But I don’t know if that’s true when they’re all running the filewalker together (that’s max 80 MB/s each… probably more like 50 MB/s through the port multipliers)?

When you notice performance problems… has the filewalker actually finished running? (Like, do you see any of them in a process list?) I can easily imagine it taking more than a day if, say, you just restarted your nodes and there are a dozen filewalkers running in parallel.
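A quick way to check, assuming Docker containers and a container name like storagenode1 (the process and log wording can differ between versions, so treat the grep patterns as examples):

```bash
# Look for filewalker subprocesses inside a node container
docker top storagenode1 | grep -i filewalker

# Or check the recent log for walker activity
docker logs --since 24h storagenode1 2>&1 | grep -i walk
```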

Something like the LSI 9[2|3]xx-16i cards (or an expander if you have 2 slots) would certainly help with parallel-filewalker performance… but maybe this is something you can manage just by only starting 2-3 nodes at a time… so you never have more than 2-3 filewalkers running?
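One simple way to stagger them from the host, assuming container names like storagenode1…storagenode12 (adjust the names and the delay to your setup):

```bash
# Start the nodes a few hours apart so their filewalkers don't all run at once
for n in $(seq 1 12); do
    docker start "storagenode$n"
    sleep 3h   # rough guess; long enough for one filewalker pass on a small node
done
```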

Stop all filewalkers and restart all nodes. See if the problem persists after a few hours, just to let GC finish.

How can I stop the filewalkers on all nodes? My nodes are running via Docker on my Debian system. Does the block size of the drive also matter? On my 18 TB drive I had to convert the filesystem to 64-bit so I could enlarge my copied 3 TB to 18 TB with “resize2fs -p”. ChatGPT told me this converts my drive to 1K blocks, but when checking via “fdisk -l /dev/sdf” it tells me “I/O size (minimum/optimal): 4096 bytes / 4096 bytes”. I’m asking because even though only 3 TB of Storj data is stored on the 18 TB drive, it doesn’t have that high an I/O wait, but the RAM fills slowly and steadily with just the 18 TB drive running. So there seems to be a bottleneck there too. I’m running my node via docker compose with “- storage2.piece-scan-on-startup=false” as a parameter under “environment:”.
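If you want to check the actual filesystem block size (fdisk’s “I/O size” line describes the device, not the filesystem), something like this works on ext4; the partition name is just an example:

```bash
# Show the ext4 block size and whether the 64bit feature is enabled
sudo tune2fs -l /dev/sdf1 | grep -E 'Block size|Filesystem features'
```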

I think even when all filewalkers are running, 80 MB/s should still be enough: with the small file sizes and random reads and writes, the drive is pretty “slow” compared to sequential reads or writes anyway, due to the characteristics of HDDs. So I think it should be enough. And even at only 80 MB/s, 3 TB should be fully read within around 11 hours (best case, what the SATA controller will handle).

Snorkel’s idea of watching performance with no GC/FW running is a good one: so you know what “normal” is. And you’re right, 50-80 MB/s is only a theoretical number for sequential transfers; if the disk is being asked to read a ton of tiny files it may only do 5-10 MB/s.
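To establish that baseline, iostat from the sysstat package shows per-disk utilization and latency (the 5-second interval is just an example):

```bash
# Extended per-device stats every 5 seconds:
# watch %util (how busy each disk is) and r_await/w_await (latency in ms)
iostat -x 5
```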

Is there a node problem you’re trying to solve? Like, do any show as offline due to high latency? Or are any failing audits due to disk load? Or does the UI take too long to render? Or do the logs show upload/download failures due to timeouts? (If each node is still showing ingress, and disk usage is increasing… if you made no changes, would you still be OK?)
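A quick way to check the last two points, assuming default Docker logging and a container name like storagenode1 (the exact log strings can vary by version):

```bash
# Count recent audit requests and failed/cancelled transfers in a node's log
docker logs --since 24h storagenode1 2>&1 | grep -c GET_AUDIT
docker logs --since 24h storagenode1 2>&1 | grep -Ec 'canceled|failed'
```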

I’m having the issue that the dashboards take forever to load, and so does the multinode dashboard. Also, when inspecting the logs, I see lots of cancellations on uploads (which are incoming data from my node’s point of view). That’s why I want to address the issue, so I can use the full potential of my possible ingress (which is at around 100 GB a day per IP address at the moment).

Boy, that’s what I call a bottleneck.

Even I had to dig deep to figure out that the metal GPU slot is PCIe 3.0.

As for the adapter card… it’s a 3.0 x1 chip, so the 0.985 GB/s without overhead is right.
But there are also three 5-port multipliers on it to get to 12 ports.
Big fail: in reality you have 12 × 1 GB/s links (labeled 6 Gb/s) squeezed through 1 GB/s, which results in 82 MB/s per drive in the best-case scenario. FYI, this is not enough.

You could mitigate it by moving the DBs to an SSD connected to the mainboard (see the sketch further down).

Utilize the M.2 slot and the mainboard SATA ports ASAP.
Maybe an M.2-to-6x-SATA adapter is an option.
(Theoretically there should still be ports left for a database SSD (SATA) and the system drive.)
For the databases of 12 nodes, consider a high-TBW drive.
(A good-quality 512 GB one, or bigger, will do.)
Leave at most 4 drives on your 12-port adapter; that could work.

What you are seeing is an IOPS problem (hence the slow access to the dashboard).
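A rough sketch of the DB move, assuming a Docker node; the container name and paths are just examples, and the relevant storagenode option is storage2.database-dir (take a backup first):

```bash
# Stop the node so the SQLite files are not in use
docker stop -t 300 storagenode1

# Copy the databases from the storage directory to the SSD
mkdir -p /mnt/ssd/storagenode1-db
rsync -a /mnt/hdd1/storagenode/storage/*.db /mnt/ssd/storagenode1-db/

# Point the node at the new location, e.g. by setting
#   storage2.database-dir: <path as seen inside the container>
# in config.yaml and mounting /mnt/ssd/storagenode1-db into the container,
# then start the node again
docker start storagenode1
```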

The mainboard has 4 SATA ports, so I could connect some HDDs there first to reduce overhead. How do I move the DBs to an SSD? How much space do they usually take? Do the DBs really use that much I/O (IOPS)? I calculated around 80 MB/s for each HDD and thought it would be enough; in random read/write scenarios an HDD mostly isn’t faster than around 5-10 MB/s anyway. Why are M.2 SATA adapters better than the PCIe ones? They mostly use the same chipsets.

Be careful with the manual, it could be outdated in some places.

Around a GB, not that much.

From my experience: yes.

[screenshot: IOPS stats of a CMR drive]
Here are the stats of my new drive; yours are probably lower. A filewalk isn’t even close to 3 MB/s, yet it saturates the IOPS completely, on top of the normal IOPS of the node itself.
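If you want to see that effect yourself, a small random-read test like this (read-only against the raw device; the device name and parameters are just examples) usually lands in the single-digit MB/s range on an HDD:

```bash
# 4 KiB random reads for 30 seconds, read-only, so no data is modified
sudo fio --name=randread --filename=/dev/sdf --readonly \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=16 --runtime=30 --time_based --group_reporting
```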

Careful here: maybe they are, it depends (that was a quick shot from me).
The optimum would be SAS controllers in IT mode,
with PCIe gen 3 and 4 or 8 lanes (a used one, perhaps?).
(The second big PCIe slot has 8 lanes, if I remember right.)
(It supports SATA with the right cable; my knowledge ends there.)
The PCIe standard differs between gen 1, 2, 3, 4 (consumer) and in the number of lanes that are connected. Your controller uses 1 lane at gen 3.
See here :nerd_face:

So at gen 3, 2 lanes for 6 drives are better than one lane for 12 (no matter which mainboard slot you put it in).
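To verify what the card actually negotiated, lspci can show the link speed and width (the address 01:00.0 is just an example; find yours with the first command):

```bash
# Find the SATA controller's PCI address
lspci | grep -i sata

# Show the negotiated PCIe link speed and width for that device
sudo lspci -vv -s 01:00.0 | grep -i 'LnkSta:'
```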