High IO delay with direct bind mount on ZFS

how much data do you have on your node?

there is a lot of “stuff” that storj nodes do on the backend that can trigger a lot of activity:

  • used-space walker, which runs at startup and counts all files; if you have terabytes of Storj data this can take days to finish
  • garbage collection jobs, which run 4x daily and can take hours
  • trash deletion, which is usually fast, but if you have hundreds of gigs of trash it can also take many hours

I recommend looking at the recent bits of the storj logs for the words “used”, “Retain” or “empty”; that way you can see whether those jobs have already finished or are still running.
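For example, something along these lines (the container name and log path below are just placeholders for your setup):

  # docker-based node
  docker logs storagenode 2>&1 | grep -Ei "used|retain|empty" | tail -n 20

  # or, if the node logs to a file
  grep -Ei "used|retain|empty" /path/to/storagenode.log | tail -n 20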

So in other words, having the disk hit with lots of activity is normal.

For larger nodes (say over 7 TB) many folks have started using either an L2ARC (metadata only) or a special vdev for metadata to help things along.

You either need more RAM, a special device, or both. You can get away with an L2ARC to a degree, but a special device will be better.
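If you do have slots for it, adding a metadata special vdev looks roughly like this (pool and device names are placeholders; it should be mirrored, because losing the special vdev loses the whole pool):

  zpool add tank special mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B
  # optionally also store very small blocks on it, per dataset:
  zfs set special_small_blocks=16K tank/storj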

The first two settings save IO by avoiding unnecessary operations, the third setting does nothing, and the txg timeout would also do nothing unless you have tons of RAM.

It’s not magic. The node generates a lot of IO; you can defer and coalesce it, but eventually the data needs to be written.

How much RAM is there?


This is what my storage node looks like:


Unfortunately I do not have free slots for a ZFS special device (I understand it has to be mirrored) and only have 16 GB of RAM (physically 32 GB, but I am running other VMs on the device as well).

This is what my ARC cache looks like:

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   100.1 %    7.8 GiB
        Target size (adaptive):                       100.0 %    7.8 GiB
        Min size (hard limit):                          6.2 %  499.9 MiB
        Max size (high water):                           16:1    7.8 GiB
        Anonymous data size:                            0.2 %   16.0 MiB

So I guess the only solution is to acquire more RAM.

Yes, and also disable the lazy mode if the node is running in a VM, because the host is not aware of low-IO-priority processes inside the VM.

The VM running the storagenode accesses the disks directly via a storage controller passed through to it over PCI.
How is lazy mode set? I could not find anything meaningful on it besides some GitHub issues.

It’s enabled by default; you may disable it.
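Roughly like this (double-check the exact option name with storagenode setup --help for your version):

  # in config.yaml, then restart the node
  pieces.enable-lazy-filewalker: false

  # or appended as a flag to the docker run command
  --pieces.enable-lazy-filewalker=false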
See also

Hey,

I would like to give an update regarding this issue. I think the wait time was caused by my using HSMR drives without knowing it. These drives slow down when you write for a sustained period (because their cache gets filled) or when doing random reads and writes. They allow for cheaper, higher-density storage at the cost of bad performance with random reads/writes.

Don’t be like me: get normal drives instead (CMR, for example) and avoid SMR and HSMR.


Oh no.

So ZFS is running inside the VM and only has whatever RAM is allocated to the VM?

I would change that. The node is self-contained software; it does not even need containerization, let alone virtualization.

ZFS works best when it can access a lot of free RAM. By cramming it into a VM you are limiting it to the size of that VM, even though you have plenty of free RAM on the server.

They slow down when the disk fills. The node does not write sequential IO, so the disk will not try to use the CMR section; it will send data straight to the SMR zones… with a lot of small files the disk runs out of free segments very fast, and then every write becomes a read-modify-write.

I don’t know who thought that was a great idea. There is literally no application where SMR disks work well. So who would choose to buy them? The whole product line is based on misleading customers.

For SMR drives, some folks have modified their config.yaml to reduce concurrent connections to a lower number (like <10) to reduce the write traffic.
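For example (the value is up to you; double-check the option name against the comments in your config.yaml):

  # limit simultaneous uploads to the node; 0 means unlimited
  storage2.max-concurrent-requests: 10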

Moving to a bare-metal NAS configuration is something I have been considering for quite some time; I even got all the hardware for it. What is holding me back is:

  • the extra 100+ watts of power consumption needed to run a second server for all the other VMs I am currently hosting on the machine with the NAS
  • possible performance penalties from all the apps accessing the data via NFS. Right now most of my software runs either on the NAS host itself or accesses it via an internal virtual bridge network on the hypervisor. Moving all that to another machine, accessing the NAS data over the network, running two-hour snapshots, etc. makes me concerned about all the stale file handle and performance issues I am about to face…

So, I run storj with all my storage shares mounted on NFS. It’s not supported, Alexey scolds me. Don’t do it if you don’t want the brooding majesty of his displeasure. But it still basically works.

“Local” NFS shares (the NAS and the storj docker are on the same system, in different virtual machines) have performance that is pretty much indistinguishable from local disks. I did set up an L2ARC for my ZFS disks’ metadata and that helped a lot.

There are sometimes problems with ownership and permissions when I first set up a shared disk for storj on a new node, and I usually have to do a brute force chmod or chown on the disk on the NAS to make it work going forward.

Now, I also have a couple of nodes that use files on NFS mounts over a slow network connection. Those are tougher, although they still work:

  • Databases are stored locally, not on NFS
  • NFS needs to be async (a rough export/mount example is sketched after this list)
  • lazy filewalkers fail, use non-lazy
  • used-space, garbage collection and trash walkers all take much longer to run. I think it’s more about latency than bandwidth. Something like 42 hours to delete 300 GB of trash.
  • once in a while, when under load, the node gets backed up on write requests and then runs out of RAM. This hasn’t happened in the last couple of weeks, but it happened a few times under test load.
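For reference, the async export and mount I mean look roughly like this (hostname, paths and subnet below are placeholders):

  # /etc/exports on the NAS, then: exportfs -ra
  /tank/storj  192.168.1.0/24(rw,async,no_subtree_check)

  # /etc/fstab on the node host
  nas:/tank/storj  /mnt/storj  nfs  rw,hard,vers=4.1  0  0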

Why do you need an extra server? Can’t you just remove the wrapper VMs and run the things inside them directly on the host you run the VMs on? I.e., I don’t understand why removing a VM requires you to add 100 W to the picture.

This will also address this concern, even though there is nothing wrong with NFS.


This is because of SQLite Over a Network, Caveats and Considerations, and is related to locking. But you don’t have to host the SQLite databases over the network; they can live on the local host. They are small…
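For example, something like this in config.yaml (the path is just a placeholder; point it at any local disk):

  # keep the sqlite databases off the network share
  storage2.database-dir: /var/lib/storj/dbs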

Bingo. I need to start reading the whole posts before replying…

Deletion is always expensive, and often serialized. So yes, any latency will matter, but who cares? It’s deletions. Nobody needs that data. So what if it takes a month to process?

wow. A triage opportunity :slight_smile:


If my server will only run the NAS bare-metal, I need some place to run all the other stuff that is currently running alongside it.

I tried to play around with the ARC settings and also added some more RAM to the NAS VM, but it looks like the ARC is full of misses. How much RAM do I need to just serve Storj? I have even disabled caching for all datasets except the storagenode one, so the whole cache should be dedicated to it.
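(What I mean by disabling caching per dataset is roughly this; pool and dataset names are placeholders:)

  zfs set primarycache=none tank/general   # no ARC caching for the non-storj datasets
  zfs get primarycache                     # verify; the storj dataset stays at the default “all”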

    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
17:31:34     0       0     0       0     0      0    0   9.7G   9.8G   2.7G
17:31:39  4.8K     102    94    4.7K    98      0    0   9.7G   9.8G   2.6G
17:31:44   787      28    89     743    87     13    0   9.8G   9.8G   2.6G
17:31:49   814      14    85     799    86      0    0   9.8G   9.8G   2.6G
17:31:54   762      11    81     736    87     14    0   9.8G   9.8G   2.6G
17:31:59   862      28    85     819    88     13    0   9.7G   9.8G   2.6G
17:32:04   703      91    96     606    85      2   50   9.7G   9.8G   2.6G
17:32:09   877      35    85     827    89     14    0   9.7G   9.8G   2.6G
17:32:14   668      18    88     648    85      0    0   9.8G   9.8G   2.6G
17:32:19   851      14    85     821    89     16    0   9.8G   9.8G   2.5G
17:32:24   638      13    84     623    85      0    0   9.8G   9.8G   2.6G
17:32:29   733      21    85     695    87     15    0   9.8G   9.8G   2.6G
17:32:34   842      58    87     767    88     16    0   9.7G   9.8G   2.6G
17:32:39  4.3K      90    97    4.2K    98     40    0   9.8G   9.8G   2.5G
17:32:44   668      25    92     643    84      0    0   9.8G   9.8G   2.5G
17:32:49   938      15    73     907    89     14    0   9.7G   9.8G   2.6G
17:32:54   646      31    87     612    84      3    0   9.8G   9.8G   2.6G
17:32:59   977     113    96     847    88     17    0   9.8G   9.8G   2.5G
17:33:04   739      79    96     646    85     13    0   9.7G   9.8G   2.6G
17:33:09   783      23    86     758    88      0    0   9.7G   9.8G   2.5G

Which is interesting, because arc_summary -s archits shows the opposite:

ZFS Subsystem Report                            Sat Oct 05 18:34:00 2024
Linux 6.8.12-2-pve                                            2.2.6-pve1
Machine: omv6 (x86_64)                                        2.2.6-pve1

ARC total accesses:                                                 1.2G
        Total hits:                                    98.7 %       1.2G
        Total I/O hits:                                 0.1 %       1.3M
        Total misses:                                   1.2 %      13.6M

I don’t know how to read the top report, but remember that storj read I/O is highly random so it’s unlikely that the underlying requested files will be caught in ARC that often.

arc_summary shows a high hit rate because it includes metadata reads, of which Storj (and ZFS in general) has a boatload.

Setting up an L2ARC for just metadata is helpful for Storj; so is a special vdev for metadata.
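A rough sketch of the L2ARC-for-metadata variant (pool, dataset and device names are placeholders):

  # add an SSD as a cache device
  zpool add tank cache /dev/disk/by-id/nvme-cache
  # limit what it caches for the storj dataset to metadata only
  zfs set secondarycache=metadata tank/storj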


This modification affects customers directly. With choice-of-n node selection it shouldn’t be needed at all - the slow node wouldn’t be selected too often.

@molnart
If you know how to cook it - go ahead, just be aware that this setup may consume more memory than iSCSI in the same configuration.

Or you may try to enable a badger cache:
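If I recall the option name correctly, it looks roughly like this (verify it against storagenode run --help for your version):

  # in config.yaml
  pieces.file-stat-cache: badger

  # or as a docker run flag
  --pieces.file-stat-cache=badger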

Somehow I managed to resolve the high IO wait, although I don’t know what exactly was the solution:

  • I allocated some extra RAM to the NAS VM
  • played around with ZFS caching, basically switched off all caching for the “general” datasets and enabled only metadata caching for storj, backups and apps
  • moved some VMs off the hypervisor to free up some more RAM (but none of those were accessing any data from the pool with storj, and they were very lightweight in general)

Looking at the logs, I see a filewalker process finishing after ~27 hours, having processed around 1.7 TB of data, and it looks like this was the moment when IO wait went back to normal. Subsequent filewalkers took only a few minutes on around 10-15 GB of data.

EDIT: …and the IO wait is back. Apparently a new filewalker job started in the morning and has been running for 2+ hours already. BTW, how do I make logging not include all the piecestore stuff? It has generated a 1.5 GB log file in just 3 days and I don’t see how this information would be useful to me. Also, with the storagenode log files my grep skills are somehow failing me, because I can see the filewalker events in the logs, but grepping for “firewalker” shows no results…

There are several filewalkers; some of them run regularly and others only on start.
The used-space filewalker runs only on start; all the others run periodically, see
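To find them in the logs, grepping for “walker” is the most forgiving (the log tags say “filewalker”), and you can cut the log volume with the log level settings. A rough sketch; the log path is a placeholder and the per-subsystem option may not exist in older versions:

  # match every filewalker variant
  grep -i "walker" /path/to/storagenode.log | tail -n 20
  grep -Ei "used-space|gc-filewalker|retain|trash" /path/to/storagenode.log | tail -n 20

  # config.yaml - raise the global level to silence routine piecestore lines
  log.level: warn
  # newer versions also accept a per-subsystem override, something like:
  # log.custom-level: piecestore=WARN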