There is a lot of “stuff” that Storj nodes do on the backend that can trigger a lot of activity:
the used-space filewalker, which runs at startup and counts all files; if you have terabytes of Storj data this can take days to finish
garbage collection jobs, which run 4x daily and can take hours
trash deletion, which is usually fast, but if you have 100s of GB of trash it can also take many hours
I recommend looking at the recent bits of the Storj logs for the words “used”, “Retain” or “empty”, and you can see whether those jobs have finished or are still running.
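For example, something like this will show the tail end of those jobs (the log path is just a placeholder for wherever your node logs end up):

    # look for recent used-space, garbage-collection (retain) and trash activity
    grep -iE "used|retain|empty" /path/to/storagenode.log | tail -n 20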
So in other words, having the disk hit with lots of activity is normal.
For larger nodes (say over 7TB) many folks have started using either l2arc (metadata only) or a special vdev for metadata to help things along.
You either need more RAM, or a special device, or both. You can get away with L2ARC to a degree, but a special device will be better.
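A hedged sketch of the special-vdev route, with placeholder pool and device names (depending on the pool layout it may not be removable later, so plan it carefully):

    # add a mirrored special vdev that will hold the pool's metadata
    zpool add tank special mirror /dev/disk/by-id/ssd-a /dev/disk/by-id/ssd-b
    # optionally also steer small blocks of the storj dataset onto it
    zfs set special_small_blocks=16K tank/storj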
The first two settings save IO by avoiding unnecessary operations, the third setting does nothing, and the txg timeout would also do nothing unless you have tons of RAM.
It’s not magic. The node generates a lot of IO; you can defer and coalesce it, but eventually the data needs to be written.
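If anyone wants to experiment with the txg timeout anyway, it is just a ZFS module parameter (the value here is arbitrary):

    # raise the transaction-group commit interval from the 5 s default (runtime only)
    echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
    # make it persistent across reboots
    echo "options zfs zfs_txg_timeout=10" >> /etc/modprobe.d/zfs.conf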
Unfortunately I do not have free slots for a ZFS special device (I understand it has to be mirrored) and only have 16 GB of RAM (physically 32 GB, but I am running other VMs on the device as well).
The VM running the storagenode accesses the disks directly via a storage controller passed through to it over PCI.
How is lazy mode set? I could not find anything meaningful on it besides some GitHub issues.
I would like to give an update regarding this issue: I think the wait time was caused by my using HSMR drives without knowing it. These drives slow down under sustained writes or random reads and writes, because their cache fills up. They allow for cheaper, higher-density storage at the cost of bad performance with random reads/writes.
Don’t be like me: get normal drives instead (CMR, for example) and avoid SMR and HSMR.
So ZFS is running inside the VM and only has whatever RAM is allocated to the VM?
I would change that. The node is self-contained software; it does not even need containerization, let alone virtualization.
ZFS works best when it can access a lot of free RAM. By cramming it into a VM you are limiting it to the size of that VM, even though you have plenty of free RAM on the server.
They slow down when the disk fills. The node does not write sequential IO, so the disk will not try to use the CMR section; it sends data straight to the SMR area… With a lot of small files the disk runs out of free segments very fast, and then every write is read-modify-write.
I don’t know who thought that was a great idea. There is literally no application where SMR disks work well. So who would choose to buy them? The whole product line is based on misleading customers.
Moving to a bare-metal NAS configuration is something I have been considering for quite some time; I even got all the hardware for it. What is holding me back is:
the extra 100+ watts of power consumption needed to run a second server for all the other VMs I am currently hosting on the same machine as the NAS
possible performance penalties from all the apps accessing the data via NFS. Right now most of my software either runs on the NAS host itself or accesses it via an internal virtual bridge network on the hypervisor. Moving all that to another machine, accessing the NAS data over the network, running two-hour snapshots, etc. makes me worried about all the stale-file-handle and performance issues I am about to face…
So, I run storj with all my storage shares mounted on NFS. It’s not supported, Alexey scolds me. Don’t do it if you don’t want the brooding majesty of his displeasure. But it still basically works.
“Local” NFS shares (the NAS and the storj docker are on the same system in different virtual machines) have performance that is pretty much indistinguishable from local disks. I did set up an L2ARC for my ZFS disks’ metadata and that helped a lot.
There are sometimes problems with ownership and permissions when I first set up a shared disk for storj on a new node, and I usually have to do a brute-force chmod or chown on the disk on the NAS to make it work going forward.
Now, I also have a couple of nodes that are using NFS-mounted files over a slow network connection. Those are tougher, although they still work:
Databases are stored locally, not on NFS.
NFS needs to be async.
Lazy filewalkers fail; use non-lazy (see the sketch after this list).
Used-space, garbage-collection and trash walkers all take much longer to run. I think it’s more latency than bandwidth; like 42 hours to delete 300 GB of trash.
Once in a while, when under load, the node will get backed up on write requests and then run out of RAM. This hasn’t happened in the last couple of weeks, but it happened a few times under test load.
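For reference, a minimal sketch of the async export and the non-lazy setting, assuming an example export path and subnet and the config key used by recent storagenode releases:

    # /etc/exports — export the storj share async (placeholder path and subnet)
    /tank/storj  192.168.1.0/24(rw,async,no_subtree_check)

    # storagenode config.yaml — switch off the lazy filewalker
    pieces.enable-lazy-filewalker: false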
Why do you need an extra server? Can’t you just remove the wrapper VMs and run the things you run in them directly on the host? I.e., I don’t understand why removing a VM requires you to add 100 W to the picture.
This will also address this concern, even though there is nothing wrong with NFS.
Bingo. I need to start reading whole posts before replying…
Deletion is always expensive, and often serialized. So yes, any latency will matter, but who cares? It’s deletions. Nobody needs that data. So what if it takes a month to process?
I tried to play around with the ARC settings and also added some more RAM to the NAS VM, but it looks like the ARC is full of misses. I have even disabled caching for all datasets except the storagenode, so the whole cache should be dedicated to it. How much RAM do I need just to serve Storj?
Which is interesting, because arc_summary -s archits shows the opposite:
ZFS Subsystem Report Sat Oct 05 18:34:00 2024
Linux 6.8.12-2-pve 2.2.6-pve1
Machine: omv6 (x86_64) 2.2.6-pve1
ARC total accesses: 1.2G
Total hits: 98.7 % 1.2G
Total I/O hits: 0.1 % 1.3M
Total misses: 1.2 % 13.6M
I don’t know how to read the top report, but remember that storj read I/O is highly random so it’s unlikely that the underlying requested files will be caught in ARC that often.
arc_summary shows a high hit rate because it includes metadata reads, of which storj (and ZFS in general) has a boatload.
Setting up an L2ARC for just metadata is helpful for Storj, or a special vdev for metadata.
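A rough example of the metadata-only L2ARC variant, with placeholder pool and device names:

    # add an L2ARC device and restrict what it caches to metadata
    zpool add tank cache /dev/disk/by-id/nvme-ssd
    zfs set secondarycache=metadata tank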
This modification affects customers directly. With a choice of n in the node selection it shouldn’t be needed at all: the slow node wouldn’t be selected too often.
@molnart
If you know how to cook it - go ahead, just be aware that this setup may consume more memory than iSCSI in the same configuration.
Somehow I managed to resolve the high IO wait, although I dunno what exactly the solution was:
allocated some extra RAM for the NAS VM
played around with ZFS caching; basically I switched off all caching for the “general” datasets and enabled only metadata caching for storj, backups and apps (roughly as sketched below)
moved some VMs away from the hypervisor to free up some more RAM (but none of those were accessing any data from the pool with storj and they were very lightweight in general)
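The caching change, roughly, looked like this (dataset names are placeholders for my pool layout):

    # no data/metadata caching for the general datasets…
    zfs set primarycache=none     tank/general
    # …and metadata-only caching for storj, backups and apps
    zfs set primarycache=metadata tank/storj
    zfs set primarycache=metadata tank/backups
    zfs set primarycache=metadata tank/apps
    # verify what is set where
    zfs get -r primarycache tank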
Looking at the logs, I see a filewalker process finishing after ~27 hours, having processed around 1.7 TB of data, and it looks like this was the moment when the IO wait went back to normal. Subsequent filewalkers took only a few minutes on around 10-15 GB of data.
EDIT: …and the IO wait is back. Apparently a new filewalker job started in the morning and has been running for 2+ hours already. BTW, how do I make the logging not include all the piecestore stuff? It has generated a 1.5 GB log file in just 3 days and I don’t see how this information would be useful to me. Also, with the storagenode log files my grep skills are somehow failing me, because I can see the filewalker events in the logs, but grep firewalker shows no results…
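On the grep side, the log entries are spelled “filewalker” (with an l), which is likely why that pattern comes back empty; and the log level can be raised in the storagenode config.yaml so the INFO-level piecestore lines are dropped. A quick sketch, with a placeholder log path:

    # find the filewalker entries (note the spelling)
    grep -iE "filewalker|retain|trash" /path/to/storagenode.log | tail -n 50

    # storagenode config.yaml — log only warnings and errors instead of every piecestore transfer
    log.level: warn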
There are several filewalkers; some run regularly, others only on start.
The used-space filewalker runs only on start; all the others run periodically, see