Storj solution for HDD at 100% during pieces scan

Yup. Also having an issue with this on my end. Wish we could just tune it down a little bit so it wasn’t overwhelming everything. Let it run at 50% of the IOPS or something.

1 Like

even at 100% my 7TB node already takes 12 hours for that filewalker with no other activity… :sweat_smile:

1 Like

That’s not as straightforward as one might think.

As with everything, this is a matter of priorities. The current code is functional (definitely not optimal), but it gets the job done, and it’s arguably even correct to run it like this.
We have an internal ticket to get back to this as soon as time allows.

Let’s be honest, we would all rather have more customers driving more payout than silent disks (because of no data and/or usage), right? :slight_smile:

11 Likes

it’s more of an iowait issue than noise for me.

no way of adding some small sleeps between different subdirectories or files?

I agree of course. Lots of more important things than an annoying filewalker… Still, I wish it were configurable to not run at all, so I could start it externally or something, so that not all my drives do it at the same time.

1 Like

In terms of adding sleeps, that would apply to everyone. We definitely want to add more configuration to it, but this requires tests and such.
That said, we definitely can link to the right pieces of code and hope for PRs :slight_smile:
Is that an option?

EDIT: It may also be beneficial to read the files in chunks and process them in batches, to make the IO more sequential.
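Roughly, the idea would be something like this (just a sketch, not the actual storagenode filewalker; the batch size, pause, and blobs path are made-up values for illustration):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// walkThrottled scans every piece file under root, reading only the
// metadata (size), and sleeps after each batch so regular node traffic
// gets some disk time in between. batchSize and pause are hypothetical
// knobs, not real storagenode settings.
func walkThrottled(root string, batchSize int, pause time.Duration) (int64, error) {
	var total int64
	count := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info() // stat only, the file contents are never read
		if err != nil {
			return err
		}
		total += info.Size()

		count++
		if count%batchSize == 0 {
			time.Sleep(pause) // yield the disk between batches
		}
		return nil
	})
	return total, err
}

func main() {
	// hypothetical values: pause 100ms after every 1000 files
	total, err := walkThrottled("/mnt/storagenode/storage/blobs", 1000, 100*time.Millisecond)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("used space:", total, "bytes")
}
```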

6 Likes

maybe that’d help, idk. Personally I don’t have any golang skills to do that. But maybe someone does and is annoyed enough to create a PR :smiley:

3 Likes

That’s the power of being open source and having an awesome community!

8 Likes

Hello dear all,

100% HDD usage does not depend on the SATA disk type; any storagenode may show this utilization after a node restart, after a storagenode update, or at random.

In the real world of storage, users must pay for LIST requests, which generate intensive IOPS. The scam is that these requests are not made by Tardigrade users but by the storagenode code itself.
A big cloud provider is actually paid at least for:

  • Storage (MB/GB/TB)
  • PUT/COPY/POST/LIST Requests
  • GET/SELECT and Other Requests

On the cloud side you will never see a directory traversal (“pieces scan”) executed server side… It’s only a decentralized cloud thing, because providers don’t manage the hardware directly. Imagine if big cloud providers ran a recursive lookup over the files stored in their datacenters? They don’t! Because they are confident in their hardware, using RAID or multiple controllers across multiple sites… For me, Storj should pay for every IOPS made on the hardware (a “request” in the object storage world…).

Storj should really find another solution to ensure the integrity of the data than an intensive pieces scan, or at least reduce the intensity of the scan (which is currently not paid) in the code, via the system cache, by exporting the db files to another path, or by some other solution.

Imagine doing an ls --recursive on an object storage volume with some millions of files: how many LIST request fees would you have to pay?

I am available to continue to clarify this and to help with my IT infrastructure skills.

Best regards.

1 Like

You’re comparing sticks to cars…

The code is open source; feel free to make changes and run your own version of a storagenode without the filewalker if you are unhappy with Storj not paying you for IOPS…

Or raise a PR to improve the situation.

But never ever will you get paid for IOPS lol

Besides: Not every cloud storage provider demands payment for operations. Some may limit the amount of IOPS, sure, but there are still lots of providers that don’t charge you for put/del/ls/etc (but I guess all S3 or similar object storage providers might. My point is, you can have cloud storage without paying for iops).

2 Likes

if we are getting paid by the iops, how much do we get…

my ssd cache can handle 100000-200000 iops sustained 24/7… it’s basically brand new so i’m sure there is lots of life in it… :smiley:
getting paid by the iops seems fair to me lol… screw that, it’s like paying your taxi driver by the MPH
or gear shifts… doesn’t have to make sense, so long as i can compete lol

how about this… we get paid by fan noise… wattage draw… so the best setup would be a RPI in an electric kettle with a hdd on top

the filewalker sucks… that’s just how it is… especially if there aren’t enough iops to take from…
but the filewalker doesn’t have to run very often, in fact it only has to run like every month or on every reboot.

just add a mirror hdd :smiley:

1 Like

It helps my server.
After all, the filewalker does not read the file data (there is no point in doing so, since the node has no way to know whether a file is corrupt or not); it just scans the metadata.
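Roughly speaking, per file it only needs something like this (a simplified illustration, not the actual node code; the command name and argument handling are just for the example):

```go
package main

import (
	"fmt"
	"os"
)

// Simplified illustration of a metadata-only check (not the actual
// storagenode code): the piece file is stat'ed, never opened, so only
// the inode is read, not the stored data.
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: statpiece <path-to-piece-file>")
		os.Exit(2)
	}

	info, err := os.Lstat(os.Args[1]) // one metadata read per file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// size and mtime come straight from the inode, which is why the
	// scan is IOPS-bound rather than bandwidth-bound
	fmt.Printf("size=%d bytes, modified=%v\n", info.Size(), info.ModTime())

	// deliberately absent: os.ReadFile(os.Args[1]); reading the piece
	// contents would not tell the node anything useful anyway
}
```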

The virtual disk IOPS load after updating and restarting my node:
image

Some other time:
image

Last 7 days:
image

As you can see, the filewalker usually finishes in less than an hour. The prolonged increased load on February 14th was because a scrub was running at the time (zfs runs it once a month by default).

My server has 106GB RAM, 16GB is given to the node VM. Zpool is on the host, the node stores data inside ext4, which is inside a zvol.
The pool is made of two 6-drive RAIDZ2 vdevs, most drives are CMR 7200RPM, two are SMR 5400RPM (I am going to replace them soon). The pool has SLOG (which helps writes only), but no L2ARC.
The node has 17.83TB of data stored (according to CLI dashboard).

I don’t see a comparison in your post?

plenty of performance… I have a single 5400RPM drive and it took almost 12 hours. Makes sense that your setup takes only an hour. 2x 6-drive RAIDZ2 with 7200RPM drives probably has 12 times my IOPS.

That’s true of course.

no L2arc means only RAM cache… granted you have 106GB of ram but still… it might help a bit but not enough to say that a cache solves the iops problem.

1 Like

my l2arc holds a lot of metadata and it seems to be chiefly what benefits the filewalker…
or at least that’s how i have understood it.

the l2arc cannot hold all the data, but the metadata doesn’t take up much space and i guess it saves a ton of iops… when scanning through everything.
or that’s my interpretation, if anyone knows better i would like to hear your views / opinions on what’s going on.

there is also that small dedicated metadata vdev one can run on zfs… but if that dies the pool dies…

and from what i can tell the l2arc seems to try to perform the same function if it has plenty of capacity.

from my tests the filewalker does take a bit more than an hour, more like 2½… but the initial peak in iops isn’t very long… like 15-30 minutes of peak io activity, then it drops down and continues semi-amped for, like pentium says, an hour or so, then drops to a still increased level, only slightly over the norm, until completely fading out after about 2½ hours.

tho it’s been a bit since i tested it, but with the new update i will soon be running it on my system which is pretty active.

The filewalker on my node (over 17TB of data) finishes in under an hour (at least judging by the IOPS load graph), where it ran for 7 hours for the OP with, presumably, less data.

A RAIDZ vdev has about the random IOPS of a single drive. Linear reads (zfs send) would be fast, but random reads, not so much.

If I could trigger filewalker manually, I could drop caches and see how that works. However, after rebooting the VM (and especially after rebooting the host), even starting the node up takes some time, where now it is pretty much instant.

oh 17TB, yeah looks a bit faster then.

So a raidz really does only have the random iops of a single drive? I already forgot how the iops are managed on raidz since I switched my setup… Maybe we can compare read ops directly: during my filewalker run the drive had around 80 read ops per second at 100% utilization.

You can, by restarting your vm/node. But you can’t drop your caches, as you only have the RAM cache (arc); well, unless you reboot the host, then the cache is empty.

My filewalker took almost 12 hours with an empty cache after a host reboot. Not sure how long it would take now that the 8GB l2arc is filled but that might be a bit small for a 7TB node.

image

This is taken from the VM, so if something was cached on the host, it would look like the virtual disk is very fast.

I have restarted the host, we will see how long the filewalker runs now. Keep in mind that the caches will start to fill up pretty much immediately.

1 Like

i think i have a ton of tests i did on this a while back, but it’s on a downed OS on an offline hdd so i will need to look for it…

but if memory serves, there is a big difference between no l2arc and l2arc…
my l2arc ssd driver failed during a kernel update because i hadn’t used dkms or whatever it’s called correctly… so i ran some tests on the filewalker process while i was there anyway…

it was many times faster, pretty sure i got a few screenshots for comparison.
it should also be very visible with the l2arc empty… though it might be slightly faster than with no l2arc,
but not a lot… might just be down to my l2arc size and the storagenode’s reduced iops on repeat reads.

After a reboot it took about the same time to run
image


image

The problem is that the cache started filling up very quickly


And I always got a rather good hit ratio:

(these two graphs are from the host, so they account for all VMs running there; there is no way to find out how much cache is used or what the hit rate is for a single pool, let alone a zvol).

well, it is my understanding that the cache only fills up with things loaded from the hdd, so it doesn’t actually have much impact on the filewalker, because everything the filewalker loads ends up in the cache but isn’t actually read again afterwards.
But your read ops seem to be around 300 per second, that’s at least 5 times mine.
Still surprising that it is so fast. (or at least that’s how it should be, theoretically… I haven’t looked at the filewalker code)
If the cache had such a big impact, I should see the same benefits. Your arc is using ~18GB at the end and I have 8GB arc and 8GB l2arc, so it should be around the same (more or less), and still it took me almost 12 hours. Ok, maybe because I had 3 hdds running the filewalker at the same time. That could have taken some arc and bandwidth, idk how big that impact was. But my 7TB HDD always needs more than just an hour for the filewalker, even with full caches.

It also tries to prefetch some data. In addition, the ext4 block size is 4K, but the zvol block size is 64K, so when 4K of the virtual disk is read, 64K of data is probably cached; if there’s some metadata in the other 60K and it’s accessed later, it comes from the cache. Also, some metadata (for directories) may be accessed multiple times, I don’t really know.

There was a high cache hit ratio just after the reboot though.