Large temporary folder

my system is overkill… i currently run my main node on 10 drives: 1x 1tb pcie caching ssd plus 3x raidz1 vdevs, raising my raw iops to three times that of a regular single-vdev zfs setup… and on enterprise sata… it’s not that it causes me any real trouble, but i just feel it’s a waste of resources that the filewalker runs every time i restart a node… which will happen from time to time…

i don’t see how it’s an advantage to not run the filewalker for, let’s say, 3-6 weeks, which will soon be my regular uptime… really the only reason i would restart the node would be to update to a new version, and then i might skip one or two if there isn’t anything i want in the update.

so that will run the filewalker once… while maybe dealing with ingress of millions of files, making mathematical deviation very likely long term… or for whatever reason… the further we move away in time, the less accurate a theoretical space accounting would be…

but if i restart my node, then try to fix something, give it an hour or two and restart it again, then it will have done space accounting twice… checking every damn file, which it literally already checked once… that seems wasteful, and not having a schedule for the filewalking also seems… well, i dunno what’s the worst that can happen… but i suppose somebody might run out of space…

today storagenodes can barely work on some SMR drives, and right now if people have issues and try to reboot their node during high traffic… well :smiley: it would only make it worse.

okay so long story short…
with my method i don’t think the drift would be bad, because if one uses the file timestamps then one gets an exact time, and the time the filewalker takes to run will not contribute to the deviation.

secondly, there should be a timestamp on the filewalking, so that when it’s been run once… it will not run again for at least a day or two… depending on the drift… maybe even longer…

maybe give it a parameter in the configuration so people can set it how they like… which would also mean people could disable it through that… tho i’m not sure it’s wise to give people such options… because it will eventually cause problems down the line… ofc, it being open source software, it’s not easy to know what others might do with the technology in 10 years… and so the ability to modify stuff is always good.

maybe not so short lol

Maybe I’m not following your solution, but whether you’re trying to get the date or the size of a file, you would need to do directory traversal to get either. So I don’t see how querying timestamps would be any faster than just querying the file sizes and summing them up?

Skipping the process at start could also lead to issues if someone moved their node to a different file system. Perhaps it is possible to only check free space on the file system and double check that original data is still there (either by sample or using the verification file in the storage location with the node id). That might be a way around it.

i’m saying that the whole problem with nailing down the size is that the filewalker takes time to run, and while it’s running the size of the storagenode changes, thus giving an inherently sizable variation… so if one uses the file timestamps, one can define an exact time, down to the second, at which the total space was accounted for… and from that point one only needs to do addition and subtraction of the ingress and deleted data.
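A minimal sketch of this cutoff idea as stated (all names are hypothetical, not the actual storagenode implementation): do one full count as of a single cutoff instant T, then keep the total current with plain arithmetic on traffic after T.

```python
# Sketch of the timestamp-cutoff idea (hypothetical names, not real
# storagenode code): a full byte count valid at one exact instant T,
# kept up to date by adding ingress and subtracting deletions after T.

def total_at(cutoff_total: int, ingress_since: int, deleted_since: int) -> int:
    """Total bytes = full count at cutoff T + bytes stored after T
    - bytes deleted after T."""
    return cutoff_total + ingress_since - deleted_since

# e.g. 1 MB counted at T, 250 kB stored and 50 kB deleted since:
print(total_at(1_000_000, 250_000, 50_000))  # 1200000
```

As the replies below note, the hard part is not the arithmetic but reliably tracking the deleted bytes, which this sketch simply assumes are known.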

i’m not saying it should be skipped completely… just that there is no point in running it 8 times in a row, if one keeps rebooting a node or a server otherwise has issues…

Oh, the change in space during filesize iteration hasn’t been a problem at all so far, as far as I am aware. We don’t do exactly what you describe, in order to avoid communication overhead between subsystems (e.g. when a piece is being deleted, we want to avoid needing to check whether the filesystem is being traversed at that second, and if so, whether the traversal has already passed the place where the blob is being deleted or not. Timestamps can’t solve that part).

Instead, we take a snapshot of the current size counts before traversal starts (call that s1) and after traversal is done (call it s2). Then when traversal is done we take the size counts that we found (call this N) and add half of the delta (s2-s1) to N, so the final values that we store are N+(s2-s1)/2.

The 1/2 part of that is because of this: we expect new blobs to be stored and old blobs to be deleted while we are traversing the directory structure. But some of the new blobs stored during iteration will be seen by the iteration, because they are stored in a part that the iteration hasn’t passed yet. On the other hand, some of the new blobs will not be seen by the iteration, because they get stored in a part that the iteration has already gone past. Statistically speaking, we expect half of the bytes from new blobs to be included in the new traversal count, and half of them not included in it. The same is true of deleted blobs.
Half of the blobs deleted during traversal time will be seen by the iteration, because the iteration passed them before they were deleted, and half of them will not be seen by the iteration because the deletion happened before the iteration reached them. Therefore, with the assumption that blobs received and blobs deleted are randomly and evenly distributed in the keyspace, and the assumption that each top-level directory in the blob space takes about the same amount of time to traverse, the correct total count is N + all_bytes_from_new_blobs/2 - all_bytes_from_deleted_blobs/2. And since s1 + all_bytes_from_new_blobs - all_bytes_from_deleted_blobs = s2, we get the count we want from N + (s2 - s1)/2.
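The midpoint correction described above can be sketched in a few lines (a simplified illustration, not the actual storagenode code; units and numbers are made up):

```python
# Sketch of the N + (s2 - s1)/2 correction described above.
# s1, s2 are the running size counters sampled before and after the
# traversal; N is the byte total the traversal itself observed.

def corrected_total(n: float, s1: float, s2: float) -> float:
    """Statistically, half of the churn (stores and deletes) during
    traversal is already reflected in N, so we credit the other half."""
    return n + (s2 - s1) / 2

# Example: the walk counted 10,000 GB while the running counters grew
# from 50,000 GB to 50,200 GB; half of that 200 GB of growth happened
# in parts of the keyspace the walk had already passed.
print(corrected_total(10_000, 50_000, 50_200))  # 10100.0
```

Note that if nothing changes during the walk (s1 == s2), the correction vanishes and the result is exactly N, matching the fast-iteration case mentioned below.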

This has the interesting property that it gets improved accuracy from both sides: When this is running on a node that doesn’t have much data, iteration is very fast and there isn’t much time for any large deviance to build up. But as we add more data, causing iteration to take a longer time, the estimate N+(s2-s1)/2 gets more and more accurate because the assumptions above are more likely to be reliable due to the Law of Large Numbers.

I certainly agree with that. Maybe one of the suggested mitigations could make things easier for people who need to tinker with the storagenode platform, until we have the resources to implement something better.


the snapshot solution is a pretty good one… does that mean it utilizes the zfs snapshot ability, or how would one actually take a snapshot easily? …i mean, right now i’m migrating my main storagenode, because as i learn i find new ways i want to try to do my setup…

these days tho… it’s gotten to 14tb in size… and migrating it… well, just reading it takes hours
the fastest i can go through all the metadata of the blobs is like 5-15 minutes, don’t have an exact time for that… but if it takes 15 minutes at the start, and 15 minutes at the end… for comparing snapshots… then in theory at least… much of my resources may go towards the snapshot thing…

because doing like a df command on the blobs folder doesn’t take as long as the filewalker does… actually takes a lot less…

yeah, my solution doesn’t really deal well with the deleted files… but using the timestamps has the advantage that there isn’t any inaccuracy, because it’s an infinitesimally small point in time one would use as the defining factor of when to stop or start.

migrating the node from a 3 hdd span back to a 4 drive raidz1, but at current speed it will take something like 18 days to complete… simply because of the iops required…
took 6 days copying from a 3x raidz1 pool (3x 1 hdd worth of iops) to a 3 drive span (3x 1 hdd worth of iops)
and now copying to 1 raidz1 it seems to be taking 3x as long… which makes good sense…

ofc i just started it, so it’s going to have a long time to gain or lose ground… but it’s an iops limitation, so it would make sense that it’s easy math.

but it goes to show how difficult the storagenode data can be to deal with…
the 6 day almost 7 day transfer could have been done in 10hours or so at max sequential speeds.
so it’s a bit difficult for me to imagine the data being handled with ease, and having it scanned every time i restart the node…

you say the method is resource light, but is that also true at the extreme end of storagenode capacity?

even doing the filewalker currently takes like 45 minutes to 1½ hours… depends on if it’s a storagenode reboot or a server “cold start” reboot
if it’s a cold start it takes 1½ hours, and that’s on a pool that has 3x the iops of most zfs pools, so on a regular pool it would or should take 4.5 hours!!! from when one starts the storagenode… until it’s done doing its “boot sequence”… yes, i know that some don’t consider it a boot sequence, but since it happens when you start something… well then…

it’s very clear in my iowait every time i boot the storagenode… it takes 45 min - 1½ hours for the iowait to drop below 2% avg again… sure, it’s not super heavy usage except just when it starts, for the first 15 minutes or so, where it will usually peak in iowait… at maybe 45%, then after that it runs at about 20% avg for the remaining period.

and the reason i have such a big change between cold start and warm start is that i usually run a sizable l2arc, so any file that has been used twice since the last reboot will be stored in the l2arc… thus all metadata will, after 1 storagenode boot, be stored in the l2arc pcie ssd, and after the 2nd storagenode boot will not leave for months… i also find it a bit odd that i don’t think i’ve ever seen it actually run the filewalker while the storagenode has been running for extended periods… i was looking for that when i had my node running for 45 days…

but i may have missed it… it’s not always easy to see a 45 minute, 15-17% change in iops on the graphs

sorry for ranting… but i got 18 days to burn waiting for my rsync to complete… i should try to change the blocksize to rsync’s max of 131072, but i’ll do that tomorrow… so i get a better avg speed… to see how much it helps xD if anything… seems to tho

18days!!!
imagine having a 14tb hdd whose SMART data says it’s dying… and it would take 14 days to move out the data… if one could reduce that by just 50%, it might save a lot of storagenodes from being lost and SNOs from a lot of grief

hmmm this is kinda interesting; i hadn’t bothered with a slog device on the temp pool because, since it was on sync=standard and was only being written to, i figured i didn’t want to bother with it…

now that i got around to adding one, and checking the avg speed while copying from 3 drives in a span to a 4 drive raidz1, the speed went from 8MB/s to 42MB/s, which was exactly the same speed i was getting the last time, copying between 3x 3 drive raidz1 pools, so 3x 1 hdd iops…
now i should only have 1 hdd worth of write iops and i get the same speed… which seems a bit weird… but it might be because the slog device makes the writes sequential, and then i’m bottlenecked on random read iops instead… ofc the storagenode is also running, so that also shaves a good bit off the top, since the pool is being read.

so down to 6 days again… oddly enough hadn’t expected that… but zfs can be a bit weird sometimes.

No, it’s not a snapshot of the filesystem, sorry if I explained that badly-- it’s just a snapshot of the current size counts (there is a different size value stored for each satellite, both in terms of bytes used on disk and paid bytes, and another value for blobs in the garbage directory). The snapshot I mean is just a copy of each of those numbers.

Yeah, df doesn’t traverse a directory structure, it just interrogates volumes (filesystems) about how much space they have used and free. It’s essentially just one system call per volume. My suggested mitigation #2 above (a config item that tells the storagenode that it can use the whole filesystem) is one way we could get away with doing the same thing as df: a single statvfs() call instead of traversing the directory structure to determine the total amount used.

If you want to compare a system tool that does roughly the same thing as the filewalker, use du -s or ls -lR.
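To make the difference concrete, here is a rough illustration in Python (the storagenode itself is a different codebase; these helper names are mine): `df` is essentially one `statvfs()` call per volume, while the filewalker-style count has to visit every file.

```python
# Rough illustration of the two approaches discussed above.
# df-style: one statvfs() system call per volume, O(1).
# filewalker/du -s style: walk the tree and stat every file, O(files).
import os

def df_style_used_bytes(mount_point: str) -> int:
    """One statvfs() call: used = (total blocks - free blocks) * fragment size."""
    st = os.statvfs(mount_point)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def filewalker_style_bytes(root: str) -> int:
    """du -s style: recurse and sum st_size of every file (slow on big trees)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file deleted while we were walking
    return total
```

The df-style number also includes everything else on the volume, which is why it only works when the node can assume the whole filesystem is its own, as mitigation #2 proposes.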

Yes, that would be a fine way to get an up to date count if it wasn’t for deletions. Unfortunately for this scheme, deletions do happen, and not taking them into account would generally cause more inaccuracy than what we have.

I would guess that you have more data than the typical storagenode, because there don’t appear to be many that take 4.5 hours.

I think we’re all agreed that it would be nice for that not to have to happen every time the storagenode starts, especially if it already has some valid recent totals to use.


i’m still pretty new at linux, so i don’t really follow the top part too well.

my point about the snapshot concept is: where do you get the number for the total data?
it takes time to run through the files… i think i timed my non-cached df and that took a good while to scan

yeah, i’m amongst the largest possible nodes currently on the network, because i joined shortly after the reset and have kept free space for it, and i’m on fiber with near 24/7 uptime and usually close to 99.9% successrates… the 4th 9 has been giving me trouble tho… xD

ended up getting a bit distracted while writing this response, so it took a few days… meanwhile my node migration finished, and i’m now running my 14tb storagenode on 1x hdd iops…

so far i’m 2 hours in since i started the storagenode on this pool, and it still hasn’t settled down to its “normal” iowait, tho i ofc don’t quite know what my normal iowait is at the moment, but i doubt it’s this… looks very much like the same iowaits i observed on the old 3x hdd iops pool.
which is why i suspect it’s not done doing whatever it’s doing during boot…

grabbed some screens of the graphs
sigh… 3 hours in and still not done, i think… will be very interesting to see how long the scrub takes now… at 1/3 the iops

best guesstimate says there isn’t more than 1 hour left until the storagenode returns to regular operation… can’t really start my scrub test before then… maybe i should have scrubbed before… but let’s be fair… that would mean scrubbing for maybe 3 days, and then when i’m done verifying the data i’d have to rsync it again, because it won’t be exactly the same as the active storagenode…

so decided to trust it.
here are the screens of the graphs

this one is from the old pool running the storagenode, tho it was cached in the l2arc… so pretty much optimal conditions at 3x hdd iops.

new 1x hdd iops pool storagenode boot no cache

as you can see the raised iowait are over a significantly longer timeframe.
the normal server iowait with the storagenode running is below 1%… maybe 0.5 or 0.25% avg
ofc this is not a huge load on the hdd at 5% iowait… but if one was rebooting the node all the time…

of if we think of the 14 day planned rollout, thats like 350 hour so the minimum current filewalker activity on a 30tb storagenode would be running for close to 6 hour+ before the node is done booting… thats literally 2% of storagenode best possible uptime… and then if we imaging having problems just 4 reboots and it would be 8% out of which 4% the high activity mark…
missed when it dropped down, because my avg is, i guess, 3 times higher… so no real surprise there; odd that the peak isn’t higher, but maybe that’s a priority thing or whatever… it takes 3 times as long tho, from what i can tell with limited testing… but pretty much as i would have expected…
noticed that the dips in my avg now, at their lowest, were down to 0.1%, which i would suppose means the disks were basically idle, thus done with whatever they were doing… took hours to get there tho… from the proxmox daily avg graph, i started the node at 8:30 and it was done around 12:30-13:00
as you can see, the avg levels out at way below 5%, then at 15:30 i started my scrub…


the earlier iowait is from the last of the node migration, which finished at 23:00 the night before or something like that; then i set rsync to run again so it was semi ready this morning, which is the island in between, and the reason the iowait goes up before node boot; ran a 3rd rsync before switching the running node off, then ran the command for the migrated one… at 8:30, as previously stated.

makes the data a bit annoying to read, but i just wanted to verify the numbers that i expected, which were pretty much spot on…
sorry if it looks like a bit of a rush job, because it was… want to get the 2nd 4 drive raidz1 connected to the pool so it can start to slowly balance out over both raidz1’s and thus share the IO load…

my successrates have dropped… :confused: because the iops this one raidz1 can do is too low to serve all incoming download requests.

scrub speed is also as expected…

zpool status
  pool: bitlake
 state: ONLINE
  scan: scrub in progress since Sat Oct 17 15:30:08 2020
        4.93T scanned at 860M/s, 670G issued at 114M/s, 17.3T total
        0B repaired, 3.78% done, 1 days 18:28:57 to go
config:

        NAME                                         STATE     READ WRITE CKSUM
        bitlake                                      ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31021EH1P62C  ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH2JDXB  ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR11021EH21JAB  ONLINE       0     0     0
            ata-HGST_HUS726060ALA640_AR31051EJSAY0J  ONLINE       0     0     0
        logs
          fioa2                                      ONLINE       0     0     0

errors: No known data errors

ofc that’s just an estimate, but it’s about 3 times higher than the 14-16 hours i would usually see on my semi well balanced 3x 3 drive raidz1 pool

so yeah… i haven’t really had a big problem with the filewalking at boot, but i could see it being a potentially annoying issue for big nodes troubleshooting or whatever… i fully expect it to run… i just don’t see why it should run repeatedly every time…

maybe just add a timestamp to it, so that if it’s been run once, it will wait a while before it runs again at node boot… or make it so that if the node has been rebooted multiple times in like a day, it will postpone the filewalker / space accounting thingamajig
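The cooldown idea could look roughly like this (a sketch only; the file name, config value, and function names are all made up, not actual storagenode options): persist the time the last walk completed, and skip the walk at boot if it finished within a configurable window.

```python
# Sketch of a filewalker cooldown (all names hypothetical, not real
# storagenode config): record when the last walk completed, and skip
# the walk at boot if that was less than COOLDOWN_SECONDS ago.
import json
import time

STATE_FILE = "filewalker_state.json"   # hypothetical state file
COOLDOWN_SECONDS = 2 * 24 * 3600       # e.g. two days, operator-tunable

def should_run_filewalker(now=None):
    """True if no completed walk is recorded, or the last one is stale."""
    now = time.time() if now is None else now
    try:
        with open(STATE_FILE) as f:
            last_done = json.load(f)["last_completed"]
    except (OSError, KeyError, ValueError):
        return True  # no (valid) record: run the walk
    return (now - last_done) >= COOLDOWN_SECONDS

def record_filewalker_done(now=None):
    """Persist the completion time after a successful walk."""
    now = time.time() if now is None else now
    with open(STATE_FILE, "w") as f:
        json.dump({"last_completed": now}, f)
```

As noted earlier in the thread, a check like this would need an escape hatch (e.g. when the node detects it was moved to a different filesystem), otherwise stale totals could carry over to the wrong disk.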

i’ve reduced my iops now; hopefully that won’t affect my performance too much, and if i want to fix it i’ll just add another 4 drive raidz1 to this pool, but i doubt it will be an issue, since 2x the iops of single hdd setups should be more than plenty… but we will see how my successrates are after the 1st quarter next year, which is how long i expect it will take before the pool will have balanced the load…

and then l2arc will have to pull its weight until then…

interestingly enough, either adding the l2arc helps my scrub… or it doesn’t require as much IO as i would have thought… not that it’s really relevant…