Temporary files on Node

well sadly it’s not as simple as that… your hdd will still be writing the data in 512B or 4KiB sectors, which is what an IO represents, and that is the limitation… the advantage you gain isn’t a decrease in IOPS, but rather an increase in the HDD’s ability to process IOPS when the writes are sequential.

in theory the fundamental number of IOPS required hasn’t changed, but since the HDD will be moving the heads around less, the latency decrease is immense.
Without a doubt i would run it at 4MiB, but do keep in mind that if some uploads take 1 minute or 10 minutes (not sure what the max is), then each file being transferred during that time would require 4MiB

so you might run into some issues with peak loads, i would suggest having a swap ssd just in case.

figured i would dig into that a bit… try and check what kind of numbers i would get…

october seems to be the peak month for me in recent times.
with 1.3 mil uploads and with there being 43200 minutes in a month, let’s call that 30 uploads a minute as my monthly avg…

so let’s say that’s okay, it’s not like 30 files a minute is ever too much… granted some uploads can take a few minutes, so if we say 10 minutes, that’s 300 files in flight on avg, and still that’s well under 4GiB of RAM usage, only slightly over 1GiB.
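
to spell out the arithmetic behind that (assuming every in-flight file holds a full buffer the whole time):

30 uploads/min x 10 min = 300 files in flight
300 files x 4 MiB = 1200 MiB ≈ 1.2 GiB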

going into detail, some days have even more, like oct 15th with an avg of 40 uploads a minute,
and around 50 on oct 16th… or 55 to be exact, avg per minute over a full day…

that’s 10 minutes essentially filling 2GiB, on avg…
then the peaks are maybe 4x that, so 8GiB… ofc in that we are assuming 10 minutes with no uploads being completed, which is ofc unrealistic… but it’s to get a sense of how much memory would actually be required at known peaks.

so at the very least i would say having less than a couple of GiB of RAM is probably where the 4MiB write buffer setting stops being a good idea. it’s not easy for me to dissect my logs to a more detailed degree, so i cannot say what the true peaks are, but i doubt most of them will outdo our 3 month peak, and that sustained over a 10 min period.

in fact this might be what the storagenode uses RAM for… i suppose node size would also mean a lot… but my numbers are from my 14.4TiB node, so at least it’s in the higher end of the spectrum, even tho it’s quite a ways behind the leading pack these days.

and with that in mind, i wouldn’t be against running a 4MiB write buffer on basically any system… i doubt it would ever use more than 1-2GiB, and that’s a high estimate i believe…

the benefits tho are increased IOPS / write throughput for the HDD, reduced latency because of fewer seeks, less work time and wear because of fewer seeks, improved reads for the same reason,
and less fragmentation, which again gives more sequential reads.

these are the exact same reasons why i run a sync always policy and have an ssd write cache, so for me i’m not too sure if there is a benefit.

can’t recommend running sync always tho… unless you got a good ssd with high sustained iops / throughput. i had to replace the 1 QLC and 1 MLC ssd that i was using as write cache, because they couldn’t keep up and were causing latency on the entire system, even tho they were in a dynamic load balance configuration, so that it would just write to whichever had the time for it… and the QLC was partitioned so that it would run as SLC only.

i would say the benefits of running higher write buffers are without a doubt an advantage, and i would set it at the highest possible advantageous number, which imo would be where one gets the most usage… so… there sure are a lot of small files stored in my blobs folder… if each of them got a 4MiB write buffer, then somebody writing a ton of small files (or however they are created) could quite easily push it beyond the 1-2GiB avg

ofc something like KSM should be able to recognize that the allocated memory is empty after a while. i’m sure somebody has a list of the filesize distribution on storagenodes we can use… my linux-fu isn’t good enough to check that easily.

yep, so as we can see there are quite a number of smaller files also, a rough count seems to be somewhere between a 40/60 and 50/50 split, so what might happen is that, if you see a big increase in 1K, 2K, 4K files over a short period, it may allocate a full write buffer for each of them in memory, on top of your other caches.

this is also the same reason why i limited my zfs recordsizes down to 256K, because running the higher allocation would cause my RAM and caches to dump their data when prepping for incoming data.
so if i get 100 files in a second, it might prep for that to continue, and drop 500 x 256K worth of cached data from RAM to be able to store the 5 sec worth my write cache can hold.
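
spelled out (assuming one full record per incoming file):

100 files/s x 5 s = 500 records
500 x 256 KiB ≈ 125 MiB of cache space cleared for it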

granted that may not seem like a lot or a big issue, but i got tons of caches and it’s really annoying to see it throw out 4-8 GiB if i basically just touch the server, which was what i started seeing a lot when running 1M and 512K recordsizes.

so tho i would certainly advocate for increasing the number… i’m not convinced 4MiB is the right choice… seems to me that it would be more beneficial around maybe 1MiB - 2.3 MiB at most… especially if it allocates the memory up front, which it very well might…

else i doubt it would ever run out of memory for this, and remember 256k is twice as fast as the 128K default… so 512K is x4 and 1M is x8, so what you are doing by going to 4MiB is essentially multiplying your memory usage by 32x
for a benefit of maybe 4x speed… because past 1MiB the returns become greatly diminished.

hell you might see 95% of the same performance with 512KiB as compared with 4MiB at most likely 1/8th the memory cost.
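
to lay out the scaling i’m talking about, all relative to the 128 KiB default (the write reduction only applies to files bigger than the buffer, so the speed side is optimistic):

256 KiB -> 2x the memory per upload, roughly half the writes for big files
512 KiB -> 4x the memory, ~4x fewer writes
1 MiB   -> 8x the memory, ~8x fewer writes (and the gains start flattening out here)
4 MiB   -> 32x the memory, for maybe ~4x real-world speed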

so tho i would advocate turning the number up if you need to, and maybe even for most SNOs…
i would certainly be very careful about jumping the default setting by a factor of x32

it just makes any sort of problem you may run into in the future 32 times worse.

IMHO
and ofc i’m not your local witch doctor, financial advisor or psychic
just an interested builder of things.


Your research also makes sense, maybe 1-2 MiB will be safer. I have about 8 nodes per PC, with 16GB to 24GB RAM.
I have 1 node with this setup, and while the others use only 18-25MB, this one uses 25-50MB

My 14.4 TB node

The memory usage over the last month of the container running it (it hasn’t been in the container for a full month, which is why the graph is incomplete).
as you can see the 3rd highest peak in memory usage seems to correlate with the ingress peak from the dashboard, and keep in mind this is the max graph… not the avg

The day max graph looks a bit more pedestrian. also keep in mind Docker uses KSM, which will reduce the memory required when running many nodes. this graph doesn’t include the KSM effect, and KSM doesn’t respond to memory demands immediately, it’s sort of a deduplication type thing for memory usage… so it takes a bit for it to kick in and reduce usage after the memory is requested.

my docker numbers look even more pedestrian.

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
a5aa090c5d66        storagenode         6.47%               69.04MiB / 2GiB     3.37%               55.7GB / 527GB      0B / 0B             26

but outside of docker there will also be a bit more memory used, tho it’s kinda slight… i’ve measured it at a max of 250MB deviation from the docker usage, during multi GB peaks when i had it allowed 4GB of memory.

the smaller nodes are still mostly just flatlined… so depending on node size, this also becomes a consideration for memory usage… however the memory for the ingress write buffer would ofc be fairly independent of the rest of the storagenode usage… reboots (filewalker and such) also pull a certain amount of memory…

so looking at a single instant, rather than the storagenode’s peak memory usage over time, might also come back to bite you in the ass.

it would be nice if we could actually test it out. in ZFS we did look a lot at recordsizes and i ended up at the 256k mark, because it fits very nicely as a divider into the most common 2.3M file size.

i would start low and crank it up from there, see what happens… but you want long term data to verify it’s good… i would start at a max of 512k. ofc if your nodes aren’t super big you can most likely get away with the max if you wanted… but i would set 512k as the max for now… run it for a month or two and then check your numbers, maybe compare the graphs with previous months.
and then the next time there is peak ingress, i would experiment with cranking it much higher, where i can monitor the results over a short period… because you might see unexpected results from setting it too high…
and that is difficult to monitor for over months-long periods until the next peak ingress rolls around.

one could conclude that 2.5M would be the max that’s worth doing… or thereabouts… enough to fit the 2.3M files that make up 60% of the ingress.

the rest are smaller, so they will not benefit from that… so 1M or 1.25M would give you 2 or 3 writes per file. if memory serves tho… i think when we did the math for zfs, 256K and even 512K came out as a nice ratio… so 512K would be 5 writes… so really, setting it at 1M vs 512K doesn’t seem like the best plan… better to set it at 1.25M then… so you get it down to 2 writes, because then you will have decreased 60% of your file writes by 1/3.
and at some point the writes also might get so large that the hdd doesn’t really benefit from getting them in larger chunks, because it will also have its own algorithms and caches to optimize the writes… such as NCQ
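
the rough math for one of those 2.3 MiB files (just dividing the file size by the buffer size and rounding up):

2.3 MiB / 512 KiB  -> 5 writes
2.3 MiB / 1 MiB    -> 3 writes
2.3 MiB / 1.25 MiB -> 2 writes
2.3 MiB / 2.5 MiB  -> 1 write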

so you really want to test, rather than just crank it… 512K seems like a very reasonable spot imo
it would give your disks 4x the performance, and like you say you got plenty of ram, and thus at 512K you wouldn’t have to keep a close eye on it for months…

and then test when the time comes, if you really think going higher will help… but i kinda doubt it…

we can do the math the other way around… your hdd can do about 200-400 raw IOPS, maybe even 800 for reads… but let’s say 200 to lowball it, i doubt you can find a hdd from the last decade or two that cannot do 200 IOPS (aside from SMR)

200 IOPS x 512K means a throughput of 100MB/s, and that’s a very low estimate… so no matter how chaotic it was to do those writes, you would always get about 100MB/s
ofc heavy fragmentation can decrease that due to long seeks… but let’s assume you are working with a hdd that isn’t cramped…

ofc if the hdd is full then you drop 50% in throughput because of the lower linear speed on the inner tracks.
but seek time is still the same… so that’s just bandwidth… so on a near-capacity hdd it would be 50MB/s minimum… and still your hdd should be able to do 400 iops… so still kinda 100MB/s
but let’s keep it at 200 IOPS… an SMR can drop down to about 40 IOPS
so that’s 1/5th… so our lowball estimate would be down to 20MB/s

and 40MB/s if we assume 80 IOPS rather than 40… :smiley: i figured if the conventional hdd is running at a disadvantage in this estimate, then the SMR should be too.
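
the back of the envelope numbers behind that (IOPS x write size, ignoring seeks and drive caching):

200 IOPS x 512 KiB ≈ 100 MB/s (lowballed conventional hdd)
400 IOPS x 512 KiB ≈ 200 MB/s (the same hdd on a good day)
 40 IOPS x 512 KiB ≈  20 MB/s (SMR worst case)
 80 IOPS x 512 KiB ≈  40 MB/s (SMR given the same benefit of the doubt)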

but still perfectly reasonable numbers, and as for increasing it further… well, would your storagenode ever exceed these speeds…?

maybe, but still 20-40MB/s even with an SMR running at the lowest speeds it can… and that’s for basically random writes
if using a 512K write buffer…

ofc these are estimates, approximations and generalizations… but i really doubt there is much to be found setting it higher… aside from possible memory problems.

sorry if this got extensive, but i’ve been dabbling a bit in this storage stuff lol
figured i would elaborate on my reasoning… so it should make sense.

I started thinking about it and there might be a better way to handle it.

Use a 2MB buffer (configurable) until active uploads reach 64MiB (configurable), then switch to 128KiB buffers (configurable)… until it’s below 64MiB again. This should allow the storage node to use more memory in most cases, but handle things more safely when there are a lot of concurrent uploads.
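
Roughly, the idea as a sketch (illustrative only, not the actual implementation; the bufferPool type and names are made up, only the 2 MiB / 64 MiB / 128 KiB numbers are the configurable defaults mentioned above):

package buffer

import "sync/atomic"

const (
    kib = int64(1) << 10
    mib = int64(1) << 20

    largeBuffer = 2 * mib   // per-upload buffer while total usage is low
    smallBuffer = 128 * kib // fallback buffer once the budget is exceeded
    budget      = 64 * mib  // total buffer memory across active uploads
)

// bufferPool tracks how much buffer memory active uploads currently hold.
type bufferPool struct{ inUse int64 }

// acquire picks a buffer size for a new upload based on current usage;
// the caller calls release with the same size when the upload finishes.
func (p *bufferPool) acquire() int64 {
    size := largeBuffer
    if atomic.LoadInt64(&p.inUse) >= budget {
        size = smallBuffer
    }
    atomic.AddInt64(&p.inUse, size)
    return size
}

func (p *bufferPool) release(size int64) {
    atomic.AddInt64(&p.inUse, -size)
}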


Hope we will see it in the next release :slight_smile:


Hey @Egon,
I’m not sure I follow all that exactly… but from my understanding: If the node is aware of how much RAM is currently being used by ongoing ingress/egress operations, wouldn’t it make sense to make this configurable so the node is not allowed to go beyond a certain amount of buffered memory? (For instance 256MB for egress, 64MB for ingress, both configurable)
And when these maximums get reached, the node would start rejecting operations to preserve the amount of RAM dedicated to the node, avoiding OOM killer issues and such…

Also, it feels like it would be a way better alternative to the max-concurrent setting. At least for SMR drives, it would allow them to perform great and fast while they still have some room in their dedicated CMR section, and then, only when they start crawling because they have to write to their SMR sections, would they slow down and start rejecting requests.

Whereas today, I have to limit the number of concurrent ingress requests on some of my SMR nodes, which makes them regularly reject requests although they could handle temporary bursts. That’s a shame…

(Hope I’m not off-topic)

It’s rather difficult to set exact limits, since there are other overheads in the Go runtime. Even if it exactly accounts for 128MB, the Go runtime may keep some extra memory around. But it certainly would be possible to make it reject requests when it hits some limit (even if it is imprecise).
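
As a rough sketch of what that rejection could look like (again illustrative, not the real code; the memoryGuard type and names are invented, and the limit would just be whatever gets configured):

package buffer

import (
    "errors"
    "sync/atomic"
)

var errBusy = errors.New("upload rejected: buffered memory limit reached")

// memoryGuard refuses new uploads once the rough buffered-memory estimate
// passes a configured cap; as noted, the Go runtime keeps extra memory
// around, so the cap is approximate rather than a hard guarantee.
type memoryGuard struct {
    limit int64
    used  int64
}

// reserve accounts for one upload's buffer, or returns an error if the
// upload should be rejected; call free with the same size when it ends.
func (g *memoryGuard) reserve(size int64) error {
    if atomic.AddInt64(&g.used, size) > g.limit {
        atomic.AddInt64(&g.used, -size)
        return errBusy
    }
    return nil
}

func (g *memoryGuard) free(size int64) {
    atomic.AddInt64(&g.used, -size)
}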

The max-concurrent setting is not just about disk io and RAM usage, but also bandwidth and CPU usage. Even if your disk/ram might be sufficient, you may still want to independently limit the number of concurrent requests due to a slow CPU.


i would be very interested to hear how your SMR drives would run on a 512K write buffer… i cannot imagine how that wouldn’t solve most of the issues you have…

the 40-50% of the files it wouldn’t help on… but for the other 50-60% it would basically take the required writes down to 1/4th… so if we assume an even split, that’s 4/8 + 1/8, so 5/8 the number of writes… compared to the default of 128K… that’s a vast improvement.

sadly i don’t have any SMR drives to test on.

I tried many settings (1MB, 2MB, 4MB…), but nothing prevented my SMR drive from stalling during heavy ingress loads, except for an aggressive limit like setting the max concurrent requests to 4 (you read that right: 4 only).

@Egon Aaah right, it makes sense to keep the max concurrent requests setting for slow CPUs. Good point.

That, in my humble opinion, would be a great improvement to the Node software (already getting better with each release)! :+1: :slight_smile:

the other SMR patch / fix is to run more than 1 node… this spreads ingress between them, and thus, so long as they still have capacity, the writes to each would be 1/n
n = number of nodes

it’s a pretty good way to fix it… well, max concurrent doesn’t have to be very high… i think i ran my first node on like 12 max concurrent for a couple of months, if not less… and it would barely ever reject anything. i know many have it much much higher, and that can also have some limited use.

like say 100 transfers come in one bundle, then it can be good for evening out such peaks.
but to truly restrict the ingress… well, 4 is sort of 10% of the general estimated max raw iops of an SMR for random writes.

you should see a massive performance uplift from a bigger write buffer… tho there is ofc the chance that the 128k write buffer already gets like 70% of the performance that can be gained from write buffer size increases, and thus anything above could already be diminishing returns.

i hear the multiple node solution is a very good one also…

else get some sort of software write cache, move the storagenode database off the SMR, and also log elsewhere.

if you can get the raw write iops down, it should run pretty okay, it just gets truly terrible if it cannot keep up

I just moved most of my nodes to use Filestore.write-buffer-size=4MiB and the RAM usage of those nodes went from 40-50MB to ~100MB even though the ingress is really low at the moment. Didn’t notice a difference in iops but that’s not really visible as my drives don’t have much to do anyway.
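
For reference, that’s just one line in the node’s config.yaml (or the equivalent --filestore.write-buffer-size run flag); the exact spelling below is how it looks on my setup, so double check it against your own config file:

filestore.write-buffer-size: 4.0 MiB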

I went with 2 MiB for safety reasons.