EXT4 And Storagenode Free Space & Optimization

svet0slav · March 14, 2024, 8:41pm

Hi, guys! I am wondering whether I should increase node sizes.

How much free space do you leave on EXT4? Any optimizations necessary for EXT4 in general?

JWvdV · March 14, 2024, 9:10pm

I’m reserving about 7%, maximum 100GB on every node.

But really, these are minor optimizations. They won’t make you win the race.

daki82 · March 15, 2024, 11:18am

I go with ~1TB free on my 12TB and 20TB drives (10TB and 16TB, respectively, reserved for node data). 100GB sounds bad; sometimes there is more than 100GB in trash.

JWvdV · March 15, 2024, 11:54am

But trash counts towards the node size, although you’re not being paid for it.

snorkel · March 15, 2024, 2:43pm

I think the “overused” amount should be one of the factors taken into account, not trash.

daki82 · March 15, 2024, 4:04pm

Ah, you know about the 10% recommendation. Because just in case…filewalkers n things.

JWvdV · March 15, 2024, 4:16pm

Yeah, I know.
But 10% of 20TB is obviously too much, and I have never actually had a situation in which the overuse was more than a few hundred MB.

daki82 · March 15, 2024, 4:54pm

So why not 5% instead of 1%?

We will not come to a new recommendation. So the personal opinion is all we have.

arrogantrabbit · March 15, 2024, 4:58pm

Why? The recommended amount of free space is specified as a percentage and not in absolute bytes on purpose. The larger the volume, the more breathing room the filesystem needs to do its job effectively. And those thresholds depend on the usage. For static data, like Storj, 5% will suffice according to various sources, but I’m not an expert on ext4.

Generally, when you are approaching 70% of storage used, it’s time to add more storage. Approaching the 90% volume utilization is not healthy for the filesystem and performance.

JWvdV · March 15, 2024, 7:51pm

Why? There is no data backing up that it should be linearly proportional.

There seem to be several reasons why they advise the 10%:

To account for the additional space used for the orders, databases and identity files. But these are only partially related to the size of the node (download orders); the remainder aren’t. Moreover, it is only a few hundred MB.
To prevent fragmentation on an extent-based file system. But this also depends on the size of the files you save. Since we’re talking about file sizes of about 1MB on average, a few GB will be more than sufficient for this, if necessary at all.
To leave some space for bugs. For example, A) files may still be downloaded although the node has already filled up. But such problems will show up on smaller nodes long before they happen on bigger nodes, so you will have been warned far ahead, if they have not already been fixed before they become troublesome. This also happens when you delete the databases on the node and the used-space filewalker has to rebuild the usage statistics. Or, for example, B) files aren’t deleted although they have already been deducted from the usage statistics. But bigger nodes hold older data, which tends to be less dynamic than younger data, so these problems will also become troublesome sooner on smaller nodes than on bigger ones.

So, essentially, the reserved space is taken proportional to the size of the disk the node resides on, although I can’t think of any process or bug that is likely to behave proportionally to the size of your node or disk.

Therefore we’re reserving space for some unforeseeable event (one that I, at least, can’t foresee), which leaves the question of why it should be proportional (1, 3, 5, 8, 10, 15 or 20%) to disk size. Why not the square root of the disk size, which would be statistically more likely? And why not just a fixed 50GB?

I think that the 10% is made up out of thin air.

JWvdV · March 15, 2024, 8:05pm

Maybe, but the question is whether it’s relevant for STORJ. I mean, internet speed is an order of magnitude lower than disk speed, so it will probably never be troublesome.

Besides, the performance degradation might be more proportional to usage than to fullness. See for example: https://www.usenix.org/system/files/hotstorage19-paper-conway.pdf

Besides, part of the performance loss comes from the disk geometry: the start of the disk is at the outer edge, and it is written towards the inner edge. Therefore track speed at the start of the disk is significantly higher than at the end of the disk. See for example Large Spinning Hard Disk Performance Study

So it depends on what you mean by healthy.
It’s very normal for an HDD to show performance degradation as the disk fills up. It is part of the design, so to speak.

Alexey · March 16, 2024, 10:16am

I would add: if the node detects that it has less than 500MB of free space, it will report itself as full to the satellites, independently of the specified allocation.
However, if we introduce a bug here…, well - you are doomed.

P.S. I’m guilty, I specified the allocation so as to leave 100GB of free space on my 9TB total offering…
However, I have seen scary messages that I have 100GB overused (I still have several hundred GB of free space on that drive), so…
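
To make the 500MB rule concrete, here is a minimal Python sketch of such a check. It is only an illustration and not the storagenode’s actual code: the function name, the path argument and reading “500MB” as decimal megabytes are all assumptions.

```python
import shutil

# Assumption: "500MB" is read as 500 * 10^6 bytes.
MIN_FREE_BYTES = 500 * 10**6

def should_report_full(storage_path: str, min_free: int = MIN_FREE_BYTES) -> bool:
    """Return True if the filesystem holding storage_path has less than min_free bytes free."""
    return shutil.disk_usage(storage_path).free < min_free

# Hypothetical usage: check the filesystem of the current directory.
print(should_report_full("."))
```

The point of such a floor is that it looks at the real free space on the disk, so it still triggers when the configured allocation is larger than what the drive can hold.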

snorkel · March 16, 2024, 11:37am

We need guys like you that are willing to get their hands dirty and follow the forbidden paths, so we can discover the “what if…” and the limits of storagenodes.
Thank you for your service!

Toyoo · March 16, 2024, 4:49pm

One more reason: to account for file system overhead, which is not easy to measure. This includes things like cluster size, file metadata storage.

JWvdV · March 16, 2024, 6:54pm

Oh yeah, but by then we have already passed at least one or two other bugs and/or thrown-away databases.

For sure, metadata is normally far below 1% and in most filesystems it is already (partially) reserved beforehand, and therefore isn’t reported as used space.

The cluster size issue is actually the rounding up of the file size to whole clusters. Most file systems use a cluster size of 4096 bytes. With an average file size of 1MB on STORJ, this means you need to allow (4KB/1MB/2 =) 0.2% for cluster size overhead.

So, accounting for all of this, you probably only have to reserve about 2%, with an absolute minimum of about 1GB, to be safe; and even then it’s quite a generous safeguard.

But I still don’t see any substantial underpinning of the 10%, which still seems to be made up out of thin air.
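
The rounding argument above can be sketched in a few lines of Python. This assumes file sizes are spread evenly relative to block boundaries, so that the last block of each file is half empty on average; the function names are made up for illustration.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Space a file occupies once its size is rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return blocks * block_size

def expected_slack_fraction(mean_file_size: int, block_size: int = 4096) -> float:
    """Average rounding loss (half a block per file) relative to the mean file size."""
    return (block_size / 2) / mean_file_size

MIB = 1024 * 1024
print(f"{expected_slack_fraction(MIB):.2%}")  # overhead estimate for 1MB average files
print(allocated_bytes(5000))                  # a 5000-byte file needs two 4096-byte blocks
```

The estimate scales inversely with the mean file size, so it is only as good as the assumed average.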

snorkel · March 16, 2024, 8:16pm

You must think about the pros and cons. If you reserve 500GB, you lose $0.75 per month, but your node will most likely never crash, at least not from bad allocation.
If you reserve 10GB, you gain about $0.75 more per month, which will make you rich in no time, but most likely your node won’t survive for long.
So ask yourself… is it worth the risk?
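
For anyone who wants to plug in their own numbers, a small Python sketch of that trade-off. The rate of $1.50 per TB per month is an assumption derived from the “500GB = $0.75” figure above, and the calculation assumes the reserved space would otherwise be completely filled with paid data.

```python
# Assumption: payout rate implied by "500GB reserved = $0.75 per month".
RATE_USD_PER_TB_MONTH = 1.50

def monthly_cost_of_reserve(reserved_gb: float) -> float:
    """Payout forgone per month by leaving reserved_gb unused (1 TB = 1000 GB)."""
    return reserved_gb / 1000 * RATE_USD_PER_TB_MONTH

for reserved_gb in (10, 100, 500, 2000):
    cost = monthly_cost_of_reserve(reserved_gb)
    print(f"{reserved_gb:>5} GB reserved -> ${cost:.2f} per month")
```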

arrogantrabbit · March 16, 2024, 11:59pm

This 1MB assumption is two orders of magnitude off:

On an 11TB node, the distribution of file sizes looks like this:

  1k: 3312920
  2k: 3890418
  4k: 3472822
  8k: 5493570
 16k: 5553461
 32k: 7377023
 64k: 3034593
128k: 4795218
256k: 2811797
512k: 461677
  1M: 323325
  2M: 2322707

The median file size is therefore 16k, not 1M. So your cluster size overhead will be over 12%.

Mitsos · March 17, 2024, 2:34am

IMNSHO: percentages don’t make sense when talking about large drives (I’m talking 20TB drives). I’m not wasting 2TB (=$3, literally two months electricity for the drive) for “what ifs”.

Alexey · March 17, 2024, 8:09am

Do not forget to account for the databases: they are not included in the used space.

JWvdV · March 17, 2024, 8:58am

See:

In this case you need the mean (as in average) file size, not the median. Taking your data, it’s about 170KB, which means just 1.2% with a block size of 4k, if the filesystem doesn’t support inline data (or only for very small files, like ext4).

In ext4 you can decrease the block size to 1024 or 2048 bytes when formatting, although I would only consider it on devices with a 512-byte sector size.
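
Both figures can be checked against the histogram posted earlier in the thread. The Python sketch below is a rough estimate only: it assumes every file is counted at its bucket label size and that each file wastes half a 4k block on average.

```python
# Histogram from the earlier post: bucket label in KiB -> number of files.
# Assumption: every file is counted at its bucket label size.
histogram_kib = {
    1: 3312920, 2: 3890418, 4: 3472822, 8: 5493570,
    16: 5553461, 32: 7377023, 64: 3034593, 128: 4795218,
    256: 2811797, 512: 461677, 1024: 323325, 2048: 2322707,
}

BLOCK_KIB = 4  # 4096-byte blocks

total_files = sum(histogram_kib.values())
total_kib = sum(size * count for size, count in histogram_kib.items())
mean_kib = total_kib / total_files

# Median bucket: the first bucket at which the cumulative count reaches half of all files.
running = 0
for size in sorted(histogram_kib):
    running += histogram_kib[size]
    if running * 2 >= total_files:
        median_kib = size
        break

# Half a block lost per file on average, relative to the total stored bytes.
overhead = (BLOCK_KIB / 2) * total_files / total_kib

print(f"files: {total_files}, mean: {mean_kib:.0f} KiB, median bucket: {median_kib} KiB")
print(f"estimated rounding overhead: {overhead:.1%}")
```

The overhead is a ratio of wasted bytes to stored bytes, which is why the mean is the relevant figure: the few large files dominate the stored bytes even though the small files dominate the count.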