Create some sparse files, then set up ZFS on top of them, mount it, and create as many empty files as you desire. Then look at the statistics, or just at the allocated size of the sparse files.
Can’t speak about details, I don’t know zfs—but this is what I’d do with any file system.
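If you do want to try that literally with ZFS, the mechanics would look something like this (just a sketch; the pool name, backing file paths, sizes and file count are made up for illustration):

# Create sparse backing files; they only consume space as ZFS writes to them.
truncate -s 100G /tmp/disk1.img /tmp/disk2.img

# Build a throwaway pool on top of them and fill a dataset with empty files.
zpool create testpool /tmp/disk1.img /tmp/disk2.img
zfs create -o recordsize=128k testpool/many
(cd /testpool/many && seq 1 100000 | xargs touch)

# Compare the apparent size with the actually allocated size of the backing files.
du -h --apparent-size /tmp/disk1.img /tmp/disk2.img
du -h /tmp/disk1.img /tmp/disk2.img
zpool list testpool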
Your thinking would be correct if you only wanted to know how much space metadata records take. In this case we want to test the difference in total metadata size depending on record size, given a certain mixture of file sizes.
ZFS allocates one record for every file up to the recordsize, but multiple records for any file sized above it. So, with a recordsize of 1MiB, all files up to 1MiB get 1 metadata record (and an actual on-disk size of roughly 2^int(log2(size)+1)). Meaning a file of 512k gets one metadata record, but a file of 4.5MiB gets 5 records. I don't know whether ZFS implements measures to reduce metadata size in the case of adjacent records (something like the extents of many other filesystems).
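As a quick sanity check of those numbers, a small awk snippet (a sketch; the three byte sizes below are arbitrary example values, and 1MiB is the assumed recordsize) computes the record count and the power-of-two rounded size per file:

# Records needed per file at a 1MiB recordsize, plus the rounded on-disk size
# from the formula above. Input sizes are example values only.
printf '%s\n' 524288 4718592 131072 | awk '
  { rs = 1048576                                    # assumed recordsize: 1MiB
    rec = ($1 <= rs) ? 1 : int(($1 + rs - 1) / rs)  # ceil(size / recordsize)
    rounded = 2^int(log($1)/log(2) + 1)             # power-of-two rounding per the post
    printf("%d bytes -> %d record(s), rounded to %d bytes\n", $1, rec, rounded) }'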
@arrogantrabbit
Any results back yet?
I rewrote the awk script once again:
for i in /storj/n*; do
  echo $'\n\n'$i
  # Bucket every file size by ceil(log2(size)), with a 4k (2^12) minimum, then
  # estimate the record count per bucket for a 128k (2^17) and a 512k (2^19) recordsize.
  find "$i" -type f -print0 | xargs -0 -P 40 -n 100 stat -c "%s" | awk '
    {
      n = log($1) / log(2)
      if (n < 12) { n = 12 } else if (n > int(n)) { n = int(n) + 1 }
      size[n]++
    }
    END {
      t = 0; t128 = 0; t512 = 0
      for (i in size) {
        r128 = size[i] * (i < 18 ? 1 : 2^(i-17))   # 1 record up to 128k, then the upper bound
        r512 = size[i] * (i < 20 ? 1 : 2^(i-19))   # 1 record up to 512k, then the upper bound
        printf("%d: %d files; 128k-records: %d, 512k-records: %d\n", 2^i, size[i], r128, r512)
        t += size[i]; t128 += r128; t512 += r512
      }
      printf("Total amount of files: %d, 128k-records: %d, 512k-records: %d\n", t, t128, t512)
    }' | sort -n
done
When I execute this, I start to understand why metadata almost tripled with 128k records (unless ZFS implements extent-like measures for adjacent records):
So, a recordsize of 512k quite consistently halves the number of metadata records (it's probably not quite that bad, since I didn't calculate at the file level, because that would take ages).
So, just a little under twice as much space used for metadata with a 128k recordsize. But about 25-50% more efficiency loss. However, the losses seem to be incredibly high… I don't see why.
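For reference, the exact file-level version of that count (the one I skipped) would look roughly like this; same /storj/n* layout as above, but it computes ceil(size/recordsize) per file instead of using power-of-two buckets, so it is slower:

# Exact per-file record counts for 128k (131072) and 512k (524288) recordsizes.
find /storj/n* -type f -print0 | xargs -0 stat -c "%s" | awk '
  { r128 += ($1 <= 131072) ? 1 : int(($1 + 131071) / 131072)
    r512 += ($1 <= 524288) ? 1 : int(($1 + 524287) / 524288)
    files++ }
  END { printf("files: %d, 128k-records: %d, 512k-records: %d\n", files, r128, r512) }'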
Well, that's what I assumed the question was. Though I might not be aware of what these metadata records store. So…
Why does metadata size depend on file size? This is not the case in any of the file systems I know of, except maybe FAT. What is stored there, if not just the standard POSIX metadata like file name, permissions and timestamps?
With ZFS, metadata is understood to be literally everything that is not user data. We are talking file and directory information, allocation tables/btrees, information about pool configuration, any snapshots, etc.
If a file requires multiple records to store it, there will be more metadata in the form of the block pointers for those records.
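If you want to see what that metadata actually consists of on a given pool, zdb can print a per-type block breakdown (rough sketch; the pool name is a placeholder, the exact output format differs between OpenZFS versions, and the traversal can take a long time on a big pool):

# Walk the pool and print block statistics grouped by object type
# (plain file data, indirect blocks, directories, dnodes, spacemaps, ...).
zdb -bb yourpool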
I found a lot of information, but it mostly feels like rules of thumb based on typical use cases. But storj is very special, so I decided to run a real-world test: 18TB HDD, 512GB SSD for metadata only, databases on another SSD. I will come back with results later.
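For anyone who wants to reproduce that layout, the pool setup would be roughly this (a sketch; pool, dataset and device names are placeholders, and note that a single non-redundant special vdev means losing that SSD loses the whole pool):

# 18TB HDD as the data vdev, the 512GB SSD as a metadata-only special vdev.
zpool create tank /dev/sdX special /dev/nvme0n1

# special_small_blocks=0 (the default) keeps the special vdev metadata-only;
# raising it would also move small data blocks onto the SSD.
zfs set special_small_blocks=0 tank

# Dataset for the node data; the databases live on a separate SSD outside this pool.
zfs create -o recordsize=1M tank/storj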
As far as I know, compression must be enabled to keep ZFS from using the full blocksize for smaller files, so compression: yes.
With a 4MB recordsize, storj pieces would always fit into one record, but we might run into trouble when the disk is almost full because of fragmentation? However, to avoid this we would probably need to go much smaller, which leads to much more metadata. So I will try 4MB first, if OpenZFS allows that size.
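OpenZFS does allow records larger than 1M, but on Linux this is gated by a module parameter (a sketch; the default cap differs between releases, the pool additionally needs the large_blocks feature, and tank/storj is a placeholder):

# Check the current cap on recordsize (in bytes); 1048576 means 1M is the maximum.
cat /sys/module/zfs/parameters/zfs_max_recordsize

# Raise the cap to 16M (takes effect immediately, but is not persistent across reboots).
echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize

# The pool must have the large_blocks feature (enabled by default on recent pools).
zpool get feature@large_blocks tank

# Then a 4M recordsize can be set on the dataset.
zfs set recordsize=4M tank/storj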
Actually, if you turn on compression, even if it's just zle, it apparently prevents slack for every file. I don't know why, I mean: why isn't zle the default in the filesystem? But that's how ZFS apparently works.
You can see this behaviour quite easily: on my 512k datasets the compression factor is 1.1, on the 1M datasets it is 1.2. So essentially compression only compresses slack space.
I turned on compression on my zfs drives just to avoid slack wastage. I think lz4 won’t waste time compressing uncompressible data.
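For completeness, enabling it and checking what it actually buys you is just (dataset name is a placeholder):

# lz4 aborts early on incompressible blocks, so the CPU cost on already
# encrypted storj pieces stays small.
zfs set compression=lz4 tank/storj

# compressratio shows how much is saved, including slack in partially filled records.
zfs get compression,compressratio tank/storj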
Incidentally, I'm using L2ARC for metadata and compression seems very effective there. 1000GB of metadata compressing down to around 66GB of actual SSD space.
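In case anyone wants to do the same, restricting L2ARC to metadata is a single property (sketch; the pool name and cache device are placeholders, and the cache device has to be added separately):

# Add an SSD partition as an L2ARC cache device.
zpool add tank cache /dev/nvme1n1

# Only keep metadata in the L2ARC, not file data.
zfs set secondarycache=metadata tank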