Optimal database(s) write size?

skookum · March 27, 2022, 12:08pm

Caveat: I am not a developer/programmer

I use a little program called “ioztat” to monitor the per-dataset read/write statistics for my storj node. I’ve noticed that the databases are very consistent in their average write operation size of 2 KiB. Most zfs users likely have their pool’s ashift set to match 4K or 8K sector sizes which I believe means that these database operations are fairly inefficient at using the dataset space since the sector size is the smallest operation size that can be used.
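A rough sketch of the arithmetic behind that concern. It assumes every write is rounded up to a whole number of sectors and ignores compression, write batching, and ZFS metadata. The helper name `allocated_bytes` is made up for illustration:

```python
import math

def allocated_bytes(write_size: int, sector_size: int) -> int:
    """Round a write up to a whole number of sectors (simplifying assumption)."""
    return math.ceil(write_size / sector_size) * sector_size

write_size = 2 * 1024  # the 2 KiB average database write mentioned above

for ashift in (12, 13):  # ashift=12 -> 4 KiB sectors, ashift=13 -> 8 KiB sectors
    sector = 2 ** ashift
    used = allocated_bytes(write_size, sector)
    print(f"ashift={ashift}: {write_size} B written, {used} B allocated, "
          f"{write_size / used:.0%} of the allocation is payload")
```

Under those assumptions a 2 KiB write fills half of a 4 KiB sector and a quarter of an 8 KiB one.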

Is the 2 KiB write size deliberate? Even non-ZFS users would likely have disks with 4K sectors, so I’m curious whether buffering writes in 4K amounts instead is a design consideration.

Average read/write IO stats for my storj dataset since creation (about five weeks ago):

           operations    throughput      opsize   
dataset      read  write   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank
  storj         0      0  8.53K  23.8K  83.0K  40.6K
    dbs         0      0  3.63K  1.62K  3.98K  2.00K

Cheers.

littleskunk · March 28, 2022, 11:25am

It is a bug and not a feature. We are testing out different settings that should work better for storage nodes, but that also has an impact on performance and durability, so we need to balance it carefully. You can follow this: RS testing · Issue #4424 · storj/storj · GitHub

SGC · March 28, 2022, 12:48pm

had a bit of trouble figuring out exactly what this meant…

Reed Solomon Code, ofc it was that… but well i’m not that used to hearing that yet so RS didn’t unacronym for me immediately lol.
basically this stuff.

can’t say i really understand the details of it… lol, does make me wonder a bit why going down in scale on the RS scheme would make the database writes larger.

but i suppose that is down to divisors, like say the difference between using a base10 vs using a base12 numeral system.
it’s much easier to divide stuff in base12 than in base10 without getting into decimal numbers.

like say 10 can be divided by itself, 5, 2 and 1.
while 12 can be divided by itself, 6, 4, 3, 2 and 1.

might not seem like a big difference, but it’s kinda huge and why time keeping is based on 12.
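a quick sketch to check the divisor claim (plain brute force, nothing Storj specific, the helper name is just for illustration):

```python
def divisors(n: int) -> list[int]:
    """Return every positive integer that divides n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]

for n in (10, 12):
    print(n, divisors(n))
```

note that 12 divides by 4 as well, so it ends up with six divisors against four for 10.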

you wouldn’t happen to be able to explain, in a way we can understand, why reducing the RS scheme numbers by over 50% makes database writes better suited to the minimum 4k sector sizes of most storage today?

littleskunk · March 28, 2022, 12:52pm

64 MB / 29 pieces vs 64 MB / 16 pieces
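Spelled out as a sketch (not Storj's actual code), assuming "64 MB" means a 64 MiB segment split evenly across the pieces and ignoring any per-piece overhead:

```python
SEGMENT_BYTES = 64 * 1024 * 1024  # assumption: "64 MB" read as 64 MiB

def piece_size(pieces: int, segment_bytes: int = SEGMENT_BYTES) -> float:
    """Size of each piece if the segment is split evenly (simplified)."""
    return segment_bytes / pieces

for pieces in (29, 16):
    size = piece_size(pieces)
    print(f"{pieces} pieces: {size / 1024 / 1024:.2f} MiB per piece")
```

That works out to roughly 2.2 MiB per piece with 29 pieces versus exactly 4 MiB with 16.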

SGC · March 28, 2022, 1:04pm

right so it’s a bit like the age old issue with raid / storage.

the fewer disks / pieces one has the fewer slots for parity data.
which is why the new scheme could have a big effect on durability.

why were 29 pieces used initially… and not like 32? i mean 29 is a very odd number…
it’s a prime, which makes it sort of special later on as one scales up… but there are also literally no divisors for it other than 1 and itself.

and why go down to 16 pieces now, when 32, at least from my limited understanding, would get about the same advantages without offsetting the entire scheme?

littleskunk · March 28, 2022, 1:43pm

Our data science team was responsible for picking some numbers. They took performance, durability and a few other factors into account. We picked these numbers before we had any network at all. Nobody thought about the impact on storage node hard drives. Even after the first uploads that didn’t change. I would say it was around the first performance tests, with millions of uploads, that we noticed the “mistake”. We uploaded millions of ideal file sizes but it was not ideal for storage nodes.

Higher RS numbers will impact the size of the satellite DB. We believe we can reduce the RS numbers and still maintain a high durability. It takes time to test that theory.

SGC · March 28, 2022, 1:53pm

ah right… the metadata issue… i’m very familiar with that “little” problem, because higher RS numbers create more pieces and each piece has to have its own metadata.
metadata being a database entry… to help keep track of everything.

basically the same reason hdd / storage sectors went from 512B to 4Kn.
which reduces the metadata required by a factor of 8.
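a quick sketch of where the factor of 8 comes from, assuming one bookkeeping entry per sector and a made-up 4 TB drive for illustration:

```python
def sector_count(capacity_bytes: int, sector_size: int) -> int:
    """Number of sectors needed to address a given capacity."""
    return capacity_bytes // sector_size

capacity = 4 * 10**12  # hypothetical 4 TB drive, chosen for illustration

small = sector_count(capacity, 512)   # 512 B sectors
large = sector_count(capacity, 4096)  # 4 KiB sectors
print(f"512 B sectors: {small}, 4 KiB sectors: {large}, ratio: {small / large}")
```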

kinda massive, i initially ran my zfs storage on 512 byte sectors, but had to move to 4k due to the massive metadata memory footprint.

certainly sounds like your theory is the right way to go, from what i understand of all this voodoo