ZFS discussions

kevink · April 19, 2020, 2:07pm

I thought I was being very clever and created multiple datasets for my storagenode:
STORJ/storage with recordsize=16K for the database files
STORJ/storage/blobs with recordsize=1M for all pieces
STORJ/storage/trash and the other directories with recordsize=1M

First: This was a bad idea because files can’t be moved (renamed) atomically between datasets, so many uploads fail and the node doesn’t work correctly… Stupid mistake…
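
For context: each ZFS dataset is mounted as its own filesystem, so a rename from one dataset to another is rejected by the OS with EXDEV, and the caller has to copy and delete instead. A minimal Python sketch of that fallback follows. The `move_file` helper is hypothetical and is not the storagenode's actual code.

```python
import errno
import os
import shutil


def move_file(src, dst):
    """Move src to dst; return "rename" or "copy" depending on what was needed."""
    try:
        # Atomic when src and dst live on the same filesystem/dataset.
        os.rename(src, dst)
        return "rename"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Different filesystem (e.g. another ZFS dataset): copy, then delete.
        # This is neither atomic nor cheap for large files.
        shutil.copy2(src, dst)
        os.unlink(src)
        return "copy"
```

Software that only calls the plain rename, without such a fallback, would simply see an error when the source and target directories sit on different datasets.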

Second: Now I have everything in a dataset with recordsize=1M which is great for pieces but the sqlite database will probably hate it… I can’t find much information about using a sqlite database with high recordsize settings. Any insights on this?
I am using ARC/L2ARC and a ZIL, so the db should be cached most of the time anyway, but I still wonder about the effect of storing that db on a dataset with such a big recordsize.

Third: This has been asked many times and sadly always denied… Please make it possible to move the databases into a separate mountpoint! It’s not that difficult to just give all database files their own directory…

Alexey · April 19, 2020, 2:18pm

We still reject that request. It would significantly increase the possible points of failure, by up to 12 times. We will never suggest this, for obvious reasons.

Pentium100 · April 19, 2020, 3:13pm

I am using a zvol with volblocksize=64K (small block sizes do not work well with ashift=9 and raidz), with ext4 on that zvol and discard enabled. Seems to work OK: 70% PUT success rate with less traffic, and around 60% now.

The database is rather small and usually is accessed in async mode so it should work OK with a large recordsize.

KernelPanick · April 19, 2020, 3:36pm

Can you elaborate?

If this is made as a separate mount in the docker container, the container won’t start without it.

If the database is stored separately on a RAID1 SSD array and storage is on a single disk, I would argue it could be an order of magnitude more reliable and performant.

Alexey · April 19, 2020, 3:50pm

More mountpoints - more points of failure, it’s obvious.

Storgeez · April 19, 2020, 3:54pm

Agreed, the preflight checks will ensure the software does not run without the database, and if the mountpoint becomes unavailable for some reason, well… it would become unavailable either way.
Unless I’m missing something here.
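
To make the preflight idea concrete, here is a rough Python sketch of such a check. It is not the storagenode's real check: the function name and the example paths in the usage note are made up for illustration.

```python
import os
import tempfile


def preflight(dirs):
    """Return a list of problems; an empty list means every directory is usable."""
    problems = []
    for d in dirs:
        if not os.path.isdir(d):
            problems.append(f"{d}: missing or not a directory")
            continue
        try:
            # Prove the directory is writable by creating and removing a temp file.
            with tempfile.NamedTemporaryFile(dir=d):
                pass
        except OSError as exc:
            problems.append(f"{d}: not writable ({exc})")
    return problems
```

A node could call something like `preflight(["/app/config", "/app/database"])` at startup and refuse to run if the list is not empty.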

Well, actually, there are more points of failure if the mountpoints are on separate devices, yes, but this just raises this node’s chance of failure. The node itself is still a single point of failure from Storj’s perspective, and the Storj network already compensates for this. Nodes are still considered fallible and could-fail-at-any-moment by the satellite.

Alexey · April 19, 2020, 3:59pm

When somebody asks to increase the probability of failure for every single node, I will reject such a request.

kevink · April 19, 2020, 3:59pm

I understand the argument that more mountpoints could result in more mistakes. However, this would not be a standard mountpoint and only people “who know what they are doing” should use it.
But it would still increase the possibility of nodes failing, so I kinda understand why it is not going to happen, even though it’s a highly requested feature…

So I guess it is pointless to argue about it, because STORJ simply does not want to do it.

Storgeez · April 19, 2020, 4:05pm

I was commenting under the assumption that only a subset of SNOs would do this.

This is the same as suggesting people do not use redundant RAID because they will be wasting their space - the network is already highly fault tolerant and doesn’t care, so this is a non-issue.

Alexey · April 19, 2020, 4:05pm

Then they know how to do it.

KernelPanick · April 19, 2020, 8:12pm

You make a good point. Storj doesn’t care about individual SNOs to this extent because they’ve already factored in the failure rate and recovery. And the SNOs get to pay for it with their escrow or failed GE. I guess this might only be discussed seriously if repair traffic gets too high.

By operating in this manner you increase the reliability of the node by:

offloading IOPS to faster, redundant storage (the DB)
decreasing the risk of the ‘slush’ storage (10%) not being enough
having a watchdog immediately stop the node if the mounts are unavailable, reducing risk

What I want to see is a mount like this, without running other complicated symlinks:
--mount type=bind,source="/mnt/SSDRAID1/StorjV3DB",target=/app/database \

BrightSilence · April 19, 2020, 8:37pm

Let’s keep this realistic. It would help with performance, but not with reliability. It adds an extra point of failure. And if you use other disks or even another array, it’s still an extra failure point.

That argument is solid. Though it doesn’t make sense to say it increases points of failure by 12x.
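
A back-of-the-envelope way to look at it: if the node needs every mount to be available, its failure chance is 1 minus the product of the individual survival chances. The sketch below assumes independent failures, and the probabilities are invented purely for illustration.

```python
def node_failure_probability(mount_failure_probs):
    """Chance that at least one required mount fails, assuming independent failures."""
    survive = 1.0
    for p in mount_failure_probs:
        survive *= 1.0 - p
    return 1.0 - survive


# Hypothetical numbers: a 2% chance for the storage disk, 0.5% for a mirrored SSD pair.
single_mount = node_failure_probability([0.02])
split_mounts = node_failure_probability([0.02, 0.005])
print(f"single mount: {single_mount:.4f}, split mounts: {split_mounts:.4f}")
```

With these made-up inputs the risk goes from about 2% to about 2.5%. The extra mount does add risk, but how much depends entirely on how reliable the extra device is.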

That said. Some nodes run into real issues that could be fixed by adding a feature to store databases elsewhere. These issues are sometimes bad enough that SNOs will have to quit without this option. Especially looking at the issues with SMR drives. The IO bottleneck can grind SNO hardware to a halt and a lot of that can likely be avoided by moving db’s elsewhere.

So yes… I could argue both sides. It’s not an unreasonable feature request, nor is it unreasonable to deny it. I think it should be seriously considered. But I’ll accept any decision that comes out at the end of that consideration. Just don’t dismiss it out of hand.

SGC · April 20, 2020, 8:46am

well it’s on your system and it’s open source… you just have to dig deep enough to change it i guess…
and though i agree with the points made by both @BrightSilence and @Alexey, i would add that it shouldn’t be an easy option. but storing the database on the OS drive shouldn’t create any real failure points and may help with the SMR issue… even though i believe it’s highly unlikely that it’s the db that makes the SMR drives mess up… you basically cannot use them for anything other than cold storage and data retrieval… so i doubt SMR drives will ever work without something like a sizable write cache… and again… writing to the drive takes like 100 times more of the drive’s resources than reading… so it will most likely never work for live data…

Pentium100 · April 20, 2020, 8:52am

The database has random writes, while files are more-or-less sequential, so it may help a bit with SMR. You can probably use symlinks or similar for the db, but sqlite does not like symlinks. So, maybe the reverse? Symlinks for the data directories? I have not tried this and have no reason to, at least for now.

SGC · April 20, 2020, 8:54am

my zpool for my zfs raidz1 is using 128K recordsize
think that was just default settings…

@kevink i don’t think you need to actually create multiple datasets or mount points to have zfs run different compression or such things on a folder… pretty sure that’s mostly subject to configuration… not sure about recordsize though… it seems like it might not be…

why can’t the database just run on a 1M recordsize?

Pentium100 · April 20, 2020, 8:57am

Yes, you do; compression and other settings are per dataset.

Somewhat lower performance because the database is updated in small portions, but zfs needs to update it in larger blocks.
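
A rough way to put a number on that: compare the SQLite page size with the recordsize. The sketch below is a worst-case estimate only. It assumes every changed page dirties one full record, and it ignores compression, ARC caching, and the batching of several writes into one transaction group. Both helper names are made up.

```python
import sqlite3


def sqlite_page_size(db_path=":memory:"):
    """Ask SQLite for the page size (in bytes) of the given database."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()


def worst_case_amplification(recordsize_bytes, page_bytes):
    """Bytes ZFS rewrites per byte SQLite changes, if one page dirties one full record."""
    return recordsize_bytes / page_bytes


page = sqlite_page_size()
for label, recordsize in (("16K", 16 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)):
    print(label, worst_case_amplification(recordsize, page))
```

For example, with a 4096-byte page a 1M record gives a worst-case factor of 256, against 4 for a 16K record. In practice the real cost should be well below that worst case for a small, mostly cached database.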

SGC · April 20, 2020, 9:02am

but that would mainly happen in the ARC and ZIL anyways, at least with enough RAM, so i doubt it should change performance that much…

well to hell with it… i’m going to jump right in and see if i can extend my run time until my 2nd HBA arrives from Hong Kong… time to reconfigure my recordsize and enable compression again… lol this is going to be exciting…
got less than 2TB left on the node and then maybe i can find 1TB more to add… so that’s like a week away from full at current ingress…
and i really would like to wait until i get my HBA in a few weeks… lol

i got folders in my zpool that aren’t datasets… rookie mistake i think… very annoying tho…

SGC · April 20, 2020, 9:38am

seems to run fine for now, ofc will take time for my system to adjust,
@kevink also you don’t have to worry about the 1M recordsize being too large for the database, because zfs doesn’t have a fixed recordsize; it’s an upper limit, so you should be fine running the database on a 1M recordsize setting.
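
One nuance on the "upper limit" point, as far as I understand ZFS: a file smaller than recordsize is stored as a single block sized to fit the file, but once a file grows past recordsize it is split into full-size records. A simplified Python model of that rule follows. It ignores compression and ashift rounding, and the function name is hypothetical.

```python
def zfs_block_layout(file_bytes, recordsize_bytes, sector_bytes=512):
    """Simplified model: return (block_size, block_count) for a file."""
    if file_bytes <= recordsize_bytes:
        # Small file: one block, rounded up to a whole number of sectors.
        sectors = max(1, -(-file_bytes // sector_bytes))
        return sectors * sector_bytes, 1
    # Larger file: every block is a full record.
    blocks = -(-file_bytes // recordsize_bytes)
    return recordsize_bytes, blocks
```

So under this model a 20 KiB database on a 1M dataset occupies one 20 KiB block, while a database that has grown to a few MiB is made of 1 MiB records, and small updates to it touch whole 1 MiB records.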

Pentium100 · April 20, 2020, 10:02am

That’s OK, as long as you do not need different settings for each. I have a file server and most of the files there are stored in folders and not datasets. Moving a file between datasets takes a long time, so I only use datasets when I need different settings (mount points, compression etc).

Derkades · April 20, 2020, 10:08am

I agree, the example you mentioned but also people using SMB/NFS or FUSE mounts.

If forcing people to store everything on a single drive is important, why is there a second mount point for the identity? If storing the identity separately is allowed, storing the blobs in a different directory from the config files and databases should be allowed as well.