Looks like some improvements are in the works

Could that be huge?

https://review.dev.storj.io/c/storj/storj/+/14910?tab=comments

6 Likes

Could be good for shingled disks

1 Like

How do deletes work, do you need to read the entire log file, process the deletes, then write it back to the disk?

I wonder how much write overhead there would be if this is the case.
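For what it’s worth, log-structured stores usually handle deletes lazily: a deleted piece is only marked dead in an index, and the space comes back during a later compaction pass that rewrites the log without the dead records. Here is a minimal sketch in Go, assuming a hypothetical length-prefixed record layout and an isLive callback; neither is Storj’s actual format, it just illustrates where the write overhead would come from.

```go
// Hypothetical sketch of compaction in a log-structured piece store.
// Record layout (32-byte piece ID, big-endian uint32 length, payload) and the
// isLive callback are assumptions for illustration only.
package logstore

import (
	"encoding/binary"
	"io"
	"os"
)

// compact copies every still-live record from oldPath into newPath sequentially,
// then atomically replaces the old log. Deleted pieces are simply skipped, so the
// cost of deletes is deferred to the next compaction pass (write amplification).
func compact(oldPath, newPath string, isLive func(id [32]byte) bool) error {
	src, err := os.Open(oldPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(newPath)
	if err != nil {
		return err
	}
	defer dst.Close()

	var id [32]byte
	var size uint32
	for {
		if _, err := io.ReadFull(src, id[:]); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		if err := binary.Read(src, binary.BigEndian, &size); err != nil {
			return err
		}
		if !isLive(id) {
			// Skip the payload of a deleted piece instead of copying it.
			if _, err := io.CopyN(io.Discard, src, int64(size)); err != nil {
				return err
			}
			continue
		}
		if _, err := dst.Write(id[:]); err != nil {
			return err
		}
		if err := binary.Write(dst, binary.BigEndian, size); err != nil {
			return err
		}
		if _, err := io.CopyN(dst, src, int64(size)); err != nil {
			return err
		}
	}
	if err := dst.Sync(); err != nil {
		return err
	}
	return os.Rename(newPath, oldPath)
}
```

Whether the actual change compacts like this, punches holes in place, or does something else entirely is exactly the write-amplification question above.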

2 Likes

I’d assume they work just like deletes on an SSD. Hint: SSDs don’t really delete anything either. At some point a read/modify/write pass will be needed.

I am personally strongly opposed to any and all attempts to reinvent filesystem features in the application. If an app needs to write a file, just write a file. Don’t do batching/striping/logging/appending/anything else trying to work around perceived filesystem [mis]behavior. Just do what you need to do, and let the filesystem handle performance. It’s designed to do that. It has been tested over billions of hours on millions of devices. The storagenode’s reincarnation of some hackery won’t be.

For example, there are absolutely no issues with node performance on ZFS, even on the previous decade’s ancient hardware. None. There is no need to complicate the node software and risk data loss, and for what benefit exactly? To run a node on a potato for a day?

These shenanigans will never be as reliable as any underlying filesystem. So how about we stop wasting development time reinventing the wheel and optimizing things that work just fine?

Two people using ancient ext4 or NTFS (who seriously runs server software on any Windows today, come on) are not worth degrading node reliability for everyone. We are talking about customer data here.

Remember, we moved away from using databases in the past because they were too fragile. Now we are back on the dangerous track of reinventing a database. Just stop. Don’t repeat history.

The same is true of the badger cache. It’s not needed. It may cause issues (and did? I remember reading some thread about it). Throw it away. Let the filesystem do its job, and if it doesn’t, tune it.

Don’t try to turn the storagenode into its own OS. That’s too big a piece to bite off.

7 Likes

I believe that this new storage wouldn’t become the default any time soon. It’s still at too early a stage, with approaches being tested.

To use ZFS you need a lot of RAM and more powerful hardware (and more expensive hardware, either in its own cost and/or in power consumption). It also doesn’t fully match “use what you have now”. Even a low-power device can run a node, and we want to keep it that way.

That’s not true. I’ve experimented with a Raspberry Pi 4 with 8 GB of RAM, an HDD, and an SSD as a special device. No issues. It ran great.

“Use what you have” does not mean it has to run on anything you have.

People stringing Raspberry Pis together are not the backbone of the network. Those are just toys. People who have “more powerful hardware”, like home or SMB NASes, are who should be focused on. Those machines are inherently more reliable, persistent, and less likely to drop dead when the cat sneezes nearby, compared to a Pi with all those GPIO pins exposed. The Pi is a prototyping device, not a production server.

Storj can collect telemetry and figure out the fleet of devices it runs on, normalized by space/bandwidth. I highly doubt Raspberry Pis are the backbone of the network.

Compared to a Raspberry Pi? Yes. But it is already running anyway. It does not matter what it consumes if it’s already online and available.

Not if this takes away precious development resources only to improve performance of 3% of nodes by volume while reducing reliability.

Does Storj have stats on the CPU/RAM/architecture/bandwidth of the devices running the nodes today, even if virtualized? Maybe it is OK to drop Raspberry Pis? That would be much cheaper. Or just do nothing and wait until a Raspberry Pi 9 with 34 cores and 128 GB of RAM is released. It’s best to avoid writing code that does not absolutely have to be written.

1 Like

And with an RPi 3B+ with 1 GB of RAM, no SSD, and one HDD too? I doubt it.

Why not, if it meets the minimum prerequisites?

Maybe, but they use much less energy than anything else. This is a good choice for those who want to learn Linux and earn something on top. So even if they run it for fun, it is still useful. At least my RPi 3B+ was able to pay for all four nodes’ electricity and my internet subscription. For me it was useful. So, why not?

This is probably true. However, we accept anyone who has spare space and bandwidth.

In this case, it is exactly in line with our ideas of sustainable development.

It’s not only about a small number of slow nodes. It’s also about reducing hardware costs if someone were to build hardware only for Storj (yes, we are against that, but we are not blind). And we want to reduce the footprint, not increase it.

Maybe; I didn’t search enough to find that stat. I usually used the download stats from GitHub.

This one implementation already shows a performance increase. We want our nodes to perform well. Any node. Even on weak devices. Highly distributed nodes are our strength, so why not squeeze out every millisecond we can?

2 Likes

Well, as long as it does not work as badly as the storage method in Storj v2, it will probably be good.

There are some people with large MSSQL databases…

I’m using ext4 on top of a zvol (the node runs inside a VM, and the virtual disk is a zvol). I thought that ZFS on top of a zvol would be bad for performance, but maybe it would not have been as bad.
I also do not know whether this new method would improve the performance of my setup.

1 Like

I’m with arrogantrabbit on this: that’s simply not true.

ZFS’s ARC does benefit from having more memory, but large amounts aren’t needed. The special sauce for Storj is its ability to easily and natively use an SSD as a ‘special metadata device’ to speed up almost everything heavy a node does, like filewalkers completing in seconds/minutes instead of hours/days.

Anyway… if this change means welding all our millions of current .sj1 files into fewer, larger files so HDDs spend more time doing sequential transfers… I’m intrigued. But I’m interested in what happens on deletes: it sounds like you either swiss-cheese those new larger files with tons of hole-punching IO… or you rewrite them with their new contents (but benefit from those rewrites being sequential).

Hmmm… Copy-on-Write semantics… where have I heard that before? :wink:
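For reference, the “punch holes” option is something ext4/XFS/btrfs already expose through fallocate. A tiny hedged sketch in Go (Linux-only; the offset/length bookkeeping is assumed to come from the store’s own index, and the names are illustrative):

```go
// Sketch of the "punch holes" alternative: deallocate a deleted piece's byte
// range inside a large log file so the filesystem reclaims the space without
// rewriting the whole file. Linux-only; offsets come from the store's index.
package logstore

import (
	"os"

	"golang.org/x/sys/unix"
)

// punchHole deallocates [off, off+length) in the open log file. Reads of that
// range return zeroes afterwards, and the file's apparent size is unchanged.
func punchHole(f *os.File, off, length int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		off, length)
}
```

The trade-off is fragmentation inside the log file versus the write amplification of rewriting it.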

1 Like

When it comes to filesystems, I agree with Bart Simpson…

2 Likes

100% agree. ZFS brings more headaches than benefits, especially on low-end hardware.

1 Like

No, the disks will still need to deal with the same amount of IO, likely more:

  • Now, to fetch a specific piece, a metadata lookup is needed (from RAM), then a seek, then a read of the data.
  • If you have a big blob: a metadata lookup for the blob, then a seek to the blob, then a read to find where in the blob that piece is, a seek to it, then a read.

You’ve just added an extra step or two that the filesystem cannot optimize, because it has no idea about your second-layer secret filesystem.

This is actively hurting performance.

The only use case that will improve is copying node data as-is to another volume. Hardly a burning performance issue.
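To make that two-layer lookup concrete, here is a rough Go sketch, assuming a hypothetical in-memory index from piece ID to (log file, offset, length); the types and names are illustrative, not the actual design:

```go
// Sketch of the double lookup described above: the node resolves the piece in
// its own index first, then asks the filesystem to open and read inside the
// large log file. The index layout is an assumption for illustration only.
package logstore

import "os"

type pieceRef struct {
	logPath string
	offset  int64
	length  int64
}

// readPiece looks the piece up in the node's index, then reads its bytes from
// the containing log file at the recorded offset.
func readPiece(index map[string]pieceRef, pieceID string) ([]byte, error) {
	ref, ok := index[pieceID]
	if !ok {
		return nil, os.ErrNotExist
	}
	f, err := os.Open(ref.logPath) // filesystem metadata lookup for the log file
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, ref.length)
	if _, err := f.ReadAt(buf, ref.offset); err != nil { // seek + read within the log
		return nil, err
	}
	return buf, nil
}
```

Every read now goes through the node’s own index before the filesystem’s, which is the extra indirection being described.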

2 Likes

I always thought this project could benefit from having a choice of piece stores. So this is great news to me. Can’t wait to try it. :+1:

1 Like

I would not be so enthusiastic. The piece store in v2 was based on very well-proven databases (pretty much as good as it gets in terms of reliability), yet it collapsed catastrophically, and it was determined that there is nothing more reliable than plain old files on the disk. That’s what we have in v3.

Now that history seems to be forgotten, and someone is trying to move away from plain old files on disk to what is essentially a database, but one made from sap and twigs.

I seriously have no clue how this exercise in futility got allocated development time. Aren’t there architecture reviews? Proof-of-concept work before the changes?

If the team has extra time to burn, there are more pressing things to address: the still-broken updater on FreeBSD (no, I’m not filing yet another bug report), the broken plots in the UI on page refresh, and the gigabytes of logs per day that the node generates. The whole logging setup needs to be overhauled. It’s 2024, use tracing. Especially if the alleged reason for this database nonsense is the concern about running on weak hardware. Fix the logs. Then maybe you’ll realize the filesystem isn’t the bottleneck.

1 Like

So far the best solution I have found is ZFS with a special device or L2ARC for metadata. This works great, but it needs an additional SSD. The proposed piece store could work without such an SSD, in my opinion.

3 Likes

Having 30 million files on a 4-6 TB HDD is also not very efficient. I think the best would be a 4k or 8k file size, depending on the HDD size; making it configurable would be the best way.

Exactly. The only people who suffer through Windows are those who are locked into the Windows ecosystem/software. That is not accidental.

A line has to be drawn somewhere. An 8 GB RPi is as good a line as any.

I was using ZFS with mirrored vdevs.

Switched to ext4. Reason: doubled usable space.

Now switching back to single vdevs (HDD) with mirrored special devices for metadata (SSD). Reason: slow file listings on ext4 and too much I/O on metadata.

ZFS is set to cache metadata only, so not much RAM is needed. Filewalkers are blazing fast. I/O goes primarily to the SSD.

3 Likes

NTFS? It’s not an appropriate filesystem.

Hiding part of the data inside files is like moving money from a bank account to cash in the mattress: now the bank can’t see it and can’t optimize your cash flow. Those pieces still need to be found and addressed, but now not by stable filesystem code, but by a new contraption Storj invented, one that prevents the filesystem from optimizing the data flow.

There is absolutely no reason for the storage node to work around the deficiencies of an ill-fitting filesystem.

A few comments here:

  • There are filesystems that don’t waste space on partial sectors, if this is what you are worried about.
  • Why stop there? Why not take this approach to the max: let the storagenode create a large 10 TB sparse image file with that custom Storj-designed filesystem inside. Because that’s what is being suggested, just on a smaller scale. And you can now see how silly it is: Storj should not be in the business of designing filesystems. There are already filesystems available that do the job just fine. It’s the same thing; we have just come full circle.
  • NTFS is not a good filesystem to begin with. ReFS could be the one, but that will never happen, because it’s not easy. Storj can’t pretend it can do better as a side project.

Why not? A filesystem’s one literal job is to store files. In fact, storing 30M files on a single filesystem is better than storing 10K containers with 1,000 files each inside a secret filesystem, because the host filesystem can no longer optimize access to the files in the hidden filesystem: it cannot distinguish data from metadata, among other things. Cache misses skyrocket, thrashing skyrockets, and everything collapses.

If there were optimizations to be made, they would have already been (and are being) made in the host filesystem itself.

There are modern filesystems that have no issues with hundreds of millions of files. NTFS is not one of them. It’s not Storj’s job to fix the shortcomings of a filesystem used on a small minority of nodes. (I don’t have numbers; I assume it’s a minority, because nobody willingly puts up with Windows unless they have to, due to legacy software like the aforementioned MSSQL server. But then VMs exist to run that legacy software, so there is no excuse really.)

I think I made my position on this topic very clear, and it baffles me that nobody else sees the obvious futility of the proposed approach. I don’t really care, but it would be a shame to waste development time on a clear dead-end instead of moving the project forward.

2 Likes

Why do you mirror the special device if you have no redundancy to begin with? Use an old enterprise SSD with PLP if you don’t already.

Exactly.