[Tech Preview] Hashstore backend for storage nodes

HDD after moving to hashstore

Fragmentation is 99.94%

2 Likes

It seems to be a consequence of migrating millions of small files into thousands of large ones. Do you have an image of a node that has been on hashstore from scratch?

<flashbacks to the soothing sights and sounds of Win3.11 defragging a 250MB IDE HDD>

3 Likes

I totally know what I’m looking at, of course, but there might be less technically savvy people in this forum. Could you explain it in more verbose terms for them? :slight_smile:

The file system is totally fragmented. It turns sequential file reads into random reads, and read speed drops a lot (~20 MB/s average on a modern HDD).

1 Like

Adding that sweet background music of

2 Likes

I honestly don’t have acceptable words for this picture. I know why and how it happened; it’s just one more big item on the TO DO list after the conversion.

I don’t understand why this is such a big problem. Run defragmentation software once, and you should be golden, no?

I’d still much rather have a couple of thousand 2GB files than a couple trillionbillion very small files

1 Like

I imagined this would happen, and I am pretty sure it happens on ext4 too. The filesystem handles smaller files, like in piecestore, better than bigger ones, like in hashstore, when it comes to preventing this. As I said, hashstore is a bad move; better to stick with piecestore and badger.
There is nothing you can do to prevent it, and running defrag 24/7 isn’t going to help either.

It is already running; it will just take another week, one disk at a time. I have 8 of them. I can’t even imagine how to move all the other 100 nodes to hashstore, it will take an enormous amount of time.
But it will be a big advantage once I want to move a node to a bigger HDD: the copy will take several times less time, because right now moving the data runs at about 2-5 MB/s.

Output from e4defrag on a disk (50% filled) for a hashstore folder of US1 satellite (./s0/00 to be exact):

e4defrag 1.47.0 (5-Feb-2023)
<...>
 Total/best extents                             1121/11
 Average size per extent                        10295 KB
 Fragmentation score                            0
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/mnt/storj/node-1/storage/hashstore/<us satellite>/s0/00) does not need defragmentation.
 Done.

It doesn’t seem to be a problem for the moment.
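
For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope reading of them (just a sketch; the 1121/11 extents and 10295 KB average are taken from the output above, everything else is plain arithmetic):

```go
package main

import "fmt"

func main() {
	// Values copied from the e4defrag output above.
	const (
		totalExtents = 1121  // extents the data currently occupies
		bestExtents  = 11    // minimal extent count e4defrag considers ideal
		avgExtentKB  = 10295 // average extent size reported, in KB
	)

	// Rough amount of data in the directory: extents * average extent size.
	totalKB := totalExtents * avgExtentKB
	fmt.Printf("data in directory: ~%.1f GiB\n", float64(totalKB)/(1024*1024))

	// The ideal layout would pack the same data into only 11 extents,
	// i.e. extents of roughly 1 GiB each.
	fmt.Printf("ideal average extent: ~%.0f MiB\n",
		float64(totalKB)/bestExtents/1024)
}
```

With ~10 MB average extents, most reads are still largely sequential on disk, which is consistent with the fragmentation score of 0 while the disk is only half full.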

1 Like

It has 50% of free space to breathe. When it is almost full, I doubt this score will remain the same.

For me, badger did not work very well: each Windows update ended with the node not starting because of a cache error.

I can say for sure: if your HDD has bad sectors, don’t migrate to hashstore. Piecestore was fine.
Say goodbye to the old, old HDD that was only on life support and is now going to kill my node and die peacefully.

1 Like

Why is that? Access to a trillion individual files can be optimized by the filesystem. Access to a trillion data pieces inside huge blobs cannot. It’s just another filesystem, but worse.

1 Like

Some of the hardware I use is vastly inferior to what you’re running. It breaks all the time, RAID volumes degrade and I just have a habit of moving data around for the heck of it.

The increase in file size alone is hugely welcome to me, because it will make my node migrations take much less time, which will significantly increase the number of tests I can run :slight_smile:

1 Like

Did you get to the bottom of this? Perhaps bad cables? Bad RAM? It’s hard to imagine any hardware made in the last 30 years “breaking all the time”.

Mine is literally old garbage nobody wanted, which I got for pennies from a recycler…

A few comments here:

  • The node shall not be optimized for this very unlikely and rare event where people not only need to copy it file-by-file, but also need it to be fast. Most folks never copy the node. What needs to be optimized is day-to-day node operation, and my comment was about that: inserting an opaque new filesystem into the mix is hardly a good idea.
  • When you move stuff around, move at volume or dataset granularity; i.e. move the entire filesystem, not individual files. Then it won’t matter how many files there are or what size they have; it will all be fast sequential IO.

Time for another update. The memtable implementation is getting merged soon. Here is how it will work.

There will be a flag on the storage node to switch between hashtable and memtable. The next time compact runs, it will migrate. It can migrate in both directions, so if you have the feeling memtable doesn’t work, you can switch back to hashtable.
The migration doesn’t rewrite LOG files. Compact does 2 things: it updates the metadata for all LOG files (let’s say it puts down a few thousand trash flags across all LOG files), and then there is the LOG rewrite part for some of the LOG files. The migration will be part of the first step. This basically means that even with 0 LOG files to rewrite, it will migrate all the metadata into the configured format.
There are 2 hashstores per satellite namespace, so for a full migration they both have to run compact. That might take a few days; not because the compaction itself takes that long, but more like a few days of waiting until compact gets triggered.
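
If it helps to picture it, here is a very rough sketch of that flow, under the assumption I’ve understood the description correctly; all names (TableKind, Store, rewriteMetadata, rewriteLog) are made up for illustration and are not the actual storagenode code:

```go
package hashstore

// TableKind is the metadata format a store uses (illustrative only).
type TableKind int

const (
	HashtableKind TableKind = iota
	MemtableKind
)

// Store is a stand-in for one hashstore holding large LOG files plus metadata.
type Store struct {
	kind     TableKind // current on-disk metadata format
	target   TableKind // format selected by the node's config flag
	logFiles []string  // large LOG files holding the piece data
}

// Compact performs the two steps described above: it rewrites the metadata
// for all LOG files (marking trash, and migrating to the target format if it
// differs), and then rewrites only the LOG files that actually need it.
func (s *Store) Compact(needsRewrite func(string) bool) error {
	// Step 1: metadata pass. Even with zero LOG files to rewrite, this step
	// writes the metadata back out in the configured format, which is how
	// the hashtable <-> memtable migration happens, in both directions.
	if err := s.rewriteMetadata(s.target); err != nil {
		return err
	}
	s.kind = s.target

	// Step 2: rewrite only the LOG files whose trash ratio justifies it.
	for _, lf := range s.logFiles {
		if needsRewrite(lf) {
			if err := s.rewriteLog(lf); err != nil {
				return err
			}
		}
	}
	return nil
}

func (s *Store) rewriteMetadata(kind TableKind) error { return nil } // placeholder
func (s *Store) rewriteLog(name string) error         { return nil } // placeholder
```

The point of the sketch is just that the format change rides along with the metadata pass, so even a compact run that rewrites zero LOG files completes the migration for that store.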

The memtable migration will create the hint files on disk that are required to rebuild the memtable on the next restart. We don’t know how long the rebuild will take. It would be best to use an SSD in combination with the memtable to reduce the startup penalty.

We expect the memtable to consume about 9 bytes per piece + another 9 bytes reserved for empty entries (hashtable load factor). So for my own Orange Pi 5 with 32 GB of RAM and 8 HDDs, that would mean about 200 million pieces per drive. I can’t wait to run it, but first it needs to get merged.
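
As a quick sanity check of that estimate (a sketch; the 9 + 9 bytes per piece and the 32 GB / 8 HDDs setup come from the paragraph above, the rest is plain arithmetic):

```go
package main

import "fmt"

func main() {
	const bytesPerPiece = 9 + 9 // live entry + reserved empty slot (load factor)

	totalRAM := int64(32_000_000_000) // 32 GB of RAM on the machine
	drives := int64(8)                // 8 HDDs sharing that RAM

	piecesPerDrive := totalRAM / drives / bytesPerPiece
	fmt.Printf("~%d million pieces per drive\n", piecesPerDrive/1_000_000)
	// Prints ~222 million; "about 200 million" leaves some headroom
	// for the OS and the node processes themselves.
}
```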

The current theory in terms of best performance would be (top to bottom):

  1. If you have enough memory or a low amount of shared space, leave the hashtable on the HDD and use mmap. This needs about 1 GiB of memory per 1 TiB of space.
  2. If you are low on memory but you have an SSD, use the memtable on the SSD (my personal situation). Needs about 1 GiB of memory per 10 TiB of space.
  3. If you are low on memory and without an SSD, use the memtable on the HDD. Not the best performance, but it should be better than the hashtable on HDD. Still needs about 1 GiB of memory per 10 TiB of space.
  4. If you have less than 1 GiB of memory per 10 TiB of space, use the hashtable, preferably on an SSD; if that is not an option, the hashtable on the HDD.

^ This list is just a theory at this point. Further benchmark tests are needed to find out which setup works best in which situation. This ranking might change.
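
To make the rule-of-thumb ratios in that list concrete, here is a small sketch (the 1 GiB per 1 TiB and 1 GiB per 10 TiB figures come from the list above; the 10 TiB node size is just an example):

```go
package main

import "fmt"

func main() {
	const (
		gib = int64(1) << 30
		tib = int64(1) << 40
	)
	nodeSize := 10 * tib // example node size

	// Option 1: hashtable on HDD with mmap, ~1 GiB of RAM per 1 TiB.
	hashtableMmapRAM := nodeSize / tib * gib

	// Options 2 and 3: memtable (on SSD or HDD), ~1 GiB of RAM per 10 TiB.
	memtableRAM := nodeSize / (10 * tib) * gib

	fmt.Printf("hashtable on HDD + mmap: ~%d GiB of RAM\n", hashtableMmapRAM/gib) // ~10 GiB
	fmt.Printf("memtable (SSD or HDD):   ~%d GiB of RAM\n", memtableRAM/gib)      // ~1 GiB
}
```

Same caveat as the list itself: these are rules of thumb that may shift once the benchmarks are in.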

4 Likes

It would seem less confusing if written as “… of memory (SSD space) per 10 TiB of space”
and
“Still needs about 1 GiB of memory (HDD space) per 10 TiB of space”.

Example: let’s consider a 10 TiB node.

Node with enough RAM: 10 GiB of memory (RAM) for the 10 TiB node
Node with less RAM but with an SSD: 1 GiB of SSD space for the 10 TiB node
Node with less RAM and no SSD: 1 GiB of HDD space for the 10 TiB node

Did I interpret it correctly?

After migrating to hashstore, my node shows 2x the amount of data; the real amount of data and the amount reported by the satellite are more or less equal.