[Tech Preview] Hashstore backend for storage nodes

How to execute this script in Windows?

The same way, but you need to use a PowerShell prompt.
The echo command is an alias of the Write-Output cmdlet, so it can be used as-is, but for echo -n (to avoid the newline at the end) you need to use Write-Host -NoNewline instead of echo -n.

2 Likes

I have started a new node with hashstore enabled and it looks promising! I'll try to compare all performance metrics with similar nodes in file mode once I get enough data.

Just a quick question: from what I can see, the log files seem to be allocated "as needed" and not all at once. It definitely helps with the number of files stored on the file system (and all the issues we had with the walker, trash, etc.), but allocating the files like this does not fix all the fragmentation issues, as the 1GB files may end up with many fragments even though they could be a single extent (if the drive has enough contiguous free space).

Log files don't stay forever. If deleted pieces reach 25%, the remaining pieces are copied to a new file, so preallocation doesn't make sense.

For a dedicated disk, they could preallocate the entire disk, but I prefer to use the free space for other things until Storj needs it.

3 Likes

The problem is that if you fill a disk with heavily fragmented 1GB files (current implementation), when one of those files is deleted it will leave lots of small free spaces all around the partition, and when a new 1GB file is allocated to receive the data of the one reaching the 25% threshold, it will get even more fragmented because you don't have any contiguous space left.

I have been dealing with a similar problem for a long time now, and the ideal way to minimize fragmentation is to allocate files at creation time if you know the final size.
Another option is to grow the file by a fixed amount large enough to minimize fragmentation, e.g. allocating a 1GB file in 64MB blocks.
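For illustration, here is a minimal Go sketch of the first idea (hypothetical code, not the actual storagenode implementation): reserve the final size of a log file when it is created, so the filesystem can try to pick one large extent instead of growing the file in many small steps.

```go
package sketch

import "os"

// Assumed target size of a hashstore log file (illustrative only).
const logFileSize = 1 << 30 // 1 GiB

// createPreallocatedLog is a hypothetical helper: it extends the file to its
// final size in a single call before any pieces are appended. On NTFS this
// lets the filesystem allocate one large run of clusters; on Linux you would
// use fallocate instead, since a plain truncate only creates a sparse file.
func createPreallocatedLog(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := f.Truncate(logFileSize); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}
```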

2 Likes

Pieces have various sizes, so it's hard to pre-allocate without wasting too much space.

That was the case before, when we stored individual files, but with the hashstore you know the final size of the log file. And you are going to "waste" some space anyway with the 25% deleted threshold.

Maybe add a parameter similar to pieces.write-prealloc-size that can range from 0 to 1GB, so node operators can run some tests?

PS: Currently, on the fresh node with an empty disk, full 1GB log files have around 1000 fragments each.

Edit:
After storing 30GB of hashstore data on a new empty drive (NTFS, 4K clusters), this is what the fragmentation looks like:

Fragments   Bytes             Clusters
----------- ----------------- -----------
     35 671    29 984 243 356   7 325 287

The average fragment size is around 0.8MB; I'll defragment the drive and let it run without increasing the available space to see how it behaves.

1 Like

I shared your idea with the team.

1 Like

The creator of hashstore:

So, if we switch to this hashstore, should we allocate less space to account for that 25% and the databases?
Like, for a 20TB drive, should we allocate 14.5TB?
I don't see that as an acceptable tradeoff. I will stick with the current mode and badger on.
(Quick remark: the title has a typo… hashtore instead of hashstore)

No, nothing has changed there. Some data would just be in the old backend and some in the new one (if you migrated the node). The new backend configured with hashstore is expected to work as before, just faster.
I would expect less space to be wasted on databases, though.

25% is the threshold at which compaction starts. Perhaps it may become configurable in the future. Right now it seems like an optimal setting to keep a balance between wasted space and IOPS.

I got it. You allocate the space as usual; it will occupy the entire space, but with holes where pieces are deleted, until compaction.

I can only speak for the filesystem I know best (NTFS), and I acknowledge that other file systems work differently.
NTFS is quite good at allocating a file if you give it the final size first. It is quite bad at "growing" a file, and very bad if you ask it to grow in small increments.

Again, only for NTFS, this is not true. Unless you ask the file system to zero-fill the space on allocation, it is very fast at doing so. That's the way SQL Server and others do it, and you get zero latency.
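To illustrate what I mean (a rough sketch of the Win32 calls involved, assuming a 64-bit build; this is not something the node does today): extending the file reserves the clusters, and SetFileValidData is what SQL Server's instant file initialization uses to skip the lazy zero-fill. It requires the SeManageVolumePrivilege.

```go
//go:build windows

package sketch

import (
	"os"
	"syscall"
)

// preallocateInstant is a hypothetical example: it reserves size bytes for
// the file and then moves the valid-data-length marker so NTFS does not have
// to zero-fill the range on first use.
func preallocateInstant(f *os.File, size int64) error {
	// SetEndOfFile (via Truncate) allocates the space up front.
	if err := f.Truncate(size); err != nil {
		return err
	}
	// SetFileValidData(handle, size) skips the zero-fill; without it NTFS
	// zeroes the range lazily when it is first written or read.
	kernel32 := syscall.NewLazyDLL("kernel32.dll")
	setFileValidData := kernel32.NewProc("SetFileValidData")
	ret, _, callErr := setFileValidData.Call(f.Fd(), uintptr(size))
	if ret == 0 {
		return callErr
	}
	return nil
}
```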

I'll try to confirm this, or at the very least show an improvement over the current implementation.

On NTFS it does, and it also makes the MFT larger (slower and uses more RAM).

I forgot about this part, and he is completely right on this one.

Bonus question: do you have any idea of the smallest write IO size?
That would tell whether it's worth testing with a 64K cluster format.

1 Like

Do you mean the log file allocation? I do not think it works like that. It appends pieces while it's possible and marks deletions (only in the hashtable; the log remains untouched, so no holes there). When the amount of deleted pieces is greater than 25%, it will run a compaction. That's how I understand it. TTL data simply goes to a separate log file, grouped by the TTL date, and when the pieces expire, the whole log gets deleted at once.
If you mean the disk allocation, then there are no holes, except the usual FS fragmentation.
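A simplified sketch of that understanding (illustrative names and types, not the real hashstore code):

```go
package sketch

// logFile is an illustrative model of one hashstore log file.
type logFile struct {
	liveBytes int64 // pieces still referenced in the hashtable
	deadBytes int64 // pieces marked deleted; the log itself stays untouched
}

// Compact once 25% or more of a log file's bytes belong to deleted pieces.
const compactionThreshold = 0.25

func needsCompaction(l logFile) bool {
	total := l.liveBytes + l.deadBytes
	if total == 0 {
		return false
	}
	return float64(l.deadBytes)/float64(total) >= compactionThreshold
}

// During compaction the surviving pieces are copied into a new log file and
// the old one is removed. TTL pieces go into their own logs grouped by
// expiration date, so an expired log is simply deleted as a whole.
```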

1 Like

In the exposed metrics, used_space{field="recent"} does not take into account the space used by the hashstore. And it's the same on the dashboard display.

I found the correct(?) value using the query:

sum without (satellite) (hashstore{field="LenSet"})

And the sum of the two gives the expected total:

sum by (instance) (
  used_space{field="recent"}
  or
  hashstore{field="LenSet"}
)

Is it the expected behaviour or an early bug?

1 Like

Hashstore is in an alpha stage, so the used space would likely be incorrect.

1 Like

@Alexey, would you be so kind as to fix the title?

1 Like

… before posts start popping up about "what does Hashtore stand for?" :sweat_smile:

1 Like

would you be so kind as to fix the title?

I fixed it

5 Likes

All of my nodes are now fully migrated to hashstore. I started the migration over a month ago I believe.

Compaction usually takes under an hour. Garbage collection takes just seconds. The upload success rate seems to be about the same compared to the old piecestore implementation. So yeah, it seems to work really well on my Pi5 setup.

Edit: My logs say the migration processed about 2 GB of data every 10 minutes, which is roughly 288 GB per day, so a 10 TB node would take about a month (~35 days). That should give you some numbers to estimate how long it will take for your node size.

I don't have numbers on the amount of garbage that is still below the threshold and therefore not freed up by compaction. Maybe I can visualize that on my Grafana dashboard.

4 Likes