[Tech Preview] Hashstore backend for storage nodes

How to execute this script in Windows?

The same way, but you need to use a PowerShell prompt.
The echo command is an alias of the Write-Output cmdlet, so it can be used as-is, but for echo -n (to avoid the newline at the end) you need to use Write-Host -NoNewline instead of echo -n.

2 Likes

I have started a new node with hashstore enabled and it looks promising! I'll try to compare all performance metrics with similar nodes in file mode once I get enough data.

Just a quick question: from what I can see, the log files seem to be allocated "as needed" and not all at once. It definitely helps with the number of files stored on the file system (and all the issues we had with the walker, trash, etc.), but allocating the files like this does not fix all the fragmentation issues, as the 1GB files may end up with many fragments even though they could be a single extent (if the drive has enough contiguous free space).

Log files don't stay forever. If deleted pieces reach 25%, the remaining pieces are copied to a new file, so preallocation doesn't make sense.

For a dedicated disk, they could preallocate the entire disk, but I prefer to use the free space for other things until Storj needs it.

3 Likes

The problem is that if you fill a disk with heavily fragmented 1GB files (current implementation), when one of those files is deleted it will leave lots of small free spaces all around the partition, and when a new 1GB file is allocated to receive the data of the one reaching the 25% threshold, it will get even more fragmented because you don't have any contiguous space left.

I have been dealing with a similar problem for a long time now, and the ideal way to minimize fragmentation is to allocate files at creation time if you know the final size.
Another option is to grow the file by a fixed amount large enough to minimize fragmentation, e.g. allocating a 1GB file in 64MB blocks.
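For illustration, here is a minimal Go sketch of the first idea (hypothetical code, not the actual storagenode implementation): reserve the final size of a log file when it is created, so the filesystem can try to pick one large extent instead of growing the file in many small steps.

```go
package sketch

import "os"

// Assumed target size of a hashstore log file (illustrative only).
const logFileSize = 1 << 30 // 1 GiB

// createPreallocatedLog is a hypothetical helper: it extends the file to its
// final size in a single call before any pieces are appended. On NTFS this
// lets the filesystem allocate one large run of clusters; on Linux you would
// use fallocate instead, since a plain truncate only creates a sparse file.
func createPreallocatedLog(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := f.Truncate(logFileSize); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}
```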

2 Likes

Pieces have various sizes, so it's hard to pre-allocate without wasting too much space.

That was the case before, when we stored individual files, but with the hashstore you know the final size of the log file. And you are going to "waste" some space anyway with the 25% deleted threshold.

Maybe add a parameter similar to pieces.write-prealloc-size that can range from 0 to 1GB, so node operators can run some tests?

PS: Currently, on the fresh node with an empty disk, full 1GB log files have around 1000 fragments each.

Edit:
After storing 30GB of hashstore data on a new empty drive (NTFS, 4K clusters), this is what the fragmentation looks like:

Fragments   Bytes             Clusters
----------- ----------------- -----------
     35 671    29 984 243 356   7 325 287

The average fragment size is around 0.8MB; I'll defragment the drive and let it run without increasing the available space to see how it behaves.

1 Like

I shared your idea with the team.

1 Like

The creator of hashstore:

So, if we switch to this hashstore, should we allocate less space to account for that 25% and the databases?
Like, for a 20TB drive, should we allocate 14.5TB?
I don't see that as an acceptable tradeoff. I will stick with the current mode and badger on.
(Quick remark: the title has a typo… hashtore instead of hashstore)

No, nothing has changed there. Some data would just be in the old backend and some in the new one (if you migrated the node). The new backend configured with hashstore is expected to work as before, just faster.
I would expect less space to be wasted on databases, though.

25% is the threshold at which compaction starts. Perhaps it may become configurable in the future. Right now it seems like an optimal setting to keep a balance between wasted space and IOPS.

I got it. You allocate the space as usual; it will occupy the entire space, but with holes where pieces are deleted, until compaction.

I can only speak for the filesystem I know best (NTFS), and I acknowledge that other file systems work differently.
NTFS is quite good at allocating a file if you give it the final size first. It is quite bad at "growing" a file, and very bad if you ask it to grow in small increments.

Again, only for NTFS, this is not true. Unless you ask the file system to zero-fill the space on allocation, it is very fast at doing so. That's the way SQL Server and others do it, and you get zero latency.
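To illustrate what I mean (a rough sketch of the Win32 calls involved, assuming a 64-bit build; this is not something the node does today): extending the file reserves the clusters, and SetFileValidData is what SQL Server's instant file initialization uses to skip the lazy zero-fill. It requires the SeManageVolumePrivilege.

```go
//go:build windows

package sketch

import (
	"os"
	"syscall"
)

// preallocateInstant is a hypothetical example: it reserves size bytes for
// the file and then moves the valid-data-length marker so NTFS does not have
// to zero-fill the range on first use.
func preallocateInstant(f *os.File, size int64) error {
	// SetEndOfFile (via Truncate) allocates the space up front.
	if err := f.Truncate(size); err != nil {
		return err
	}
	// SetFileValidData(handle, size) skips the zero-fill; without it NTFS
	// zeroes the range lazily when it is first written or read.
	kernel32 := syscall.NewLazyDLL("kernel32.dll")
	setFileValidData := kernel32.NewProc("SetFileValidData")
	ret, _, callErr := setFileValidData.Call(f.Fd(), uintptr(size))
	if ret == 0 {
		return callErr
	}
	return nil
}
```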

I'll try to confirm this, or at the very least show an improvement over the current implementation.

On NTFS it does, and it also makes the MFT larger (slower and uses more RAM).

I forgot about this part, and he is completely right on this one.

Bonus question: do you have any idea of the smallest write IO size?
That would tell whether it's worth testing with a 64K cluster format.

1 Like

Do you mean the log file allocation? I do not think it works like that. It appends pieces while it's possible and marks deletions (only in the hashtable; the log remains untouched, so no holes there). When the amount of deleted pieces is greater than 25%, it will run a compaction. That's how I understand it. TTL data simply goes to a separate log file, grouped by the TTL date, and when the pieces expire, the whole log gets deleted at once.
If you mean the disk allocation, then there are no holes, except the usual FS fragmentation.
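A simplified sketch of that understanding (illustrative names and types, not the real hashstore code):

```go
package sketch

// logFile is an illustrative model of one hashstore log file.
type logFile struct {
	liveBytes int64 // pieces still referenced in the hashtable
	deadBytes int64 // pieces marked deleted; the log itself stays untouched
}

// Compact once 25% or more of a log file's bytes belong to deleted pieces.
const compactionThreshold = 0.25

func needsCompaction(l logFile) bool {
	total := l.liveBytes + l.deadBytes
	if total == 0 {
		return false
	}
	return float64(l.deadBytes)/float64(total) >= compactionThreshold
}

// During compaction the surviving pieces are copied into a new log file and
// the old one is removed. TTL pieces go into their own logs grouped by
// expiration date, so an expired log is simply deleted as a whole.
```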

1 Like

In the exposed metrics, used_space{field="recent"} does not take into account the space used by the hashstore. And it's the same on the dashboard display.

I found the correct(?) value using the query:

sum without (satellite) (hashstore{field="LenSet"})

And the sum of the two gives the expected total:

sum by (instance) (
  used_space{field="recent"}
  or
  hashstore{field="LenSet"}
)

Is it the expected behaviour or an early bug?

1 Like

Hashstore is in an alpha stage, so the used space would likely be incorrect.

1 Like

@Alexey, would you be so kind as to fix the title?

1 Like

… before posts start popping up about "what does Hashtore stand for?" :sweat_smile:

1 Like

would you be so kind as to fix the title?

I fixed it

5 Likes

All of my nodes are now fully migrated to hashstore. I started the migration over a month ago I believe.

Compaction usually takes under an hour. Garbage collection takes just seconds. The upload success rate seems to be about the same compared to the old piecestore implementation. So yeah, it seems to work really well on my Pi5 setup.

Edit: My logs say the migration processed about 2 GB of data every 10 minutes, which is roughly 288 GB per day, so a 10 TB node would take about a month (~35 days). That should give you some numbers to estimate how long it will take for your node size.

I don't have numbers on the amount of garbage that is still below the threshold and therefore not freed up by compaction. Maybe I can visualize that on my Grafana dashboard.

4 Likes