[Tech Preview] Hashstore backend for storage nodes

I've been running hashstore on all my nodes (24) since it was released. I don't have a piecestore node to compare with, but my monthly payout is going up a little every month and the incoming data is about the same, so no alarming changes here. I'm in northern Europe and Asia.

3 Likes

A 4 TB hashstore node with 1 TB free has 4.7 GB of MFT files, while a 4 TB piecestore node with 1 TB free has 22.5 GB of MFT files.

I have only one server with 15 hashstore nodes; all the others are piecestore. Same here: I'm monitoring it, and even though the stored data stays at about the same amount (a lot of old files get deleted), egress is rising from month to month, so payouts are rising.

1 Like

I like piecestore and I do not use Badger… I would rather never even have heard that they're thinking, in any way, about 'rolling it out to all nodes'.

2 cents,

Julio

1 Like

Yes, good thought. As someone who is currently trying to copy files off two failing hard drives, the saving grace of millions of tiny files is that losing a few isn't a big deal.

I haven't been paying attention, but is the hashstore log file size adjustable? For recovery, and also to limit fragmentation, it might be useful to keep them smallish, like 1 GB or even 100 MB.

1 Like

+1. I too prefer piecestore.

I absolutely don't want the periodic compaction process; it will murder the performance of my server for the duration: it cannot be offloaded and will send IO to the disks. My server is not a dedicated Storj machine, it serves other purposes, and I vastly prefer a consistent workload to a bursty one.

The filewalker is not a problem, as it only touches SSDs. Compaction will have to send IO to the disks.

What shall I add to the config files to stay on piecestore indefinitely? It works and isn't broken, so don't fix it.

1 Like

Yeah, it's possible. Here is the utility to do that:

https://review.dev.storj.tools/c/storj/storj/+/17475

(TL;DR: each “entry” in the log file contains metadata right after the raw bytes. If you read the log files from the end to the beginning and skip the data bytes, you can recover the metadata.)
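
Conceptually, the backward scan could look something like this minimal sketch. The footer layout here is made up for illustration (the real record format lives in the storj hashstore code); it only shows the "read from the end, skip the data bytes" idea:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// footer is a hypothetical fixed-size trailer assumed to follow each
// entry's raw piece bytes; the real hashstore layout differs.
type footer struct {
	Key     [32]byte // piece key / hash
	DataLen uint64   // number of raw data bytes preceding this footer
	CRC     uint32   // checksum placeholder (not verified in this sketch)
}

const footerSize = 32 + 8 + 4

// scanBackwards walks a log file from the end to the beginning,
// reading each footer and skipping over the data bytes in front of it.
func scanBackwards(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	st, err := f.Stat()
	if err != nil {
		return err
	}
	offset := st.Size()

	buf := make([]byte, footerSize)
	for offset >= footerSize {
		// Read the footer that ends at the current offset.
		if _, err := f.ReadAt(buf, offset-footerSize); err != nil {
			return err
		}
		var ft footer
		copy(ft.Key[:], buf[:32])
		ft.DataLen = binary.LittleEndian.Uint64(buf[32:40])
		ft.CRC = binary.LittleEndian.Uint32(buf[40:44])

		entrySize := int64(footerSize) + int64(ft.DataLen)
		if entrySize > offset {
			return errors.New("corrupt or truncated entry")
		}
		fmt.Printf("key=%x len=%d at offset %d\n", ft.Key, ft.DataLen, offset-entrySize)

		// Jump over the raw data bytes to the previous entry's footer.
		offset -= entrySize
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: scanlog <logfile>")
		return
	}
	if err := scanBackwards(os.Args[1]); err != nil {
		fmt.Println("scan failed:", err)
	}
}
```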

5 Likes

Is this tool somehow automated, running when the node discovers corruption, or do you need to run it manually?

1 Like

Compaction is way cheaper than the filewalker. If you have no problem with the filewalker, compaction shouldn't be a problem either.

Especially if you have an SSD: you can move the hashtable to the SSD. Compaction can be a few minutes per day (it depends on the node size, the number of deleted records, etc.).

Today we have a configuration option to opt out of the migration (see the patch https://review.dev.storj.tools/c/storj/storj/+/17819 Note: don't use it today; it will be part of the next release.)

But the very long term plan is to remove the piecestore code altogether, IF hashstore proves to be the best option on the public network, too.

There is no point in maintaining two different storage layers if one is superior. (But we are ready to fix problems if something is still painful for the public network.)

3 Likes

Today, it's manual. It's supposed to be very rare for an SN to lose the metadata. And even if it happens, it's better to have a human in the loop to double-check that it's not just a configuration/mounting error.

3 Likes

Aren't they trying to replace the consistent streams of random IO (.sj1 files all over the place) with sequential writes to the 1 GB log files? It's not bursty-vs-consistent that's the problem, as systems are typically very good at handling sequential transfers. It's the tiny random IO that murders the performance of your server.
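
To illustrate the difference in write pattern (directory layout, file names and sizes here are made up for the sketch, not taken from the node code): piecestore creates one small file per piece, while hashstore appends each piece to an already open log file, so the drive mostly sees large sequential writes to a handful of big files:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// piecestore-style: one small file per piece, scattered across prefix
// directories. Every upload pays for a new inode plus directory updates.
func writePieceFile(root, id string, data []byte) error {
	dir := filepath.Join(root, id[:2]) // two-character prefix dirs, as an example
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, id+".sj1"), data, 0o644)
}

// hashstore-style: append the piece to a shared, already open log file
// and remember its offset. The drive sees mostly sequential writes.
func appendToLog(log *os.File, data []byte) (int64, error) {
	offset, err := log.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	_, err = log.Write(data)
	return offset, err
}

func main() {
	data := []byte("piece bytes")

	if err := writePieceFile("blobs", "abcdef0123", data); err != nil {
		fmt.Println("piecestore-style write failed:", err)
	}

	log, err := os.OpenFile("0001.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		fmt.Println("open log failed:", err)
		return
	}
	defer log.Close()

	if off, err := appendToLog(log, data); err == nil {
		fmt.Println("appended at offset", off)
	}
}
```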

3 Likes

This is not necessarily true. Example: a ZFS pool with a special device.

Metadata resides on an SSD, data on HDDs. The IOPS budget of an HDD is severely limited (200-300 IOPS), while on an SSD it's unlimited for all intents and purposes (I have a very beefy Intel P3600; the filewalker saturates the CPU way before the P3600 feels anything). Therefore I don't worry about metadata IO, but I do worry about data IO.

With piecestore we have a lot of IO on the SSD, and we don't have compaction: zero IO to the HDDs.

With hashstore we have less IO to the SSD (which wasn't a problem to begin with) and some IO to the HDD.

Therefore, from my perspective, hashstore is a regression: not only does it add “expensive” HDD IO that did not exist before, it also does it in a bursty manner.

Since the system is designed (in the number and size of data vdevs) to support the required consistent IO pressure, there may not be enough headroom to absorb the node's compaction-generated HDD IO without affecting other tasks.

TL;DR: piecestore IO is “free”, compaction IO is very expensive.

The amount of work needed for compaction will roughly correlate with the amount of TTL data on the node and can be significant in the worst case.
Even if that amount is small, the bursty nature of this new IO workload may still present a problem on tuned systems.

Totally agree.

May I suggest the following then (a rough sketch of what I mean follows the list):

  • Allow defining blackout time periods during which compaction is forbidden from running
  • Allow specifying the maximum IOPS compaction may consume (possibly indirectly, e.g. process no more than N logs in M seconds)
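
Nothing like this exists today as far as I know; all the names and numbers below are hypothetical, not actual node config. A blackout window plus a logs-per-period budget could be as simple as:

```go
package main

import (
	"fmt"
	"time"
)

// compactionPolicy is a hypothetical throttle; nothing like it exists in
// the node today. It combines a daily blackout window with a budget of
// log files that may be compacted per period.
type compactionPolicy struct {
	BlackoutStart int           // hour of day when compaction must pause
	BlackoutEnd   int           // hour of day when compaction may resume
	LogsPerPeriod int           // max log files to compact per period
	Period        time.Duration // length of the budget period
}

func (p compactionPolicy) inBlackout(now time.Time) bool {
	h := now.Hour()
	if p.BlackoutStart <= p.BlackoutEnd {
		return h >= p.BlackoutStart && h < p.BlackoutEnd
	}
	return h >= p.BlackoutStart || h < p.BlackoutEnd // window wraps midnight
}

// compactLoop drains a queue of log files while honoring the blackout
// window and the per-period budget; compactOne stands in for the real work.
func compactLoop(p compactionPolicy, queue []string, compactOne func(string) error) error {
	budget := p.LogsPerPeriod
	reset := time.Now().Add(p.Period)

	for _, name := range queue {
		for p.inBlackout(time.Now()) || budget == 0 {
			if time.Now().After(reset) {
				budget = p.LogsPerPeriod
				reset = time.Now().Add(p.Period)
			}
			time.Sleep(time.Minute) // wait out the blackout or the spent budget
		}
		if err := compactOne(name); err != nil {
			return err
		}
		budget--
	}
	return nil
}

func main() {
	// Example: pause compaction between 03:00 and 04:00 (arbitrary hours),
	// and compact at most 5 log files per hour outside of that window.
	p := compactionPolicy{BlackoutStart: 3, BlackoutEnd: 4, LogsPerPeriod: 5, Period: time.Hour}
	err := compactLoop(p, []string{"0001.log", "0002.log"}, func(name string) error {
		fmt.Println("compacting", name)
		return nil
	})
	if err != nil {
		fmt.Println("compaction aborted:", err)
	}
}
```
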
5 Likes

Reads are still random. Writes are already serialized by ZFS (in a transaction group). So nothing really changes here.

However, hashstore does hide the truth about files inside opaque blobs, so the filesystem cannot optimize access anymore. This is counteracted with a more optimized usage pattern: essentially, it's hiding data inside mini-filesystems, trading the big filesystem's optimizations for the inherent efficiency of a purpose-built data layout. For TTL-heavy nodes on dedicated hardware this evidently proved to be a winning strategy. I just don't believe it is the case for the intended storagenode hosting use case, where the node is not the primary consumer of resources, but rather a guest.

Not all random IO is the same: random IO to an SSD is free, random IO to an HDD is very expensive. Swapping piecestore for hashstore reduces the amount of free random IO and increases the amount of expensive random IO. Even if the reduction in free IO is massive and the increase in expensive IO small, it's still very bad, because the net result is just additional expensive IO, however small it is claimed to be.

To your point, writing a piecestore file is less expensive than writing to the end of a log file: you don't have the seek-to-the-end operation. So it's worse on all fronts, provided metadata access is free, which it is on well-configured servers.

I think 90% of node operators don't use ZFS with an SSD cache.

6 Likes

I am not fully sure about this. Assuming you write and read files, it’s not zero IO.

Maybe you are talking about the walker vs. compaction (but the Badger cache is just a cache; the data still has to be read at least once to populate it).

Writes and reads still need IO, which is way more painful with ext4 and millions of .sj1 files.

Compaction can use sequential reads and writes, which is easier.

I have seen monitoring data from hundreds of nodes. We tried the Badger cache / hashstore / hashstore + SSD for hashstore metadata / hashstore + memtbl / …. We closely followed the IO pressure on the machines. Hashstore was way better, all the time. Just try it out and collect Prometheus data.

We also tested hashstore with the worst-case scenario (a huge amount of TTL data), and it works well.

1 Like

Did you run any NTFS tests?

Remember, though, you have an optimized ZFS setup with metadata on SSD. The average potato node, or even a high-end node that just isn't using ZFS, may not have this setup. And if all of the data is on the disk, then that random I/O (metadata updates, filewalkers, file deletion) is slow.

Also, tangent, but isn’t the ZFS ARC based on blocks, not files? So it might be able to cache part of a hash file? My memory may not be right.

Right, but that is the minimum necessary baseline that has to happen in either case. Then with piecestore you also have tons of free metadata IO, and with hashstore an additional, albeit small, amount of HDD IO: seeks inside log files and copies.

I think this is the crux of the issue. I feel ext4 is an ancient filesystem not suitable for modern workloads. It sucks at handling many small files; it wasn't designed for that. So why are we optimizing for it?

Which brings me to Vadim's and Easy Rhino's points:

I think 90% of people use ZFS. Those that don't should. The remaining 10% are on Btrfs; they haven't migrated to ZFS yet because Synology won't let them.

On a serious note: do we know the actual distribution? Is the filesystem type part of telemetry?

Is the majority of nodes seriously running on Raspberry Pis and various Windows boxes? Nothing wrong with that from the network performance and data durability standpoint, but where is the “use what you have” and “share already online” capacity? I'd expect already-online capacity to be mostly on ZFS arrays, not ext, let alone NTFS. Bringing a Raspberry Pi or an old Windows gaming rig online just to run nodes directly contradicts the spirit of the project.

How can it be sequential? By nature it reads one log while writing a filtered copy into a second one, no? (ZFS would batch the writes, but other filesystems wouldn't; that seems to be the goal here.)
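
What I picture is something like this sketch of the read-filter-append pattern (toy record format and `isAlive` lookup invented for illustration; it's not the actual hashstore code):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readEntry reads one record in a toy format: a 32-byte key, a 4-byte
// little-endian length, then the piece bytes. The real hashstore layout
// is different; this only illustrates the flow of compaction.
func readEntry(r io.Reader) (key [32]byte, raw []byte, err error) {
	var hdr [36]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return key, nil, err // io.EOF here means the old log is exhausted
	}
	copy(key[:], hdr[:32])
	data := make([]byte, binary.LittleEndian.Uint32(hdr[32:36]))
	if _, err = io.ReadFull(r, data); err != nil {
		return key, nil, err
	}
	return key, append(hdr[:], data...), nil
}

// compactLog streams the old log front to back and appends only the
// still-alive entries to the new one: sequential read, sequential write,
// with the liveness checks hitting the (small) hash table instead of disk.
func compactLog(src io.Reader, dst io.Writer, isAlive func([32]byte) bool) error {
	for {
		key, raw, err := readEntry(src)
		if err == io.EOF {
			return nil // old log fully processed; it can be deleted now
		}
		if err != nil {
			return err
		}
		if !isAlive(key) {
			continue // TTL-expired or deleted piece: skip it, reclaiming space
		}
		if _, err := dst.Write(raw); err != nil {
			return err
		}
		// A real implementation would also record the entry's new offset
		// in the hash table; omitted here.
	}
}

func main() {
	// Build a toy "old log" with two entries; only the first is still alive.
	var k1, k2 [32]byte
	k1[0], k2[0] = 1, 2
	var old bytes.Buffer
	for _, e := range []struct {
		key  [32]byte
		data string
	}{{k1, "alive piece"}, {k2, "expired piece"}} {
		old.Write(e.key[:])
		var n [4]byte
		binary.LittleEndian.PutUint32(n[:], uint32(len(e.data)))
		old.Write(n[:])
		old.WriteString(e.data)
	}
	oldSize := old.Len()

	var fresh bytes.Buffer
	alive := map[[32]byte]bool{k1: true}
	if err := compactLog(&old, &fresh, func(k [32]byte) bool { return alive[k] }); err != nil {
		fmt.Println("compaction failed:", err)
		return
	}
	fmt.Printf("old log %d bytes -> new log %d bytes\n", oldSize, fresh.Len())
}
```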

I don't disagree that the node performs better. My concern was with:

  • if it's already great, better than great is still great
  • the impact of making it greater on other services (the additional IO from compaction)

Ok, I will, on a few nodes.

To summarize, I agree with:

I’m sure it’s going to be fine either way.

Mmm… It's a cheap home server built from unwanted parts. I'm not running an AI datacenter at home :).

Caching is worse than metadata access being fast in the first place. My server has about 600 GB of metadata. Having 600+ GB of RAM, let alone that much available for ARC, is not practical. And even if it were, the first access would still be slow…

For me and for a lot of other people, Windows and NTFS work, and will keep working, better than systems that I do not know how to cook.

2 Likes

+1 for this, and for any other periodic housekeeping processes.

A ZFS filesystem with multiple vdevs shouldn't be overly impacted.

It's the low-memory, single-drive systems that would likely suffer, but with fewer files and less metadata, filesystems like NTFS and ext3 will perform better than on piecestore.

1 Like