Looks like some improvements are in the works

In theory it is possible to write a filesystem specifically for Storj that performs better than a general-purpose filesystem. For example, storj-fs would not need the ability to modify a file (piece) - just create and delete it - and would need no permissions or many of the other features that other filesystems have.

Is it worth it? I do not know. Maybe it would help improve performance for the slower nodes, or maybe it will end up like Storj v2 (much worse than using files). The way it is described, it seems like it would work better on SMR drives, but I have no idea how it would work in practice.

A bad sector in the data drive means one corrupted file and one possibly failed audit; the node can still run. A bad sector in the metadata drive would probably be much worse, at least from my experience of a hard drive getting a bad sector in the MFT (NTFS).

2 Likes

Hello! I'm the one who designed and implemented the hashstore that was linked. I'm going to try to respond to some of the critiques in this thread. I honestly do appreciate the discussion and I believe the best ideas are forged in the fires of open discourse.

One of the main arguments I've seen presented is along the lines of "the filesystem is designed for this and has much more investment, so how can you possibly do better?" The main reason I believe we can is that filesystems have to solve a much harder set of problems. For example:

  1. They must present a file hierarchy, which naturally leads to a tree-like structure, necessitating at best O(log n) access times.
  2. The file contents and structure are mutable. This enormously complicates the solution space.
  3. They must comply with POSIX semantics, or something similar.

For the piecestore, we don't have to do any of those things. We have immutable, write-once data in a flat namespace that doesn't need any ordering guarantees. A hash table would be the perfect data structure, providing O(1) reads and writes, which matters when you have a large number of files (hundreds of millions).

The other part of the design that is important is that currently we have many background processes that have to walk over all of the pieces and do work. Specifically, garbage collection and used space determinations come to mind. The hashstore unifies all of the operations that have to walk the pieces into the growth/shrinking of the disk-backed hash table, where each record for a piece fits into 64 bytes. This is much smaller than the internal inode data structures of any filesystem, and the reads/writes happen sequentially. This will drastically decrease the number of iops necessary for the node to do its background processing work, and cuts out many thousands of lines of code.
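To make that concrete, here is a minimal sketch of what such a fixed-size record could look like (the field names and layout are illustrative only, not the exact on-disk format):

```go
// Illustrative 64-byte record; a flat array of these can be scanned and
// rewritten sequentially, unlike per-piece inodes or directory entries.
type Record struct {
	Key      [32]byte // piece ID
	Offset   uint64   // byte offset of the piece inside a log file
	LogID    uint32   // which log file holds the piece data
	Length   uint32   // piece length in bytes
	Expires  uint32   // TTL expiration (0 = never)
	Checksum uint32   // validity check for this record (more on this below)
	Reserved [8]byte  // padding so the record is exactly 64 bytes
}
```

Scanning 100 million such records is a sequential read of roughly 6.4 GB, versus random metadata reads against 100 million inodes.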

For what it's worth, I agree that all of the caches, badger, the TTL databases, and other ancillary structures bolted on to attempt to make operation of the node performant are not the best, but they were added for good reasons: people found performance lacking, and the easiest step to take was to add a little system to make it faster. Do that enough times and you're stuck in a local optimum. This is an attempt to move to a better local optimum than we currently have.

21 Likes

As I understand it, this could be solved by having a database with the pieces, their sizes, etc. somewhere, so there would be no need to periodically walk over all the files.

All in all, I do not know how the new method would perform, especially on my system; maybe it would be better than plain ext4.

How are you going to deal with block sizes? Typically an ext4 block is 4K, and on NTFS the cluster can be up to 64K. Reading/writing unaligned data would be slower, but aligning it would waste some space.

This is exactly the thing that the hashstore is trying to avoid. You still end up needing to periodically walk the pieces in case the database becomes out of sync, or you could state that the database is the source of truth, and that design is effectively the hashstore: a central index of metadata about the pieces, and a set of files containing the piece data.

The first obvious optimization is to combine pieces into larger files so the underlying filesystem has to do less work in its index structure (the inodes). Beyond that, any regular database will do O(log n) reads/writes, whereas a hash table is O(1). The reason normal databases don't use a hash table is that they have to solve a harder problem that requires things like ordering guarantees, etc.

Finally, because we're implementing the hash table ourselves, we get to hook into the background chores where the table needs to grow/shrink to simultaneously do garbage collection and TTL processing, which minimizes the number of iops and greatly simplifies the implementation (a rough sketch follows the list):

  1. There's no need for the hash table to support deletes, so no tombstones, etc. The hash table becomes append-only, which is much simpler than any database.
  2. There are no updates necessary to flag pieces as trash or removed, reducing iops.
  3. It minimizes the amount of data read, since there are no queries needed to find TTL-expired pieces, etc.
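Reusing the hypothetical Record from the sketch above, the rewrite can look roughly like this (an in-memory slice stands in for the real disk-backed table):

```go
// Sketch only: during growth/shrink the table is rewritten, and anything
// that should disappear is simply not copied forward. No tombstones, no
// in-place updates, no separate TTL queries.
func compact(old []Record, now uint32, isGarbage func(Record) bool) []Record {
	kept := make([]Record, 0, len(old))
	for _, rec := range old {
		if rec.Expires != 0 && rec.Expires <= now {
			continue // TTL expired: dropped by omission
		}
		if isGarbage(rec) {
			continue // flagged by garbage collection: also dropped by omission
		}
		kept = append(kept, rec)
	}
	return kept
}
```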

I'm not. As argued above, filesystems are really good and I'm relying on them to do a good job reading byte ranges out of files.

3 Likes

Filesystems are good at doing that for separate files, but in this case they don't see separate files, just one big file. So there is no reason for the filesystem to optimize writes so that certain variable-length pieces of the file can be read faster. What may end up happening is that if a bunch of small unaligned pieces are stored (let's say the block size is 4K, the first piece in the large file is 7K, and all the others are 8K), reading one of them will require a longer read that takes more time (in this example, reading 12K to get an 8K piece).
I do not know how much this would actually affect performance, but for larger block sizes (NTFS can get up to 64K) this may be significant.
Also, on a filesystem with 4K blocks, appending, say, a 15K piece would cause a read-modify-write on the (currently) last block. That may also reduce performance.
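To make the example concrete, here is a small standalone snippet (assuming 4K blocks; this is just arithmetic, not hashstore code) that counts how many blocks an unaligned read touches:

```go
package main

import "fmt"

// blocksTouched returns how many filesystem blocks a read of `length` bytes
// starting at `offset` has to touch.
func blocksTouched(offset, length, block int64) int64 {
	first := offset / block
	last := (offset + length - 1) / block
	return last - first + 1
}

func main() {
	// An 8K piece starting at offset 7K in the big file, 4K blocks:
	fmt.Println(blocksTouched(7<<10, 8<<10, 4<<10) * 4) // 12 KiB read
	// The same 8K piece aligned to a block boundary:
	fmt.Println(blocksTouched(8<<10, 8<<10, 4<<10) * 4) // 8 KiB read
}
```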

Wouldn't you still need to do the O(log n) read to open the large file in the first place? GET requests for pieces will likely be randomly distributed and not concentrated in the same large file.

I think the only way this could be true is if the filesystem is internally padding in the same way that it would have to be padded in the hashstore's log files. I hadn't considered this, but I'm unconvinced that the benefit of aligned reads outweighs the advantages of a more compact layout: better space utilization, perhaps more cache hits, etc.

That said, it wouldn't be too hard to "optimistically pad": if you only need to append a few bytes to get back to 4K alignment (or whatever), do so. I'll ponder this more.
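A minimal sketch of what that decision could look like (the helper name and the 64-byte threshold are made up for illustration):

```go
// padTo returns how many filler bytes to write before appending the next
// piece: close small gaps to the next block boundary, skip large ones.
func padTo(end, align, maxPad int64) int64 {
	gap := (align - end%align) % align
	if gap > 0 && gap <= maxPad {
		return gap // e.g. padTo(currentLogEnd, 4096, 64)
	}
	return 0 // already aligned, or the gap is too large to be worth padding
}
```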

All of the files containing piece data are always held open by the process, so no directory traversal has to happen. I'm presuming that reading a random offset in a file is O(1) iops, so a GET will do an O(1) read from the hash table, then an O(1) read from the indicated log file, as compared to an O(log n) directory traversal to open a file handle for the piece plus an O(1) read for the piece.
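A rough sketch of that read path (the Table type and its Lookup method are stand-ins for the real disk-backed hash table, Record is the hypothetical one sketched earlier, and the standard errors and os packages are assumed):

```go
// Stand-in for the disk-backed hash table.
type Table struct{ m map[[32]byte]Record }

func (t *Table) Lookup(key [32]byte) (Record, bool) { r, ok := t.m[key]; return r, ok }

// get does one O(1) table probe, then one positioned read from a log file
// that is already open: no open() or directory traversal on the hot path.
func get(tbl *Table, logs map[uint32]*os.File, key [32]byte) ([]byte, error) {
	rec, ok := tbl.Lookup(key)
	if !ok {
		return nil, errors.New("piece not found")
	}
	buf := make([]byte, rec.Length)
	_, err := logs[rec.LogID].ReadAt(buf, int64(rec.Offset))
	return buf, err
}
```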

1 Like

So happy to see this work! I have less and less time for playing with side projects, but I hope I'll find a second to look at the code.

The commit message does not state so, but if I may recommend something, designing for a potential recovery of the hash table file from the log files would be nice. Like, if you can somehow detect that the hash table file is corrupted (e.g. due to an unclean shutdown), it should be possible to recover the hash table file (let's say, without the trash flag) by scanning the log files themselves… the relevant data is likely recoverable from the piece header. Even if it is a long process, it's still better than starting from scratch.
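Roughly, I imagine something like the loop below (everything here is hypothetical: readPieceHeader, Insert, and the header fields are stand-ins, and the real piece header format may differ):

```go
// Sketch only: rebuild entries for one log file by walking its piece
// headers in order; stop at EOF or at the first unreadable header.
func rebuildFromLog(tbl *Table, log *os.File, logID uint32) error {
	var offset int64
	for {
		hdr, hdrLen, err := readPieceHeader(log, offset)
		if errors.Is(err, io.EOF) {
			return nil // end of the log file, all recoverable entries rebuilt
		}
		if err != nil {
			return err // truncated or corrupt header
		}
		tbl.Insert(Record{
			Key:     hdr.PieceID,
			LogID:   logID,
			Offset:  uint64(offset) + uint64(hdrLen),
			Length:  hdr.PieceLen,
			Expires: hdr.Expires,
		})
		offset += int64(hdrLen) + int64(hdr.PieceLen)
	}
}
```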

4 Likes

Yep, this was added after the commit message was written. I should review the message and update it :slight_smile:. There's no exported API surface for doing it, but there is a test showing that it should be possible: https://review.dev.storj.io/c/storj/storj/+/14910/18/storagenode/hashstore/store_test.go#578

One design decision of the hash table is that records inside of it are either valid or invalid as determined by a checksum. Specifically, there is no such thing as an "empty" record: their checksum will be invalid. This means that corruption in the hash table is localized to only the pieces that were affected. The read code path attempts to be resilient to invalid records in the probe sequence by ensuring it reads a few pages before reporting an entry is not in the table. This is another advantage over other more complicated index structures that have more catastrophic failure modes.
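As a tiny illustration of that property (CRC-32 and this particular 60+4 split are assumptions for illustration, not the actual format):

```go
package sketch

import (
	"encoding/binary"
	"hash/crc32"
)

// recordValid trusts a 64-byte slot only if the checksum over its first
// 60 bytes matches the last 4 bytes.
func recordValid(raw [64]byte) bool {
	stored := binary.LittleEndian.Uint32(raw[60:64])
	// An all-zero (never-written) slot fails too, since the CRC of 60 zero
	// bytes is nonzero; there is no separate notion of an "empty" record.
	return crc32.ChecksumIEEE(raw[:60]) == stored
}
```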

2 Likes

This depends on how fragmented the file is. Modern filesystems usually have an extent tree to store information on the location of file parts. We should expect this tree to have a small number of nodes if the file is written sequentially, but, let's say, hole punching or writing to multiple files concurrently may prevent that and allocate sectors alternating between files. Then the tree might grow.

Obviously the exact implementation depends on the file system and operating system… ext4 makes some attempt to not allocate the sectors right after a file's last extent to other files, exactly to support the sequential-writes case, but I don't know how good the result is.

I'm thinking more about a case where an interrupted write erases a large chunk of a hash table file. If it's possible, maybe the code should check the checksums of all entries at startup?

2 Likes

How big are those files? It is potentially a lot of open files. I don't know if this would have some bad side effects though.

And yeah, those files may be fragmented.

Indeed, but because of the relatively small number of log files, and because they will have active open file handles, I expect these trees will be in memory and will not have to be read from disk to determine where to start reading. Even in the worst case where it does need to read from disk, I'd expect a good filesystem to store this information succinctly, so that it wouldn't need to read much to get the info.

I picked a number out of a hat and got 10GB. You'd like it to be large enough that you don't have too many files, but small enough that rewriting isn't super expensive and the blast radius of a lost file is smaller. On a 10TB drive this will only be about 1k open files, which seems reasonable to me.

2 Likes

In any case, you can't avoid having O(log N) somewhere. Either it's an extent tree, or the btree/htree storing directory entries, or the multilevel directory structure. What you can hope for is that all these data structures are small enough to be kept in the RAM cache, and with this design I do find that likely to happen, so that's great.

4 Likes

I don't think it does. Given that the log files have a maximum size, and an extent has a minimum size it allocates, there is an upper bound on the amount of extent metadata. That makes seeking O(1) iops in the size of the file. Additionally, the n's being talked about are different: the O(log n) for the inode/tree traversal is in the number of pieces, and seeking into a file obviously doesn't scale with the number of pieces, making it O(1) in the number of pieces.

All that said, I think we're mostly in agreement. I'm trying to use O-notation to describe algorithmic differences in the layout: a hash table + bounded seek compared to a tree traversal, and as you noted the real thing that matters is how many iops happen in practice.

3 Likes

I'm using some old 128GB SSDs and the TBWs are already very high. I think they won't make it. So the probability that the high-IO SSD dies first is much higher than for the low-IO HDD.

If the concern is endurance-related wear, and you wear out the SSDs at the same rate, mirroring them is 100% pointless. A mirror helps guard against uncorrelated failures.

I would throw away any consumer SSD and spend $10 to get an enterprise one, used, of the same capacity, regardless of wear, for two reasons:

  • Well-designed SSDs fail read-only. It's 100% harmless: metadata will just continue being written to the main array, and you can replace the SSD when that happens, no hurry.
  • PLP (power loss protection) will save the day if you lose power.
2 Likes

Bingo! This is what I was about to write, so I won't, and instead will continue the thought:

In the best case we are trading a smaller memory footprint (all those structures have to live in RAM, otherwise we are back to 10ms latency on every access) for an increase in code complexity and maintenance work in the future.

Option one: do nothing. Next-generation hardware will have more RAM. No new code means no new bugs and no new chores. The problem will solve itself while we do nothing.

Option two: write new code. This will lower overall stability, distract dev efforts from other tasks, and will need to be maintained, debugged, and supported forever.

I'd choose "sit on my hands and wait" over "do extra work + create extra work for the future for dubious benefit". I would never create a solution with indefinite maintenance requirements for a temporary problem.

I'll address the rest of @zeebo's points separately later tonight, as they deserve a more substantial response than on-the-go ranting can afford.

This will overall be less code and complexity when you account for all of the extra data structures and caches necessary for the "just store the files on disk" strategy to work. We need to do garbage collection, TTL cleanup, trash + restore management, disk space usage estimation/accounting, and probably more. These things became a big enough problem that we have code to run some of them in a separate process with a lower I/O priority, and that persists its state so that it can make incremental progress. This is just an off-the-top-of-my-head recollection of the complexity inherent in the "just store the files on disk" strategy. I really wish it were that simple.

The hashstore is designed to be very easy to implement, with very few failure modes. The line count is sitting at around 1500, plus 1200 lines of tests. That is at least an order of magnitude less code for bugs to hide in than our current situation or any filesystem.

I can understand that you think I have more important things to work on. I disagree. I'm not a machine that can be assigned any task and do world-class work on it (unfortunately), but there are some tasks where I can. I believe this is one of them.

You're making a lot of the same arguments that were made when I implemented drpc to replace our usage of gRPC. How could one possibly beat a well-tested RPC mechanism from a large company with teams of engineers optimizing it in wide-scale production usage? The answer was by reducing scope to what is necessary, caring and agonizing over every line of code, having a solid understanding of the underlying systems at work, and picking algorithms and data structures well suited to the task. I believe it will work here too.

edit: I forgot to say, I'm looking forward to your more substantial response. :slight_smile:

12 Likes

The new hashstore seems like a great idea. I can see how that would perform much better.

Also copying a node to a new disk would probably take hours instead of days with thousands of large files instead of millions of small files.

3 Likes

It was not the case. The main reason was the hard limit on the number of levels, thus the maximum shared space was limited too. And these levels were not optimally used. To not waste space, you had to always share the maximum possible space, even if you did not have it.

Of course they did, and they are doing so right now. There are benchmark tools in the repo, so you can check it yourself.
If the PoC works as expected and is as fast as already measured, it could be a great upgrade in performance independently of the FS used.

This may be addressed by Community developers right now. According to the GitHub download stats, FreeBSD is unfortunately less popular than Linux or Windows.

Of course they do. Did you try any of the performance tools?

It also stores all the profiles, which you can analyze.

No, it cannot: refs

I hope this won't be enabled by default and won't be the only option to store data?