In theory it is possible to write a filesystem specifically for Storj that performs better than a general-purpose filesystem. For example, storj-fs would not need the ability to modify a file (piece) - just create and delete it - and it would not need permissions or some of the other features that general filesystems have.
Is it worth it? I do not know. Maybe it would help improve performance for the slower nodes, maybe it would end up like Storjv2 (much worse than using files). The way it is described, it seems like it would work better on SMR drives, but I have no idea how it would behave in practice.
A bad sector in the data drive means one corrupted file and one possibly failed audit; the node can still run. A bad sector in the metadata drive would probably be much worse, at least from my experience of a hard drive getting a bad sector in the MFT (NTFS).
Hello! I'm the one who designed and implemented the hashstore that was linked. I'm going to try to respond to some of the critiques in this thread. I honestly do appreciate the discussion and I believe the best ideas are forged in the fires of open discourse.
One of the main arguments I've seen presented is along the lines of "the filesystem is designed for this and has much more investment, so how can you possibly do better?" The main reason I believe so is that filesystems have to solve a much harder set of problems. For example,
They must present a file hierarchy, which naturally leads to a tree-like structure, necessitating at best O(log n) access times.
The file contents and structure are mutable. This enormously complicates the solution space.
They must comply with POSIX semantics, or something similar.
For the piecestore, we don't have to do any of those things. We have immutable, write-once data in a flat namespace that doesn't need any ordering guarantees. A hash table would be the perfect data structure, providing O(1) reads and writes, which matters when you have a large number of files (hundreds of millions).
The other part of the design that is important is that currently we have many background processes that have to walk over all of the pieces and do work. Specifically, garbage collection and used space determinations come to mind. The hashstore unifies all of the operations that have to walk the pieces into the growth/shrinking of the disk-backed hash table, where each record for a piece fits into 64 bytes. This is much smaller than the internal inode data structures of any filesystem, and the reads/writes happen sequentially. This will drastically decrease the amount of iops necessary for the node to do its background processing work, and cuts out many thousands of lines of code.
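To make the "64 bytes per piece" point concrete, here is a minimal sketch of what such a fixed-size record could look like in Go. The field names and sizes are my own illustrative guesses, not the actual hashstore on-disk format:

```go
package hashstore

// Record is a hypothetical 64-byte hash table entry. The fields below
// are illustrative assumptions, not the real hashstore layout:
// 32 + 4 + 4 + 8 + 4 + 4 + 8 = 64 bytes.
type Record struct {
	PieceID  [32]byte // key: piece identifier
	LogFile  uint32   // which log file holds the piece data
	Length   uint32   // piece length in bytes
	Offset   uint64   // byte offset of the piece within that log file
	Created  uint32   // creation timestamp (truncated)
	Expires  uint32   // TTL expiration; zero means no TTL
	Checksum uint64   // record checksum; an invalid value marks the slot unused/corrupt
}
```

Walking an array of such records is a purely sequential scan, which is where the iops savings over per-piece inode lookups would come from.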
For what it's worth, I agree that all of the caches, badger, the TTL databases, and other ancillary structures bolted on to attempt to make operation of the node performant are not the best, but they were added for good reasons: people found performance lacking, and the easiest step to take was to add a little system to make it faster. Do that enough times and you're stuck in a local optimum. This is an attempt to move to a better local optimum than we currently have.
As I understand it, this could be solved by having a database with the pieces, their sizes, etc. somewhere, so there would be no need to periodically walk over all the files.
All in all, I do not know how the new method would perform, especially on my system, maybe it would be better than plain ext4.
How are you going to deal with block sizes? Typically an ext4 block is 4K; on NTFS the cluster can be up to 64K. Reading/writing unaligned data would be slower, but aligning it would waste some space.
This is exactly the thing that the hashstore is trying to avoid. You still end up needing to periodically walk the pieces in case the database becomes out of sync, or you could state that the database is the source of truth, and that design is effectively the hashstore: a central index of metadata about the pieces, and a set of files containing the piece data.
The first obvious optimization is to combine pieces into larger files so the underlying filesystem has to do less work in its index structure (the inodes). Then, every database will do O(log n) reads/writes whereas a hash table is O(1). The reason why normal databases don't use a hash table is they have to solve a harder problem that requires things like ordering guarantees, etc.
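As an illustration of the "combine pieces into larger files" idea, here is a minimal Go sketch (my own simplification, not the hashstore code) of writing a piece to the end of an already-open log file and remembering where it landed; the filesystem only ever sees one sequential append:

```go
package sketch

import (
	"io"
	"os"
	"sync"
)

// entry records where a piece lives inside a log file.
type entry struct {
	offset int64
	length int64
}

type logStore struct {
	mu    sync.Mutex
	log   *os.File           // one large append-only log file, kept open
	index map[[32]byte]entry // stands in for the disk-backed hash table
}

// put appends the piece to the end of the log and remembers its
// location: one sequential write plus an O(1) index update, instead of
// creating a new file (and inode) per piece.
func (s *logStore) put(pieceID [32]byte, data []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	off, err := s.log.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if _, err := s.log.Write(data); err != nil {
		return err
	}
	s.index[pieceID] = entry{offset: off, length: int64(len(data))}
	return nil
}
```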
Finally, because we're implementing the hash table ourselves, we get to hook into the background chores where the table needs to grow/shrink to simultaneously do garbage collection and TTL processing, which minimizes the number of iops and greatly simplifies the implementation (a rough sketch follows the list below):
There's no need for the hash table to support deletes, so no tombstones, etc. The hash table becomes append only, which is much simpler than any database.
There are no updates necessary to flag pieces as trash or removed, reducing iops.
It minimizes the amount of data read, in that there are no queries needed to find pieces that are TTL expired, etc.
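Here is a minimal Go sketch of that compaction idea, with assumed field names and a hypothetical `keep` callback standing in for something like a bloom filter check; it is not the hashstore's actual code:

```go
package sketch

import "time"

// rec is a simplified stand-in for a 64-byte hash table record; the
// fields are assumptions for illustration only.
type rec struct {
	pieceID [32]byte
	expires time.Time // zero means no TTL
}

// rewrite copies surviving records into the new (grown or shrunk)
// table in one sequential pass. TTL-expired pieces and pieces the
// garbage collector no longer wants are dropped simply by not copying
// them: no per-piece delete queries, no tombstones, no update-in-place.
func rewrite(old []rec, now time.Time, keep func(pieceID [32]byte) bool, insert func(rec)) {
	for _, r := range old {
		if !r.expires.IsZero() && now.After(r.expires) {
			continue // TTL expired
		}
		if !keep(r.pieceID) {
			continue // garbage collected
		}
		insert(r) // survives into the new table
	}
}
```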
I'm not. As argued above, filesystems are really good and I'm relying on them to do a good job reading byte ranges out of files.
Filesystems are good at doing that for separate files, but in this case they don't see separate files, just one big file. So there is no reason for the filesystem to optimize writes so that certain variable-length pieces of the file can be read faster. What may end up happening is that if a bunch of small pieces are stored unaligned (let's say the block size is 4K, the first piece in the large file is 7K, and all the others are 8K), reading them will cause longer reads that take more time (in this example, reading 12K to fetch an 8K piece).
I do not know how much this would actually affect performance, but for larger block sizes (NTFS can go up to 64K) this may be significant.
Also, on a filesystem with 4K blocks, appending, say, a 15K piece would cause a read-modify-write on the (currently) last block. That may also reduce performance.
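A quick Go sketch of the alignment arithmetic behind that example (illustrative only, not anything from the node code):

```go
package main

import "fmt"

// blocksTouched returns how many filesystem blocks a read of `length`
// bytes starting at `offset` crosses, and how many bytes the
// filesystem actually has to fetch for it.
func blocksTouched(offset, length, blockSize int64) (blocks, bytesRead int64) {
	first := offset / blockSize
	last := (offset + length - 1) / blockSize
	blocks = last - first + 1
	return blocks, blocks * blockSize
}

func main() {
	// The example above: 4K blocks, a 7K piece at the start of the log
	// file, then 8K pieces. The second piece starts at offset 7K, so an
	// 8K read spans three blocks and pulls in 12K from disk.
	blocks, bytes := blocksTouched(7*1024, 8*1024, 4*1024)
	fmt.Println(blocks, bytes) // 3 12288
}
```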
Wouldn't you still need to do the O(log n) read to open the large file in the first place? GET requests for pieces will likely be randomly distributed and not concentrated in the same large file.
I think the only way this could be true is if the filesystem is internally padding in the same way that it would have to be padded in the hashstore's log files. I hadn't considered this, but I'm unconvinced that the benefits from aligning reads outweigh the advantages of a more compact layout: better space utilization, perhaps more cache hits, etc.
That said, it wouldn't be too hard to "optimistically pad", where if you only need to append a few bytes to get back to 4k alignment (or whatever), it does so. I'll ponder this more.
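A minimal sketch of what that could look like, with an illustrative threshold policy (whether the hashstore adopts anything like this is left open in the post above):

```go
package sketch

// padTo returns how many filler bytes to append after a piece so that
// the next piece starts block-aligned, but only when the cost is small.
func padTo(endOffset, blockSize, maxPad int64) int64 {
	pad := (blockSize - endOffset%blockSize) % blockSize
	if pad <= maxPad {
		return pad // cheap to realign, so do it
	}
	return 0 // too much wasted space; keep the layout compact
}
```

For example, with 4K blocks and a 1K threshold, a piece ending at 7K gets 1K of padding, while one ending at 6K plus one byte gets none (it would need 2047 bytes).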
All of the files containing piece data are always held open by the process, so no directory traversal has to happen. I'm presuming that reading a random offset into a file is O(1) iops, so a get will do an O(1) read from the hash table, then an O(1) read from the indicated log file, as compared to an O(log n) directory traversal to open a file handle for the piece plus an O(1) read for the piece.
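The second step of that get path is just a positioned read on an already-open handle; a hedged Go sketch (not the actual hashstore code):

```go
package sketch

import "os"

// readPiece performs the second O(1) step described above: given the
// log file handle, offset, and length that the hash table lookup
// produced, a single positioned read fetches the piece. No open() or
// directory traversal happens per piece, because the log files are
// already held open.
func readPiece(log *os.File, offset int64, length int) ([]byte, error) {
	buf := make([]byte, length)
	if _, err := log.ReadAt(buf, offset); err != nil {
		return nil, err
	}
	return buf, nil
}
```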
So happy to see this work! I'm having less and less time for playing with side projects, but I hope I'll find a second to look at the code.
The commit message does not state so, but if I may recommend something, designing for a potential recovery of the hash table file from the log files would be nice. Like, if you can somehow detect that the hash table file is corrupted (e.g. due to an unclean shutdown), it should be possible to recover the hash table file (let's say, without the trash flag) by scanning the log files themselves… the relevant data is likely recoverable from the piece header. Even if it is a long process, it's still better than starting from scratch.
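A rough Go sketch of what such a recovery scan could look like, assuming a purely hypothetical length-prefixed frame format with the piece ID in a header (the real hashstore log layout is not described in this thread and may differ):

```go
package sketch

import (
	"encoding/binary"
	"errors"
	"io"
	"os"
)

// rebuildIndex walks a log file from the start and re-derives
// (pieceID, offset, length) for each piece so the hash table can be
// rebuilt. Assumed frame format: 32-byte piece ID, 4-byte big-endian
// length, then the piece data.
func rebuildIndex(log *os.File, add func(pieceID [32]byte, offset int64, length uint32)) error {
	var offset int64
	var hdr [36]byte
	for {
		if _, err := io.ReadFull(log, hdr[:]); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // clean end of log
			}
			return err // torn tail or I/O error: stop here, keep what we have
		}
		var pieceID [32]byte
		copy(pieceID[:], hdr[:32])
		length := binary.BigEndian.Uint32(hdr[32:36])

		add(pieceID, offset+int64(len(hdr)), length)

		// skip over the piece data to the next header
		if _, err := log.Seek(int64(length), io.SeekCurrent); err != nil {
			return err
		}
		offset += int64(len(hdr)) + int64(length)
	}
}
```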
One design decision of the hash table is that records inside of it are either valid or invalid as determined by a checksum. Specifically, there is no such thing as an "empty" record: their checksum will be invalid. This means that corruption in the hash table is localized to only the pieces that were affected. The read code path attempts to be resilient to invalid records in the probe sequence by ensuring it reads a few pages before reporting an entry is not in the table. This is another advantage over other more complicated index structures that have more catastrophic failure modes.
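A toy Go sketch of that validity check and the bounded probe sequence; the record layout, checksum choice, and probe limit are my assumptions, not the hashstore's actual parameters:

```go
package sketch

import (
	"encoding/binary"
	"hash/crc64"
)

var crcTable = crc64.MakeTable(crc64.ISO)

// valid reports whether a 64-byte slot holds a real record: here the
// last 8 bytes are assumed to be a checksum over the first 56. A
// never-written, all-zero slot fails this check, so there is no
// separate "empty" marker, and corruption stays localized to the
// affected slots.
func valid(slot [64]byte) bool {
	want := binary.BigEndian.Uint64(slot[56:])
	return crc64.Checksum(slot[:56], crcTable) == want
}

// maxProbe bounds the probe sequence: 256 * 64 bytes is four 4K pages
// of records. The number is illustrative.
const maxProbe = 256

// lookup scans a bounded run of consecutive slots starting at the
// key's home position, skipping past invalid slots for a few pages'
// worth of records before concluding the key is absent.
func lookup(slots [][64]byte, home int, match func([64]byte) bool) ([64]byte, bool) {
	for i := 0; i < maxProbe; i++ {
		slot := slots[(home+i)%len(slots)]
		if valid(slot) && match(slot) {
			return slot, true
		}
	}
	return [64]byte{}, false
}
```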
This depends on how fragmented the file is. Modern filesystems usually have an extent tree to store information on the location of file parts. We should expect this tree to have a small number of nodes if the file is written sequentially, but, let's say, hole punching or writing to multiple files concurrently may prevent that and allocate sectors alternating between files. Then the tree might grow.
Obviously the exact implementation depends on the file system and operating system… ext4 makes some attempt not to allocate sectors right after the last extent of a file to other files, exactly to support the sequential-write case, but I don't know how good the result is.
I'm more thinking about a case where an interrupted write erases a large chunk of a hash table file. If it's possible, maybe the code should check checksums of all entries at startup?
Indeed, but I expect that, due to the relatively small number of log files and the fact that they will have active open file handles, these trees will be in memory and will not have to be read from disk to determine where to start reading. Even in the worst case where it does need to read from disk, I'd expect a good filesystem to store this information succinctly, so that it wouldn't need to read much from disk to get the info.
I picked a number out of a hat and got 10GB. You'd like it to be large enough that you don't have too many files, but small enough to make rewriting not super expensive and the blast radius of a lost file smaller. On a 10TB drive this will only be about 1k open files, which seems reasonable to me.
In any case, you can't avoid having O(log N) somewhere. Either it's an extent tree, or the btree/htree storing directory entries, or the multilevel directory structure. What you can hope for is that all these data structures are small enough to be kept in the RAM cache, and with this design I do find this likely to happen, so that's great.
I don't think it has to go somewhere. Given that the log files have a maximum size and an extent has a minimum size it allocates, there is an upper bound on the amount of extent metadata. That makes seeking O(1) iops in the size of the file. Additionally, the n's being talked about are different: the O(log n) in the inode/tree traversal discussion refers to the number of pieces, and seeking into a file obviously doesn't scale with the number of pieces, making it O(1) in the number of pieces.
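To put rough, purely illustrative numbers on that bound (these are my assumptions, not figures from the thread): if a 10GB log file is written mostly sequentially and its extents average even a modest 8MB, that is at most 10GB / 8MB ≈ 1,280 extents per log file, i.e. on the order of tens of kilobytes of extent metadata, which comfortably stays cached in RAM.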
All that said, I think we're mostly in agreement. I'm trying to use O-notation to describe algorithmic differences in the layout: a hash table + bounded seek compared to a tree traversal, and as you noted the real thing that matters is how many iops happen in practice.
I'm using some old 128 GB SSDs and the TBW figures are already very high. I think they won't make it. So the probability that the high-I/O SSD dies first is much higher than for the low-I/O HDD.
If the concern is endurance-related wear, and you wear out SSDs at the same rate, mirroring them is 100% pointless. A mirror helps guard against uncorrelated failures.
I would throw away any consumer SSD and spend $10 to get a used enterprise one of the same capacity, regardless of wear, for two reasons:
well-designed SSDs fail read-only. It's 100% harmless: metadata will just continue being written to the main array, and you will replace the SSD when that happens, no hurry
Bingo! This is what I was about to write, so I won't, and will instead continue the thought:
In the best case, we are trading a smaller memory footprint (all those structures have to live in RAM; otherwise we are back to 10 ms latency on every access) for an increase in code complexity and maintenance work in the future.
Option one: do nothing. Next-generation hardware will have more RAM. No new code means no new bugs and no new chores. The problem will solve itself by us doing nothing.
Option two: write new code. This will lower overall stability, distract dev effort from other tasks, and will need to be maintained, debugged, and supported forever.
I'd choose "sit on my hands and wait" over "do extra work + create extra work for the future for dubious benefit". I would never create a solution with indefinite maintenance requirements for a temporary problem.
I'll address the rest of @zeebo's points separately later tonight, as they deserve a more substantial response than on-the-go ranting can afford.
This will overall be less code and complexity when you account for all of the extra data structures and caches necessary for the "just store the files on disk" strategy to work. We need to do garbage collection, TTL cleanup, trash + restore management, disk space usage estimation/accounting, and probably more. These things became a big enough problem that we have code to run some of them in a separate process with a lower I/O priority, and that persists its state so that it can make incremental progress. This is just an off-the-top-of-my-head recollection of the complexity inherent in the "just store the files on disk" strategy. I really wish it were that simple.
The hashstore is designed to be very easy to implement with very few failure modes. The line of code count is sitting at around 1500, with 1200 lines of tests. This is at least an order of magnitude less code for bugs to hide in than our current situation or any filesystem.
I can understand that you think I have more important things to work on. I disagree. I'm not a machine that can be assigned any task and do world-class work on it (unfortunately), but there are some tasks where I can. I believe this is one of them.
You're making a lot of the same arguments that were made when I implemented drpc to replace our usage of gRPC. How could one possibly beat out a well-tested RPC mechanism by a large company with teams of engineers optimizing it in wide-scale production usage? The answer was by reducing scope to what is necessary, caring and agonizing over every line of code, having a solid understanding of the underlying systems at work, and picking algorithms and data structures well suited to the task. I believe it will work here too.
edit: I forgot to say, I'm looking forward to your more substantial response.
That was not the case. The main reason was the hard limit on the number of levels, and thus the maximum shared space was limited too. And these levels were not used optimally: to not waste space, you had to always share the maximum possible space, even if you did not have it.
Of course they did, and they are doing so right now. There are benchmark tools in the repo, so you can check it yourself.
If the PoC works as expected and is as fast as already measured, it could be a great performance upgrade independently of the FS used.
This may be addressed by Community developers right now. According to the GitHub download stats, FreeBSD is unfortunately less popular than Linux or Windows.
Of course they do. Did you try any of the performance tools?
It also stores all profiles, which you can analyze.