“When a new piece comes in, an open and available log file handle is selected and claimed and the piece is written into that log file. Then the log file is flushed to disk and returned. If the piece data is canceled, the log file is rewound to the beginning of that piece and returned.”
What happens if piece A is written to the log, piece B is written to the log, then piece A is canceled? Doesn’t seem like you can rewind to the start of piece A or you’ll lose B, and with enough concurrent activity, you could have this situation on several pieces simultaneously. Seems like you have to mark piece A deleted and leave it in the log file unless it is at EOF; then you can do the rewind trick.
My initial understanding is that there are multiple log files, at least one per file upload in progress. So the worst that can happen is that in case of an interrupted upload, we rewind a single log file without touching other log files accepting other files concurrently.
I’ll chime in and note that this will really only be an improvement if your node is having trouble keeping up with the load. If your node already handles the load easily, then this change would probably alleviate wear on the disk, but things will not get much faster (being network-bound rather than I/O-bound).
On the other hand, if your node is slammed, under a lot of load, this could make a huge difference.
“Claimed” in the description there refers to a lock. The log file will be locked while piece A is being written and released (“returned”) once the piece is committed. Piece B would go in a different log file.
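As a rough sketch of that claim/commit/rewind cycle (the type and method names below are made up for illustration, not taken from the actual change): because the per-log-file lock is held for the whole write, a rewind can only ever discard the piece that was just being written, never someone else’s data.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// logFile is a hypothetical stand-in for one append-only log file. The
// mutex is the "claim": it is held while a single piece is being written
// and released ("returned") once that piece is committed or rolled back,
// so piece B always lands in a different (unclaimed) log file.
type logFile struct {
	mu     sync.Mutex
	f      *os.File
	offset int64
}

// writePiece claims the log file, appends the piece, flushes, and commits
// the new offset. If the upload is canceled or the write fails, it rewinds
// to the start of this piece.
func (lf *logFile) writePiece(piece []byte, canceled func() bool) error {
	lf.mu.Lock()
	defer lf.mu.Unlock()

	start := lf.offset
	if _, err := lf.f.WriteAt(piece, start); err != nil {
		return lf.rewind(start, err)
	}
	if canceled() {
		return lf.rewind(start, nil)
	}
	if err := lf.f.Sync(); err != nil {
		return lf.rewind(start, err)
	}
	lf.offset = start + int64(len(piece))
	return nil
}

// rewind truncates the file back to the start of the aborted piece.
func (lf *logFile) rewind(start int64, err error) error {
	if terr := lf.f.Truncate(start); terr != nil && err == nil {
		err = terr
	}
	return err
}

func main() {
	f, err := os.CreateTemp("", "log-*.dat")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	lf := &logFile{f: f}
	if err := lf.writePiece([]byte("piece A"), func() bool { return false }); err != nil {
		panic(err)
	}
	fmt.Println("committed, next offset:", lf.offset)
}
```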
The biggest benefit from my PoV will be the fact that you will be able to meaningfully use the same hardware for non-Storj purposes. I’m thinking of nodes that barely keep up, or non-Storj purposes eating a lot of I/O, leading to Storj nodes being slow and losing races.
I would not want that. I do not see a reason to load the disks to 100% IO for a long time to copy 16TB to the new format. That would just invite data loss.
I skimmed over the proposed change. I don’t have enough knowledge about golang to speak about, let’s say, types… or the synchronization primitives, though it does feel weird that golang does not provide them out of the box.
I’m not sure about some of the design choices. Please correct me where I’m wrong; as I wrote, this was just a cursory look at the code.
Open addressing together with an attempt to accommodate bad writes leads to this cumbersome process of heuristic lookups. What I proposed in the past was fixed-size buckets; while we would then need to control the size of each bucket, there would be no need for heuristics.
The target of “just under” 25% load factor for hash table compaction means the effective memory requirement for caching is 64 bytes / 0.25 = 256 bytes of RAM per record after compaction. This is not much better than default ext4, and actually noticeably worse than optimized ext4. Given that the aim is to have this data structure in fast storage, a higher target load factor should be acceptable. That’s also why I was fighting for each byte in my proposal: 43 bytes at a load factor of around 40% after a fixed-bucket-based hash table resize (an empirical result from simulations; I couldn’t find any good derivation for this number) comes to around 108 bytes per record, and we don’t need to rewrite the whole hash table on a daily basis, so on average the effective load factor will be higher (around 56%, again from simulations).
Many log files might need compaction in a single run. I can imagine a case from the summer testing where tens of these 10GB log files are all marked for compaction. This will lead to lots and lots of random reads with the current code, because iteration goes roughly in key order, and not, let’s say, in the physical order of pieces stored in the log files. The result is that the process may become very time-consuming, and any interruption will require starting from scratch.
The above problem is probably compounded by the two-store approach. If one of the stores is stuck in compaction, and the other fills up its hash table in that time, we can no longer do anything.
Of smaller things, scaling up the hash table seems to lead to lots of random writes into the new hash table. Assume the old hash table was of size N, and the new one will be 2N. You want an item from old slot k to go into either slot 2k or 2k+1 (so, unless a slot is taken, a single page). This makes the hash table rewrite close to a sequential write, as going slot by slot will often lead to writing to nearby slots. Instead, in the current proposal it randomly goes to either k or k + N, depending on a single bit of the piece ID (so, one of two pages far from each other). A change from rec.index()&h.mask to rec.index() >> (64 - h.lrec) might be enough.
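For illustration (a standalone sketch, not the proposal’s actual code — rec.index(), h.mask, and h.lrec are the names used in the change, while the functions and values below are made up), here is the difference between indexing by the low bits and by the high bits of the hash when the table doubles:

```go
package main

import "fmt"

// slotLowBits indexes by the low bits of a 64-bit hash (the current
// rec.index()&h.mask behavior). When the table doubles from 1<<lrec to
// 1<<(lrec+1) slots, an item in old slot k lands in either k or k+N,
// two locations far apart in the new table.
func slotLowBits(hash uint64, lrec uint) uint64 {
	return hash & (uint64(1)<<lrec - 1)
}

// slotHighBits indexes by the high bits of the hash (the suggested
// rec.index() >> (64 - h.lrec)). When the table doubles, an item in old
// slot k lands in slot 2k or 2k+1, so rewriting the old table slot by
// slot produces nearly sequential writes into the new one.
func slotHighBits(hash uint64, lrec uint) uint64 {
	return hash >> (64 - lrec)
}

func main() {
	const lrec = 4 // old table: 16 slots, new table: 32 slots
	hash := uint64(0xDEADBEEF12345678)

	fmt.Println("low bits: ", slotLowBits(hash, lrec), "->", slotLowBits(hash, lrec+1))
	fmt.Println("high bits:", slotHighBits(hash, lrec), "->", slotHighBits(hash, lrec+1))
}
```

With high-bit indexing, rewriting the old table slot by slot touches the new table almost monotonically, which is what makes the resize close to a sequential write.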
And I’m still not sure about writing to potentially tens of log files concurrently. I see that what is proposed here will not be worse than what filestore does now, but my feeling is that we should be able to do better than that.
On a positive side, this design will make it easy in the future to have a separate set of log files for TTL/trashed data. Let’s say, log files per TTL day—then, if a customer with ingress patterns similar to the one tested over the summer comes along, it will be very quick to expire and remove all these pieces together by just removing whole log files. I would say this property alone would be a killer feature over using some fancier file systems like zfs…
Thanks for looking it over! This is amazing feedback. I’ll try to respond point by point.
It does provide a bunch of synchronization primitives out of the box, but there are kind of two worlds that don’t compose well together: mutexes and channels. The main benefit of channels is that you can use select to attempt multiple channel operations simultaneously and pick the first one that becomes available. In my case, I wanted something like an RWMutex that allows multiple “readers” exclusive with a single “writer”, but I also wanted to be able to cancel acquiring the lock if the incoming request is canceled. Contexts signal cancellation with a channel, so I had to implement this mutex using channels, so that I could select on acquiring the lock and on the reasons the acquisition may be canceled.
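As a minimal sketch of that pattern (not the actual hashstore lock, which is RW-style and more involved; the names below are made up): a channel of capacity one acts as the mutex, and acquisition races against the context’s cancellation channel in a single select.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// chanMutex is a lock built from a channel so that acquiring it can race
// against context cancellation in a single select.
type chanMutex chan struct{}

func newChanMutex() chanMutex { return make(chanMutex, 1) }

// Lock blocks until the lock is acquired or ctx is canceled.
func (m chanMutex) Lock(ctx context.Context) error {
	select {
	case m <- struct{}{}: // acquired the single slot
		return nil
	case <-ctx.Done(): // the incoming request was canceled
		return ctx.Err()
	}
}

func (m chanMutex) Unlock() { <-m }

func main() {
	mu := newChanMutex()
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if err := mu.Lock(ctx); err != nil {
		fmt.Println("gave up waiting for the lock:", err)
		return
	}
	defer mu.Unlock()
	fmt.Println("lock held, doing work")
}
```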
I don’t think the lookup heuristic is cumbersome. It is only a single if condition and accounts for a total of 3 lines of code. I also don’t think fixed-size buckets solve it by themselves: my understanding is that they also need to synchronously trigger hash table growth on bucket overflow, because otherwise they would need to do some sort of probing at the bucket level, which has the same failure mode as individual record probing (a lost update for some unrelated record could cause an early probe termination). I don’t want to do synchronous resizing, to avoid issues with tail latencies, and incremental resizing is a lot of extra complexity. The two-store setup exists to avoid these problems.
Also, I hadn’t seen your excellent proposal! I just read over it and it’s exciting that we landed on such similar ideas.
That’s the target after compaction. The average load factor is going to be somewhere between 0.25 and 0.7 (the compaction threshold) which should be around 0.475 or so, giving an average of ~134 bytes: not much worse. I just checked a test node that has about 2 million pieces on it, and it has a load factor of approximately 0.4214.
It is difficult to get the records much smaller with my design because I want the record size to evenly divide the page size (4096), and so the only value smaller than 64 that works is 32, and the piece id is 32 bytes already. I could just accept not having the full piece id (16 bytes would be enough to guarantee no collisions, I think), but I don’t think I could also halve the rest of the metadata.
The reason for the daily rewrite is to do TTL and garbage collection even if the node is not getting new writes (if it’s full, for example). I think any design would have to do this.
This is a great point. Here are a couple of unorganized thoughts:
We can easily limit the number of log files compacted at once to make it more incremental
We could delay writing out the moved pieces until after the new hash table is written maybe?
This would mean some random writes into the hash table, but we could get tricky and pre-insert their entries at computed locations (this spooks me and seems fragile, though)
Log files are fairly naturally iterated over in reverse due to the records appended for reconstruction. Maybe reverse-linear is better than random, and would make it a bit easier.
Looking at the monitoring stats of that test node, it spent about 13% of its uptime compacting. I don’t think this value weighs strongly in either the “it needs to be faster” or “it doesn’t matter” directions, unfortunately.
This is by design. If the db is unable to keep up with writes, we need some form of backpressure. I think this should be rare, but I didn’t have any monitoring around the call that indicates when it happens, so I don’t know yet.
Great point and great idea. I wanted to ensure that the expanded hash table would be mostly linear inserts, but I messed up the details. I have taken your suggestion and I’m going to add a test to ensure it.
I also share concern about a large number of concurrent log files being written to, but I don’t know how to do it better.
Yep, the best part about this idea is that it’s completely backwards compatible: we can just start trying to keep data that expires on the same day in the same log files, and it will fall out naturally. It also doesn’t need to do a perfect job to get the benefits (a file that is 99% expired is about 99% as good as a file that expires 100% as you only need to rewrite 1% of the data) which increases implementation flexibility.
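As a rough illustration of the grouping (a purely hypothetical helper, not part of the proposal): a log-file group could be keyed by the piece’s expiration day, so everything expiring together lands in files that can simply be deleted once that day passes.

```go
package main

import (
	"fmt"
	"time"
)

// logGroupKey sketches the "log files per TTL day" idea: pieces expiring
// on the same day are steered into the same group of log files, so the
// whole file can be dropped instead of compacted once the day passes.
func logGroupKey(expiration time.Time) string {
	if expiration.IsZero() {
		return "no-ttl" // pieces without a TTL share a default group
	}
	return expiration.UTC().Format("2006-01-02")
}

func main() {
	exp := time.Now().Add(30 * 24 * time.Hour)
	fmt.Println("piece goes to log group:", logGroupKey(exp))
}
```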
We’ve also had thoughts to make compaction smarter about choosing which log files to compact by perhaps waiting to compact if the log contains a bunch of data that will expire soon. There’s a bunch of small improvements like that remaining.
Indeed, synchronous hashtable growth is part of my proposal. I don’t think it would affect tail latencies in any significant way in my proposal, though. First, hashtable growth is decoupled from data rewrite, so even growing a 2 GiB hashtable to 4 GiB, the pause only involves a sequential read of 2 GiB and a sequential write of 4 GiB when the node hits ~24 TB of storage at current piece sizes and Reed-Solomon’s k=29. Even on HDDs this would mean a pause of maybe a minute. Second, it would happen very rarely: dead records are removed as part of regular bucket writes, so you don’t need to wait for a full hashtable rewrite to get rid of dead records like with open addressing. So the hashtable grows only when it gets truly filled up—which will be less than once a month. A pause of a minute once a month should be acceptable.
Thank you! Though, I admit it is discouraging to work on a proposal like that and then see it not even remembered.
As far as I understand the code, it will attempt to resize the hash table every other day. I don’t see it reaching much beyond 0.5, because the moment it crosses this threshold, the next compaction will find that the desired lrec has grown. As such, this proposal will bounce between 0.25 and 0.5, while mine would bounce between ~0.4 and ~0.7.
Is there any obvious benefit for having record size an exact divisor of the page size?
Now I’ve noticed one more problem here: we might simply run out of disk space rewriting many big segments.
Sounds reasonable.
So, decoupling hashtable maintenance from log file maintenance? Makes sense too.
Yep, scary.
That’s a lot if you want this to run on hardware also used for other purposes.
In any case, I would consider scanning the hash table for pieces belonging to a single log file (only ~60k with the current average piece size), simply sorting them by offset, and using that order to write the new log file. Then sort them back by key and update the hash table in key order, in the hope that the OS will treat it as a sequential write.
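In Go-ish terms, the two passes would look roughly like this (a sketch using my own made-up record type, not the proposal’s):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// rec is a hypothetical hash-table record: enough to locate a piece
// inside a log file.
type rec struct {
	key    [32]byte
	logID  uint32
	offset int64
}

// planCompaction sketches the suggested order of operations: gather the
// records belonging to the one log file being compacted, copy the pieces
// in physical (offset) order to keep reads sequential, then update the
// hash table in key order so the OS sees mostly sequential writes.
func planCompaction(table []rec, target uint32) []rec {
	var moved []rec
	for _, r := range table {
		if r.logID == target {
			moved = append(moved, r)
		}
	}

	// pass 1: copy pieces out of the old log in physical order
	sort.Slice(moved, func(i, j int) bool { return moved[i].offset < moved[j].offset })
	// ... copy each piece into the new log file here, recording new offsets ...

	// pass 2: write the updated records back in key order
	sort.Slice(moved, func(i, j int) bool {
		return bytes.Compare(moved[i].key[:], moved[j].key[:]) < 0
	})
	return moved
}

func main() {
	table := []rec{
		{key: [32]byte{3}, logID: 7, offset: 2048},
		{key: [32]byte{1}, logID: 7, offset: 0},
		{key: [32]byte{2}, logID: 9, offset: 512},
	}
	for _, r := range planCompaction(table, 7) {
		fmt.Printf("key=%x offset=%d\n", r.key[:1], r.offset)
	}
}
```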
Moved this one to the end of my reply. In a potential corner case where the table fills up to the degree where an insert ends up taking a slot more than a page away, and a page somewhere between the desired slot and the actual slot is then hit by a bad write, we might end up with a lost record due to an early return in the Lookup function.
Not all team members read your proposal, but I received feedback when I shared it, and recently your proposal was brought up again when storagenode performance improvements were on the table. It might also be my failure, because I was sure that this hash table implementation was inspired by your suggestion, so I didn’t add it to the discussion again.
This sounds bad. 13% means about 3 hours every day spent reading old data and writing it to new locations. With the current implementation, the node does a bunch of reads and deletes (trash collection etc). At least to me the current implementation looks better, as there are fewer writes (and almost no writes if there is no ingress). The only annoyance is the used-space filewalker, but it can be disabled and run only sometimes.
As I understand, garbage collector and trash primarily does reads and deletes. This would do writes. If it replaces “a lot” of read/delete IO with “a lot or slightly less” read/write IO it would be worse IMO.
Wouldn’t the writes be bad for SMR drives?
I also am not convinced about why such redesign is needed to make garbage collection simpler. Probably the only thing that’s needed is a database of all piece IDs and their expiration dates (or whatever other information is used).
But what if the database gets out of sync with the actual files, or gets corrupted? Well, the same can be said about the new method. So you make the database update in sync with the file writes—and that is what the new method does.
Writing the pieces in big files that need periodic rewriting as pieces are deleted is the optional thing here.
For example - take the new method, but make each “big” file exactly one piece big, how would that affect everything?
I think the best test is real-life use. So keep improving the code and, when it’s finished, let us know so we can test it. Just don’t make it the default. Some may like it, some may not. If, after some months of production testing, all is well, maybe we can agree to make it the default.
The past shows us many defaults that were not tested enough and proved to be problematic in production.
hashstore (this proposal, taking the name from the golang package name used in this proposed change) trades off a good amount of random writes for more sequential writes. This, coupled with a zone-aware file system, should be a lot better for SMR drives.
It was definitely remembered. Like @Alexey, I also thought @zeebo’s design was at least partially informed by your proposal. @zeebo doesn’t typically do a lot of work on the storage node specifically, or he probably would have been aware of it also. It seems he just saw a need and came up with a proposed solution.
You may be right that in the steady state most compactions will be due to age rather than writes, making the average load factor higher.
A couple of things. It avoids torn writes for records that span sectors or filesystem blocks, etc., and there are no wasted bytes at the end of pages if you want to maintain page alignment.
The current value is closer to 7%, and I expect it to continue to drop. Additionally, this is before fixes to make hash table growth write linearly as you identified. This node has also accepted about 6MB/sec of piece data (about 50 pieces/sec) on average. Again, I don’t know what these values really mean or indicate yet. It’s all very early numbers in a single situation.
Hmm. I think in the worst case, for a full log file with 0-byte pieces, it could take about ~500 MB of RAM to sort them all by offset. This obviously can’t happen, but it’s good that it’s not too expensive. I think this would mean effectively writing the hash table twice: once for all of the unmoved pieces and once for all of the moved pieces, and it’s unclear to me if that’s worth avoiding the random reads.
Yep, this is indeed why it scans some extra pages for a missed lookup. Scanning at least 2 pages will find it with very high probability, is very easy to code and test, and costs ~nothing in the common case of no invalid records, because we almost never ask for pieces that don’t exist.
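A sketch of the idea (not the actual hashstore code; the constants and types below are made up): the probe only lets an empty slot end the search once at least two pages’ worth of slots have been examined, so a single page of lost writes cannot hide a record that was probed past it.

```go
package main

import "fmt"

const (
	pageSize      = 4096
	recordSize    = 64
	slotsPerPage  = pageSize / recordSize
	minProbeSlots = 2 * slotsPerPage // always look at least two pages deep
)

// record is a hypothetical open-addressing slot; valid=false means empty.
type record struct {
	valid bool
	key   uint64
	value uint64
}

// lookup probes linearly from the home slot, but an empty slot only
// terminates the search after at least two pages' worth of slots
// have been scanned.
func lookup(table []record, key uint64) (uint64, bool) {
	home := int(key % uint64(len(table)))
	for i := 0; i < len(table); i++ {
		slot := (home + i) % len(table)
		r := table[slot]
		if r.valid && r.key == key {
			return r.value, true
		}
		if !r.valid && i >= minProbeSlots {
			return 0, false // empty slot beyond the safety window: truly absent
		}
	}
	return 0, false
}

func main() {
	table := make([]record, 1024)
	table[5] = record{valid: true, key: 5, value: 42}
	fmt.Println(lookup(table, 5))
}
```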
Deletes are writes, but small ones. Deleting a 1GB file will generate very few writes to the filesystem. Combining a bunch of smaller files into a 1GB file will generate 1GB of writes to combine the files, then some more writes to delete the old ones.
That’s in addition to the writes already generated to hole punch the large files when the pieces were initially deleted.
It seems to me that this will increase the amount of writes.
I see. Still, if you could reduce the size of a record to, let’s say, 50 bytes and keep writes aligned to 4 kB pages to avoid torn writes, then even if you waste 46 bytes at the end of each page, you’re still storing 25% more records per page, and therefore reduce the size of the hash table by 20%.
Honestly I was thinking of compacting just one log file at a time, then the amount of memory needed is negligible.
I’m just worried about a corner case where we accidentally allow the hash table to fill up almost completely (only single free slots here and there). Then a piece might end up taking a slot far away from the desired one, and a lost page would result in the heuristic not reaching far enough to recover it.
And I/O troubles tend to be correlated: maybe the table filled up because the other hash table was being compacted, which took a lot of I/O to copy log file data, and the same heavy I/O also resulted in a lost write… let’s say the node got killed before the page got written because the high I/O prevented the node from clearing uploads fast enough.
Even though this sounds like something that shouldn’t happen on reliable hardware, the goal of this exercise is pushing the boundary on how cheap we can get with hardware.
In any case, my understanding is that everything so far looks very promising, and hopefully nobody sees my criticism here as a lack of support for the proposal. I think by refining some details we might end up with a faster and quite a bit more reliable network.