Looks like some improvements are in the works

I skimmed over the proposed change. I don’t have enough knowledge about golang to speak about, let’s say, types… or the synchronization primitives, though it does feel weird that golang does not provide them out of the box.

I’m not sure about some of the design choices. Please correct me where I’m wrong; as I wrote, this was just a cursory look at the code.

  1. Open addressing together with an attempt to accommodate bad writes leads to this cumbersome process of heuristic lookups. What I proposed in the past was fixed-size buckets; while we would then need to control the size of each bucket, there would be no need for heuristics (a rough sketch follows this list).
  2. The target of “just under” 25% load factor for hash table compaction means that the RAM used for caching works out to 64 bytes / 0.25 = 256 bytes per record after compaction. This is not much better than default ext4, and actually noticeably worse than optimized ext4. Given that the aim is to keep this data structure in fast storage, a higher target load factor should be acceptable. That’s also why I was fighting for each byte in my proposal: 43 bytes at a load factor of around 40% right after a fixed-bucket hash table resize (an empirical result from simulations; I couldn’t find a good derivation for this number) is around 108 bytes per record, and since we don’t need to rewrite the whole hash table on a daily basis, the effective load factor will be higher on average (around 56%, again from simulations). The arithmetic is spelled out in a sketch after this list.
  3. Many log files might need compaction in a single run. I can imagine a case from the summer testing where tens of these 10GB log files are all marked for compaction at once. With the current code this will lead to lots and lots of random reads, because iteration goes roughly in key order and not, say, in the physical order of pieces stored in the log files. As a result the process may become very time-consuming, and any interruption will require starting from scratch (see the reordering sketch after this list).
  4. The above problem is probably compounded by the two-store approach. If one of the stores is stuck in compaction, and the other fills up its hash table in that time, we can no longer do anything.
  5. Of the smaller things, scaling up the hash table seems to lead to lots of random writes into the new hash table. Assume the old hash table has N slots and the new one will have 2N. You want an item from old slot k to go into either slot 2k or 2k+1 (so, unless a slot is already taken, both candidates sit in a single page). This makes the hash table rewrite close to a sequential write, as going slot by slot will mostly write to nearby slots in the new table. Instead, in the current proposal an item goes to either slot k or slot k + N, depending on a single bit of the piece ID (so, one of two pages far from each other). A change from rec.index()&h.mask to rec.index() >> (64 - h.lrec) might be enough; there is a small demonstration after this list.
  6. And I’m still not sure about writing to potentially tens of log files concurrently. I see that what is proposed here will not be worse than what filestore does now, but my feeling is that we should be able to do better than that.
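
To illustrate point 1, here is a rough sketch of the fixed-size bucket layout I had in mind. All names and sizes are made up for illustration and not taken from the proposal; the point is that a lookup touches exactly one bucket, and a bad write only invalidates the slot it hit, so no lookup heuristics are needed.

```go
package sketch

const slotsPerBucket = 64 // e.g. one 4 KiB page of 64-byte records

// Slot is a single record slot; in practice a checksum would decide
// whether a slot survived a bad write, here a plain flag stands in for it.
type Slot struct {
	Key   [32]byte
	Valid bool
}

// Bucket is a fixed-size group of slots, sized to fit one page.
type Bucket [slotsPerBucket]Slot

// lookup touches exactly one bucket: hash the key to a bucket, scan its
// slots. A corrupted slot only affects itself, so no cross-bucket
// heuristics are needed.
func lookup(buckets []Bucket, hash uint64, key [32]byte) (Slot, bool) {
	b := &buckets[hash%uint64(len(buckets))]
	for _, s := range b {
		if s.Valid && s.Key == key {
			return s, true
		}
	}
	return Slot{}, false
}
```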
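
And the arithmetic behind the numbers in point 2, just to make it explicit. The record sizes and load factors are the ones quoted above; the last line is simply 43 bytes divided by the ~56% effective load factor from my simulations.

```go
package main

import "fmt"

// bytesPerRecord is the effective storage cost of one cached record:
// the record size divided by the hash table load factor.
func bytesPerRecord(recordSize, loadFactor float64) float64 {
	return recordSize / loadFactor
}

func main() {
	fmt.Println(bytesPerRecord(64, 0.25)) // proposal: 256 bytes per record after compaction
	fmt.Println(bytesPerRecord(43, 0.40)) // my proposal, right after a resize: ~108 bytes
	fmt.Println(bytesPerRecord(43, 0.56)) // my proposal, at the average effective load factor: ~77 bytes
}
```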
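
For point 3, this is roughly what I mean by reading in physical order. Record and its fields are made-up names and the real types in the proposal look different; the idea is just to sort the records of the log files selected for compaction by log file and offset before touching any piece data.

```go
package sketch

import "sort"

// Record is a made-up stand-in for a hash table entry.
type Record struct {
	LogFile uint32 // which log file the piece lives in
	Offset  uint64 // byte offset of the piece within that log file
	Length  uint32
}

// orderForCompaction sorts records by their physical location, so the
// subsequent reads become a mostly sequential pass over each log file
// instead of random reads driven by key order.
func orderForCompaction(records []Record) {
	sort.Slice(records, func(i, j int) bool {
		if records[i].LogFile != records[j].LogFile {
			return records[i].LogFile < records[j].LogFile
		}
		return records[i].Offset < records[j].Offset
	})
}
```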
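
And a small demonstration of point 5. The function names are mine; only the two expressions rec.index()&h.mask and rec.index() >> (64 - h.lrec) come from the proposal. With the high-bits variant an item from old slot k lands in 2k or 2k+1 after doubling; with the masked low bits it lands in k or k + N.

```go
package main

import "fmt"

// slotLowBits mimics what I believe the current proposal does: take the slot
// index from the low bits (rec.index() & h.mask). When the table doubles,
// an item in old slot k lands in either k or k+N, two locations far apart.
func slotLowBits(hash uint64, lrec uint) uint64 {
	mask := (uint64(1) << lrec) - 1
	return hash & mask
}

// slotHighBits is the suggested alternative: take the index from the top
// bits (rec.index() >> (64 - h.lrec)). When the table doubles (lrec+1), an
// item in old slot k lands in either 2k or 2k+1, i.e. the same or an
// adjacent page, so a slot-by-slot rewrite stays close to sequential.
func slotHighBits(hash uint64, lrec uint) uint64 {
	return hash >> (64 - lrec)
}

func main() {
	const oldLrec, newLrec = 20, 21 // table grows from 2^20 to 2^21 slots
	hash := uint64(0xDEADBEEF12345678)

	fmt.Println("low bits: ", slotLowBits(hash, oldLrec), "->", slotLowBits(hash, newLrec))
	fmt.Println("high bits:", slotHighBits(hash, oldLrec), "->", slotHighBits(hash, newLrec))
}
```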

On the positive side, this design will make it easy in the future to have a separate set of log files for TTL/trashed data. Say, log files per TTL day: then, if a customer with ingress patterns similar to the one tested over the summer shows up, it will be very quick to expire and remove all these pieces together, simply by removing whole log files (rough sketch below). I would say this property alone would be a killer feature over using some fancier file systems like zfs…
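
A rough sketch of what I mean by log files per TTL day; logFileForTTL is a hypothetical name, nothing like it exists in the proposed code.

```go
package sketch

import (
	"fmt"
	"time"
)

// logFileForTTL is a hypothetical helper: pieces that carry a TTL are
// appended to a log file named after their expiration day, so expiring
// them later means removing whole files rather than touching every piece.
func logFileForTTL(expires time.Time) string {
	return fmt.Sprintf("ttl-%s.log", expires.UTC().Format("2006-01-02"))
}
```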
