I’m fine with fragmentation! But reading your paper, I see quite a bit of focus on avoiding it: pre-allocation, extending only files smaller than 128MB, the claim that writes will be sequential (on the media), and so on.
If over the long haul your design devolves into fragmented pack files, then I think you lose your expected 5-10x performance advantage over using the file system for pieces:
a) for reads, you still have to read the piece index and then read the piece from the pack file. That may save you one seek, since your directory and inode data are combined, but IMO the OS is more likely to keep its dedicated directory and inode caches warm than your piece-index buffers, so without simulations it isn’t clear which will fare better.
Pieces do not become fragmented over time in your design, but the same is true if pieces are stored as individual files. (As an aside, I just realized that if pieces are stored as individual files today, there is already an average of 2K wasted per piece, since each file’s tail fills half of its last 4K block on average, so padding pieces to 4K for direct I/O in your design is equivalent.)
b) for writes, things start out nicely sequential (on the media), but it appears the pack files will eventually become fragmented anyway.
Here’s an outlandish idea: have you considered having one pre-allocated file with all the piece data and managing the free space within it yourself? If you use O_DIRECT, you can read and write pieces directly, avoiding the buffer cache and read-modify-write. You’d have to maintain a free-space bitmap, where each bit represents one 4K block.
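To make that concrete, here’s a minimal C sketch of the single pack file plus bitmap. All the names here (bitmap_t, blk_mark, read_piece, pieces.pack) are mine, not from your paper, and it glosses over persisting the bitmap itself. Note that O_DIRECT requires the buffer, the file offset, and the transfer length to all be block-aligned, which is exactly why padding pieces to 4K pays for itself here:

```c
/* Hypothetical sketch: one pre-allocated pack file, a bitmap with one
 * bit per 4K block, and O_DIRECT piece reads that bypass the cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

typedef struct {
    uint64_t *bits;    /* one bit per 4K block: 1 = in use, 0 = free */
    size_t    nblocks; /* total 4K blocks in the pack file */
} bitmap_t;

/* Used by the hole-scan sketch further down. */
static int blk_test(const bitmap_t *bm, size_t i) {
    return (int)((bm->bits[i / 64] >> (i % 64)) & 1);
}

/* Mark a run of blocks as used (1) or free (0). */
static void blk_mark(bitmap_t *bm, size_t first, size_t n, int used) {
    for (size_t i = first; i < first + n; i++) {
        if (used) bm->bits[i / 64] |=  (1ULL << (i % 64));
        else      bm->bits[i / 64] &= ~(1ULL << (i % 64));
    }
}

/* Read a piece straight off the media, bypassing the buffer cache. */
static ssize_t read_piece(int fd, void *buf, size_t blocks, size_t first_blk) {
    return pread(fd, buf, blocks * BLOCK_SIZE, (off_t)first_blk * BLOCK_SIZE);
}

int main(void) {
    int fd = open("pieces.pack", O_RDWR | O_DIRECT);
    if (fd < 0) return 1;

    void *buf;
    if (posix_memalign(&buf, BLOCK_SIZE, 4 * BLOCK_SIZE) != 0)  /* aligned for O_DIRECT */
        return 1;

    uint64_t words[16] = {0};
    bitmap_t bm = { words, 1024 };      /* a toy 4MB pack file */
    blk_mark(&bm, 0, 4, 1);             /* claim blocks 0-3 ... */
    read_piece(fd, buf, 4, 0);          /* ... and read that 16K piece back */

    free(buf);
    close(fd);
    return 0;
}
```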
When deciding where to start writing pieces, find the biggest hole and start there. When it’s full, repeat. Since piece I/O doesn’t have any locality of reference, it doesn’t matter where you write pieces: it’s always going to take a random read to get them.
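Finding the biggest hole is then just a linear scan for the longest run of clear bits, reusing the hypothetical bitmap_t and blk_test from the sketch above (a real version would scan a word at a time, but the idea is the same):

```c
/* Return the length in blocks of the biggest hole, and its starting
 * block via *start; returns 0 if the pack file is completely full. */
static size_t biggest_hole(const bitmap_t *bm, size_t *start) {
    size_t best_len = 0, best_start = 0;
    size_t run_len = 0, run_start = 0;

    for (size_t i = 0; i < bm->nblocks; i++) {
        if (!blk_test(bm, i)) {            /* free block extends the run */
            if (run_len == 0)
                run_start = i;
            if (++run_len > best_len) {
                best_len = run_len;
                best_start = run_start;
            }
        } else {
            run_len = 0;                   /* used block ends the run */
        }
    }
    *start = best_start;
    return best_len;
}
```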
This would also let you do incremental compaction. You could move pieces one at a time or relocate large runs of pieces at once. You’d have a lot of flexibility in the implementation, letting you optimize over a wide range of throughput/latency trade-offs.
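Relocating one piece is then just a read, a write, and two bitmap updates, again building on the sketch above. Here piece_index_update() is a stand-in for whatever structure maps pieces to offsets, and I’m ignoring fsync ordering and crash consistency for brevity:

```c
/* Move one piece of nblk 4K blocks from src_blk to dst_blk.
 * scratch must be 4K-aligned and nblk * 4K bytes (for O_DIRECT). */
static int move_piece(int fd, bitmap_t *bm,
                      size_t src_blk, size_t dst_blk, size_t nblk,
                      void *scratch) {
    ssize_t len = (ssize_t)(nblk * BLOCK_SIZE);

    if (pread(fd, scratch, (size_t)len, (off_t)src_blk * BLOCK_SIZE) != len)
        return -1;
    if (pwrite(fd, scratch, (size_t)len, (off_t)dst_blk * BLOCK_SIZE) != len)
        return -1;

    blk_mark(bm, dst_blk, nblk, 1);   /* claim the new location first, */
    /* piece_index_update(piece, dst_blk);  hypothetical index hook    */
    blk_mark(bm, src_blk, nblk, 0);   /* then release the old one      */
    return 0;
}
```

Since each move is independent, the rate at which you call something like move_piece becomes the knob: crank it up for fast compaction, or throttle it to keep foreground piece latency flat.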