Copying node data is abysmally slow

Half of the uploaded files are at or below 4kiB. ext4 blocks are 4kiB (mostly relevant for storage of inodes and directory entries). IIRC sqlite uses 8kiB pages. The difference between a 4kiB and a 16kiB write on HDDs is pretty small, as the seek time dominates, so up to 16kiB it’s more meaningful to talk about IOPS than bandwidth.
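A quick back-of-envelope sketch, assuming a typical 7200rpm HDD with roughly 10ms of seek plus rotational latency and roughly 150MB/s of sequential throughput (both numbers are assumptions for illustration, not measurements):

```go
package main

import "fmt"

func main() {
	// Assumed HDD characteristics (not measured; adjust to your drive).
	const seekMs = 10.0         // average seek + rotational latency, ms
	const throughputMBs = 150.0 // sequential throughput, MB/s

	for _, kib := range []float64{4, 16} {
		transferMs := kib * 1024 / (throughputMBs * 1e6) * 1000
		totalMs := seekMs + transferMs
		fmt.Printf("%4.0f kiB write: %.3f ms transfer + %.0f ms seek ~ %.2f ms total (~%.0f IOPS)\n",
			kib, transferMs, seekMs, totalMs, 1000/totalMs)
	}
}
```

Either way the drive tops out at roughly 100 such writes per second, which is why IOPS is the number that matters here, not bandwidth.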

Just to show the complexity, at least the parts I am aware of (I’m not claiming to be an expert in file system design): for a typical ext4 file system with an ordered journal, each upload involves the following operations (a rough code sketch of this write path follows the list):

  • Creation of an inode for the file (4kiB write + another write for the ext4 journal entry).
  • Write of the file contents (size of a piece + 512 bytes for a header, rounded up to 4kiB). This is a mostly sequential write, likely not fragmented, so I assume the extent tree fits in the inode itself (best case).
  • Write of a directory entry in the temp/ directory (4kiB write + another write for the ext4 journal).
  • sync(), forcing all of the above onto the drive. Here the journal, inodes, directory entries and file contents are unlikely to be placed next to each other, as the journal and inodes are preallocated, and the directory will probably already be allocated somewhere too.
  • Rename from temp/ to the actual storage directory, involving 2×4kiB directory entry writes (one removes it from temp, another creates it in the new place) + an 8kiB journal write covering both sides. In the case of some satellites, their storage directories are big enough to use an h-tree, so the update may require several page writes, but let’s consider the optimistic case here.
  • Update of the bandwidth.db database, so probably two 8kiB file writes (the transaction log and the main database file) + maybe extending the log file, so an update to the 4kiB transaction log file inode + maybe a 4kiB update to the file’s extent tree. I’m not sure, though, whether these are synced; this database is not that vital.
  • Update of the orders file; again, a file write, though this one is not synced, so it’s likely multiple uploads will be coalesced into a single write here.
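Not the actual storagenode code, just a minimal Go sketch of the create-write-sync-rename pattern the list describes; the paths, file names and piece size below are made up for illustration:

```go
package main

import (
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical layout, just for illustration; real storagenode paths differ.
	base := "storage"
	tempDir := filepath.Join(base, "temp")
	blobDir := filepath.Join(base, "blobs")
	for _, d := range []string{tempDir, blobDir} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			panic(err)
		}
	}
	tempPath := filepath.Join(tempDir, "piece.partial")
	finalPath := filepath.Join(blobDir, "piece.sj1")

	// A small piece: 4kiB of content plus a 512-byte header (sizes are illustrative).
	buf := make([]byte, 512+4*1024)

	// 1. Creating the file costs an inode write plus a temp/ directory entry,
	//    each with its journal counterpart.
	f, err := os.Create(tempPath)
	if err != nil {
		panic(err)
	}

	// 2. Write the piece contents: a mostly sequential write, rounded up to 4kiB blocks.
	if _, err := f.Write(buf); err != nil {
		panic(err)
	}

	// 3. Sync forces the data and the metadata above onto the drive before the
	//    upload is considered durable.
	if err := f.Sync(); err != nil {
		panic(err)
	}
	if err := f.Close(); err != nil {
		panic(err)
	}

	// 4. Rename from temp/ to the final directory: two directory-entry updates
	//    plus a journal write covering both.
	if err := os.Rename(tempPath, finalPath); err != nil {
		panic(err)
	}
}
```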

Here I assume the directory entries are already cached, so they don’t need to be read prior to writing. temp/ is accessed often enough to stay cached, and I suspect other directories are as well, as long as the machine has enough RAM.

I’m counting 10 write operations in the optimistic case, and I suspect it may go up to around 20 seeks (reads and writes) in many cases. Some might not be synced, allowing them to be merged across multiple uploads (but not during a single upload!). Except for one, they’ll likely all be 4kiB or 8kiB writes. There were some optimizations for the journal in recent kernels, making the journal writes smaller and potentially coalescable.
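For reference, here is how I arrive at 10, as a tally of the list above (my own reading of it; the exact set of writes will vary):

```go
package main

import "fmt"

func main() {
	// One upload's writes, per the list above (optimistic case, my tally).
	writes := map[string]int{
		"file inode + journal":                      2,
		"file contents":                             1,
		"temp/ directory entry + journal":           2,
		"rename: two directory entries + journal":   3,
		"bandwidth.db: main file + transaction log": 2,
	}
	total := 0
	for _, n := range writes {
		total += n
	}
	fmt.Println("writes per upload (optimistic):", total) // prints 10
	// The orders file write is excluded: it is not synced, so it is
	// likely coalesced across many uploads.
}
```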

Other than that, any file system that coalesces writes (e.g., directory writes with inode updates) will fragment its data structures (e.g., directory entries), so reads become slow. Slow reads mean a slow file walker, and we’ve seen reports of the file walker taking >24h. So sometimes coalescing writes is actually not desirable.
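To put the file walker numbers into perspective, assuming a node with around 10 million pieces and an HDD sustaining around 100 random reads per second (both figures made up for illustration), one uncached seek per piece already adds up to more than a day:

```go
package main

import "fmt"

func main() {
	// Illustrative numbers, not measurements.
	const pieces = 10_000_000 // pieces stored on the node
	const iops = 100.0        // random reads/s a single HDD can sustain

	hours := pieces / iops / 3600
	fmt.Printf("~%.0f hours for one file-walker pass at one seek per piece\n", hours)
}
```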
