On Tue, May 09, 2023 at 11:48:49PM -0700, Eric Biggers wrote: > What seems to be missing is any explanation for what we're actually getting from > this extremely unusual solution that cannot be gained any other way. What is > unique about bcachefs that it really needs something like this? Ok, as promised: Background: all metadata in bcachefs is a structured as key/value pairs, and there's a common key format for all keys. struct bkey { /* 3 byte header */ u8 u64s; /* size of k/v in u64s */ u8 format; /* packed/unpacked, needs_whiteout */ u8 type; /* value type */ u8 pad; /* * Order of fields below is for little endian, they're in * reverse order on big endian (and byte swabbed as necessary * when reading foreign endian metadata) * * Since field order matches byte order, the key can be treated * as one large multi word integer for doing comparisons: */ u96 version; /* nonces, send/recv support */ u32 size; /* size of extent keys */ /* Below are the field used for ordering/comparison: */ u32 snapshot; u64 offset; u64 inode; /* Value is stored inline with key */ struct bch_val v; }; sizeof(struct bkey) == 40. An extent value that has one pointer and no checksum is 8 bytes, with one pointer and one 32 bit checksum 16 bytes, for 56 bytes total (key included). But for a given btree node, most of the key fields will typically be redundandant. An extents leaf node might have extents for all one inode number or a small range of inode numbers, snapshots may or may not be in use, etc. - clearly some compression is desirable here. The key observation is that key compression is possible if we have a compression function that preserves order, and an order-preserving compression function is possible if it's allowed to fail. That means we do comparisons on packed keys, which lets us skip _most_ unpack operations, for btree node resorts and for lookups within a node. Packing works by defining a format with an offset and a bit width for each field, so e.g. if all keys in a btree node have the same inode number the packed format can specify that inode number and then a field width of 0 bits for the inode field. Continuing the arithmetic from before, a packed extent key will typically be only 8 or 16 bytes, or 24-32 including the val, which means bkey packing cuts our metadata size roughly in half. (It also makes our key format somewhat self describing and gives us a mechanism by which we could add or extend fields in the future). ----------------------------------------------------- As mentioned before, since packed bkeys are still multi-word integers we can do some important operations without unpacking, but to iterate over keys, compare packed & unpacked keys in resort, etc. - we'll still need to unpack, so we need this operation to be as fast as possible. bkey.c __bch2_bkey_unpack_key() is the unspecialized version written in C, that works on any archictecture. It loops over the fields in a bkey_format, pulling them out of the input words and adding back the field offsets. It's got the absolute minimum number of branches - one per field, when deciding to advance to the next input word - but it can't be branchless and it's a whole ton of shifts and bitops. dynamic codegen lets us produce unpack functions that are fully branchless and _much_ smaller. For any given btree node we'll have a format where multiple fields have 0 field with - i.e. those fields are always constants. That code goes away, and also if the format can be byte aligned we can eliminate shifts and bitopts. Code size for the dynamically compiled unpack functions is roughly 10% that of the unspecialized C version. I hope that addresses some of the "what is this even for" questions :) Cheers, Kent