Re: [PATCH 07/32] mm: Bring back vmalloc_exec

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Fri, 12 May 2023 14:36:13 -0400

On Tue, May 09, 2023 at 11:48:49PM -0700, Eric Biggers wrote:
> What seems to be missing is any explanation for what we're actually getting from
> this extremely unusual solution that cannot be gained any other way.  What is
> unique about bcachefs that it really needs something like this?

Ok, as promised:

Background: all metadata in bcachefs is a structured as key/value pairs,
and there's a common key format for all keys.

struct bkey {
	/* 3 byte header */
	u8		u64s;		/* size of k/v in u64s */
	u8		format;		/* packed/unpacked, needs_whiteout */
	u8		type;		/* value type */
	u8		pad;

	/*
	 * Order of fields below is for little endian, they're in
	 * reverse order on big endian (and byte swabbed as necessary
	 * when reading foreign endian metadata)
	 * 
	 * Since field order matches byte order, the key can be treated
	 * as one large multi word integer for doing comparisons:
	 */
	u96		version;	/* nonces, send/recv support */
	u32		size;		/* size of extent keys */

	/* Below are the field used for ordering/comparison: */
	u32		snapshot;	
	u64		offset;
	u64		inode;

	/* Value is stored inline with key */
	struct bch_val	v;
};

sizeof(struct bkey) == 40.

An extent value that has one pointer and no checksum is 8 bytes, with
one pointer and one 32 bit checksum 16 bytes, for 56 bytes total (key
included).

But for a given btree node, most of the key fields will typically be
redundandant. An extents leaf node might have extents for all one inode
number or a small range of inode numbers, snapshots may or may not be in
use, etc. - clearly some compression is desirable here.

The key observation is that key compression is possible if we have a
compression function that preserves order, and an order-preserving
compression function is possible if it's allowed to fail. That means we
do comparisons on packed keys, which lets us skip _most_ unpack
operations, for btree node resorts and for lookups within a node.

Packing works by defining a format with an offset and a bit width for
each field, so e.g. if all keys in a btree node have the same inode
number the packed format can specify that inode number and then a field
width of 0 bits for the inode field.

Continuing the arithmetic from before, a packed extent key will
typically be only 8 or 16 bytes, or 24-32 including the val, which means
bkey packing cuts our metadata size roughly in half.

(It also makes our key format somewhat self describing and gives us a
mechanism by which we could add or extend fields in the future).

-----------------------------------------------------

As mentioned before, since packed bkeys are still multi-word integers we
can do some important operations without unpacking, but to iterate over
keys, compare packed & unpacked keys in resort, etc. - we'll still need
to unpack, so we need this operation to be as fast as possible.

bkey.c __bch2_bkey_unpack_key() is the unspecialized version written in
C, that works on any archictecture. It loops over the fields in a
bkey_format, pulling them out of the input words and adding back the
field offsets. It's got the absolute minimum number of branches - one
per field, when deciding to advance to the next input word - but it
can't be branchless and it's a whole ton of shifts and bitops.

dynamic codegen lets us produce unpack functions that are fully
branchless and _much_ smaller. For any given btree node we'll have a
format where multiple fields have 0 field with - i.e. those fields are
always constants. That code goes away, and also if the format can be
byte aligned we can eliminate shifts and bitopts. Code size for the
dynamically compiled unpack functions is roughly 10% that of the
unspecialized C version.

I hope that addresses some of the "what is this even for" questions :)

Cheers,
Kent