The buddy allocator in a memdesc world

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 18 Jan 2024 22:33:04 +0000

Lately I've been thinking about the next step after folios -- shrinking
struct page.  In broad strokes, the plan is to turn struct page into

struct page {
	unsigned long memdesc;
};

where the bottom 4 bits is a type discriminator [1] and the rest of
the word is usually a pointer to a memory descriptor of some type.
More details at https://kernelnewbies.org/MatthewWilcox/Memdescs

[1] eg Buddy, File Folio, Anon Folio, Slab, Page Table, ...

Today the buddy allocator (not the PCP allocator; that is a subject
for a different thread) uses very little information in struct page.
It preserves page->flags for the node/zone/section, but does not use
that information itself.  It uses page->buddy_list to store prev/next
pointers to other pages of the same order, and it stores the order of
the page in page->private.

Naively, then, we could allocate a 32-byte memdesc for pages in the
buddy allocator.  But that turns out to be horrendously complicated
because we need to allocate new ones to split buddy pages, leading to
a solvable but ugly loop between the slab & page allocators trying to
allocate memory in order to allocate memory.  I wrote it all out
and deleteed it in disgust.  It is my strong preference to embed all
the information that buddy needs in the struct page.

That incentivises us to see what we can trim out and what we can compress
in order to shrink the buddy information as much as possible.  Here are
a range of options for that.

First, note that we don't need to preserve page->flags.  We can
reconstruct the node/zone/section when the page is allocated.
We just need to store the next/prev/order.

Option 1

struct buddy {
    unsigned long next;
    unsigned long prev;
};

struct page {
    union {
        unsigned long memdesc;
        struct buddy buddy;
    };
};

This gives us a 16 byte struct page. The 'next' field has its bottom four
bits used for the memdesc type for buddy pages. The bottom four bits of
'prev' can be used to store the order (assuming we don't need to go
past order 15 pages). The remaining bits of 'next' and 'prev' can be
used to store pointers to struct page, since struct page is aligned to
16 bits. But this doesn't work on 32-bit because struct page would only
be eight bytes. We could make it work by allocating two memdesc types
to the buddy allocator (eg types 1 and 9) so that it works for merely
8 byte alignment. But I think we have better options ...

Option 2

The same data structure, but store the PFN of the next/prev pages instead
of the pointer to the struct page. That gives us a lot more bits to
play with! On 32-bit, we can use 28 bits to support up to 1TB of memory
(theoretically possible with ARM LPAE). But we no longer have a limit of
order-15 pages as we know that PFNs are naturally aligned, and so we can
use a single bit [2] in prev to record what order the page is. And we
have three bits left over! On 64-bit, we have space for 60 bits of PFN
which works out to 4096 exabytes of memory (most 64-bit architectures
can support PFNs up to about 51 bits).

Option 3

We can compress option 2 on many 64-bit systems. For example, my laptop
and my phone have less than 2TB of memory. Instead of using a pair of
unsigned long, we can encode next/prev/order into a single 8-byte integer:

bits	meaning
0-3	memdesc type buddy
4	order encoding
5-33	next
34-62	prev
63	unused

That is 29 bits, letting us support up to 2TB systems with an 8 byte
memdesc. Assuming there's a decent Kconfig option to determine whether
it's OK to decline to support memory above 2TB ...

It is tempting to see if we can shrink memdesc to 4 bytes on 32-bit
systems, but we only get 13 bits for prev & next, limiting us to a 32MB
system. If it's worth the Kconfig option, it'll only be a few extra
lines of code to support it.

Option 4

Instead of using an absolute PFN, use a PFN relative to the base of the
zone. That would mean we need one extra zone per 2TB of memory, which
would expand the number of zones a little. But we could keep memdesc
at 8 bytes, even on the largest machines, which would save us a Kconfig
option for gargantuan machines.

(what is the practical limit on PFN today?  i see that Oracle Exadata
10M can be configured with 3TB of DRAM.  presumably anybody playing with
CXL has plans for more PFNs than that ...)

I'm keeping this document updated at
https://kernelnewbies.org/MatthewWilcox/BuddyAllocator
so feedback gratefully appreciated (and my thanks to those who
provided feedback on earlier drafts)

[2] https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder