Lately I've been thinking about the next step after folios -- shrinking struct page. In broad strokes, the plan is to turn struct page into

struct page {
	unsigned long memdesc;
};

where the bottom 4 bits are a type discriminator [1] and the rest of the word is usually a pointer to a memory descriptor of some type. More details at https://kernelnewbies.org/MatthewWilcox/Memdescs

[1] eg Buddy, File Folio, Anon Folio, Slab, Page Table, ...

Today the buddy allocator (not the PCP allocator; that is a subject for a different thread) uses very little information in struct page. It preserves page->flags for the node/zone/section, but does not use that information itself. It uses page->buddy_list to store prev/next pointers to other pages of the same order, and it stores the order of the page in page->private.

Naively, then, we could allocate a 32-byte memdesc for pages in the buddy allocator. But that turns out to be horrendously complicated because we need to allocate new memdescs in order to split buddy pages, leading to a solvable but ugly loop between the slab & page allocators trying to allocate memory in order to allocate memory. I wrote it all out and deleted it in disgust.

It is my strong preference to embed all the information that buddy needs in the struct page. That incentivises us to see what we can trim out and what we can compress in order to shrink the buddy information as much as possible. Here is a range of options for that.

First, note that we don't need to preserve page->flags. We can reconstruct the node/zone/section when the page is allocated. We just need to store the next/prev/order.

Option 1

struct buddy {
	unsigned long next;
	unsigned long prev;
};

struct page {
	union {
		unsigned long memdesc;
		struct buddy buddy;
	};
};

This gives us a 16 byte struct page. The 'next' field has its bottom four bits used for the memdesc type for buddy pages. The bottom four bits of 'prev' can be used to store the order (assuming we don't need to go past order-15 pages).
The remaining bits of 'next' and 'prev' can be used to store pointers to struct page, since struct page is aligned to 16 bytes. But this doesn't work on 32-bit because struct page would only be eight bytes. We could make it work by allocating two memdesc types to the buddy allocator (eg types 1 and 9) so that it works for merely 8 byte alignment. But I think we have better options ...

Option 2

The same data structure, but store the PFN of the next/prev pages instead of the pointer to the struct page. That gives us a lot more bits to play with! On 32-bit, we can use 28 bits to support up to 1TB of memory (theoretically possible with ARM LPAE). But we no longer have a limit of order-15 pages as we know that PFNs are naturally aligned, and so we can use a single bit [2] in prev to record what order the page is. And we have three bits left over!

On 64-bit, we have space for 60 bits of PFN, which works out to 4096 exabytes of memory (most 64-bit architectures can support PFNs up to about 51 bits).

Option 3

We can compress option 2 on many 64-bit systems. For example, my laptop and my phone have less than 2TB of memory. Instead of using a pair of unsigned long, we can encode next/prev/order into a single 8-byte integer:

bits	meaning
0-3	memdesc type (buddy)
4	order encoding
5-33	next
34-62	prev
63	unused

That is 29 bits for each PFN, letting us support up to 2TB systems with an 8 byte memdesc. Assuming there's a decent Kconfig option to determine whether it's OK to decline to support memory above 2TB ...

It is tempting to see if we can shrink memdesc to 4 bytes on 32-bit systems, but we only get 13 bits each for prev & next, limiting us to a 32MB system. If it's worth the Kconfig option, it'll only be a few extra lines of code to support it.

Option 4

Instead of using an absolute PFN, use a PFN relative to the base of the zone. That would mean we need one extra zone per 2TB of memory, which would expand the number of zones a little.
But we could keep memdesc at 8 bytes, even on the largest machines, which would save us a Kconfig option for gargantuan machines.

(What is the practical limit on PFN today? I see that the Oracle Exadata 10M can be configured with 3TB of DRAM. Presumably anybody playing with CXL has plans for more PFNs than that ...)

I'm keeping this document updated at https://kernelnewbies.org/MatthewWilcox/BuddyAllocator so feedback is gratefully appreciated (and my thanks to those who provided feedback on earlier drafts).

[2] https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder