Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier

Gregory Price <gourry@xxxxxxxxxx> · Fri, 7 Feb 2025 03:57:45 -0500

On Fri, Feb 07, 2025 at 04:20:24PM +0900, Byungchul Park wrote:
> On Sat, Feb 01, 2025 at 02:04:17PM +0000, Matthew Wilcox wrote:
> 
> We can work with from the easiest object

> e.g. page table

It's more efficient and easier to change page sizes than it is to make
page tables migratable.

It's also easier to reclaim cold pages eating up significantly more
memory than the page table (which describes pages at ~8 bytes per page).
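
Back-of-envelope for the two points above, as a rough sketch. It assumes
8-byte entries and only counts last-level entries (upper levels are
ignored since they are far smaller); `leaf_table_bytes` is a made-up
helper name, not anything in the kernel.

```python
KiB, MiB, GiB, TiB = 1 << 10, 1 << 20, 1 << 30, 1 << 40

PTE_BYTES = 8  # assumption: one 8-byte entry per mapped page


def leaf_table_bytes(mapped_bytes, page_size):
    """Bytes of last-level entries needed to map mapped_bytes."""
    pages = -(-mapped_bytes // page_size)  # ceiling division
    return pages * PTE_BYTES


# Compare the entry overhead for 1 TiB of mappings at different page sizes.
for page_size in (4 * KiB, 2 * MiB, 1 * GiB):
    overhead = leaf_table_bytes(1 * TiB, page_size)
    print(f"{page_size // KiB:>8} KiB pages: {overhead / MiB:10.3f} MiB of entries")
```

With 4KB pages the entries come to about 1/512 of the mapped memory;
moving the same mapping to 2MB pages shrinks that by another factor of
512, without touching how page tables are allocated.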

Also, there's quite a bit of literature that shows page tables landing
on remote nodes (cross-socket) has negative performance impacts.

Putting them on CXL makes the problem worse.

> struct page,

`struct page` is a structure that describes a physically addressed page.

It is common to access it by simply doing `pfn_to_page()`, which is a
fairly simple conversion (a bit more complex in sparsemem w/ sections).

This is used in a lockless manner to acquire page references all over
the kernel.
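
To illustrate why, here is a toy model of that conversion. This is not
the kernel's code: the constants are assumptions (64-byte `struct page`,
2^15 pages per section), the function names are invented, and the
sparsemem variant is simplified to a per-section base plus offset.

```python
STRUCT_PAGE_SIZE = 64    # assumption: common on 64-bit, config dependent
PFN_SECTION_SHIFT = 15   # assumption: 128MiB sections of 4KiB pages


def flat_pfn_to_page(memmap_base, pfn, start_pfn=0):
    """Flat model: one contiguous array of struct page, indexed by pfn."""
    return memmap_base + (pfn - start_pfn) * STRUCT_PAGE_SIZE


def sparse_pfn_to_page(section_memmaps, pfn):
    """Sparse model: pick the section's array, then index within it."""
    section = pfn >> PFN_SECTION_SHIFT
    offset = pfn & ((1 << PFN_SECTION_SHIFT) - 1)
    return section_memmaps[section] + offset * STRUCT_PAGE_SIZE


# Hypothetical base addresses, purely for illustration.
print(hex(flat_pfn_to_page(0x1000, 3)))
print(hex(sparse_pfn_to_page({0: 0x10000, 1: 0x900000}, (1 << 15) + 2)))
```

In both models the result is plain arithmetic on the pfn. Any caller can
compute the address at any time without taking a lock, so moving the
underlying memory would invalidate addresses that callers have already
computed.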

Making that migratable is... ambitious, to say the least.

> and kernel stack,

The default kernel stack size is like 16KB.  You'd need like 100,000
threads to eat up 1.5GB, and 2048 threads only eat like 32MB.

It's not an interesting amount of memory if you have a 20TB system.
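
The arithmetic, as a quick sketch. It assumes a flat 16KB stack per
thread and nothing else charged to the thread; `stack_total` is just an
illustrative helper.

```python
KiB, MiB, GiB, TiB = 1 << 10, 1 << 20, 1 << 30, 1 << 40

STACK_BYTES = 16 * KiB    # assumption: 16KB kernel stack per thread
SYSTEM_BYTES = 20 * TiB   # the 20TB system from the example above


def stack_total(threads):
    """Total bytes of kernel stack for the given number of threads."""
    return threads * STACK_BYTES


for threads in (2048, 100_000):
    total = stack_total(threads)
    print(f"{threads:>7} threads: {total / MiB:8.1f} MiB "
          f"({100 * total / SYSTEM_BYTES:.4f}% of 20 TiB)")
```

Even at 100,000 threads the stacks stay well under a hundredth of a
percent of the machine.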

> When it comes to this topic, the most important thing is the collected
> *direction* from the community so that we can start the work under the
> *direction*.
> 

My thoughts here are that memory tiering is the wrong tool for the
problem you are trying to solve.

Maybe there's a world in which we propose a ZONE_MEMDESC which is
exclusively used for `struct page` for a node. 

At least then you could design CXL capacities *around* that.

~Gregory