Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier

Byungchul Park <byungchul@xxxxxx> · Fri, 7 Feb 2025 19:14:49 +0900

On Fri, Feb 07, 2025 at 03:57:45AM -0500, Gregory Price wrote:
> On Fri, Feb 07, 2025 at 04:20:24PM +0900, Byungchul Park wrote:
> > On Sat, Feb 01, 2025 at 02:04:17PM +0000, Matthew Wilcox wrote:
> > 
> > We can start from the easiest objects, e.g. page table,
> 
> It's more efficient and easier to change page sizes than it is to make
> page tables migratable.

You are misunderstanding.  I didn't say 'do not change page sizes', and
I didn't say making page tables migratable is easier than changing page
sizes.  I said that *both* changing page sizes and making page tables
migratable could reduce ZONE_NORMAL cost.

> It's also easier to reclaim cold pages eating up significantly more
> memory than the page table (which describes pages at ~8 bytes per page).

Same.  We should keep reclaiming cold pages that eat up memory.  Why
would we give up reclaiming cold pages if page tables become migratable?
I really don't understand why you insist on picking exactly one approach
for that purpose.
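
For a rough sense of scale on the "~8 bytes per page" figure quoted
above, here is a back-of-the-envelope sketch.  It assumes 8-byte PTEs,
each page mapped exactly once, and it ignores the upper page table
levels; the function name and the 1 TiB example are illustrative only.

```python
# Back-of-the-envelope estimate of leaf page table size.
# Assumptions (not from the thread): 8-byte PTEs, each page mapped once,
# upper page table levels ignored.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def leaf_pte_bytes(mapped_bytes, page_size=4 * KIB, pte_size=8):
    """Bytes of leaf-level page table entries needed to map mapped_bytes."""
    pages = -(-mapped_bytes // page_size)  # ceiling division
    return pages * pte_size


if __name__ == "__main__":
    for page_size in (4 * KIB, 2 * MIB):
        cost = leaf_pte_bytes(1 * TIB, page_size)
        print(f"{page_size // KIB} KiB pages: "
              f"{cost / MIB:.0f} MiB of leaf PTEs per TiB mapped")
```

Under these assumptions, mapping 1 TiB costs about 2 GiB of leaf PTEs
with 4 KiB pages and about 4 MiB with 2 MiB pages.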

> Also, there's quite a bit of literature that shows page tables landing
> on remote nodes (cross-socket) has negative performance impacts.

Exactly.  That's the motivation for suggesting this topic, and that's
why we are asking about kernel object migratability.  Of course, we try
our best to place kernel objects in DRAM in the first place.  However,
the problem arises when that becomes impossible.  It then comes down to
a comparison between 'premature reclaim and dying (= OOM)' and 'slight
degradation of performance'.

> Putting them on CXL makes the problem worse.

No.  A higher chance of dying (OOM) is worse.

> > struct page,
> 
> `struct page` is a structure that describes a physically addressed page.
> 
> It is common to access it by simply doing `pfn_to_page()`, which is a
> fairly simple conversion (bit more complex in sparsemem w/ sections)
> 
> This is used in a lockless manner to acquire page references all over
> the kernel.
> 
> Making that migratable is... ambitious, to say the least.

Yes.  I don't think it's easy.

> > and kernel stack,
> 
> The default kernel stack size is like 16KB.  You'd need like 100,000
> threads to eat up 1.5GB, and 2048 threads only eat like 32MB.
> 
> It's not an interesting amount of memory if you have a 20TB system.

The kernel stack is just an example.  We can skip it and look for a
better candidate.
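
For reference, the stack figures quoted above can be checked with a few
lines of Python.  This is only a sketch: the 16 KiB stack size is the
figure from the quoted text, and treating "20TB" as 20 * 10^12 bytes is
an assumption.

```python
# Sanity check of the kernel stack arithmetic quoted above.
# Assumptions: 16 KiB per kernel stack (figure from the quoted text),
# "20TB" taken as 20 * 10**12 bytes.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
SYSTEM_BYTES = 20 * 10**12


def stack_bytes(threads, stack_size=16 * KIB):
    """Total memory consumed by kernel stacks for the given thread count."""
    return threads * stack_size


if __name__ == "__main__":
    for threads in (2048, 100_000):
        total = stack_bytes(threads)
        share = 100.0 * total / SYSTEM_BYTES
        print(f"{threads} threads: {total / MIB:.1f} MiB "
              f"({share:.4f}% of the system)")
```

With these inputs, 2048 threads come to 32 MiB and 100,000 threads to
roughly 1.5 GiB, which is well under 0.01% of a 20TB system.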

> > When it comes to this topic, the most important thing is the collected
> > *direction* from the community so that we can start the work under the
> > *direction*.
> > 
> 
> My thoughts here are that memory tiering is the wrong tool for the
> problem you are trying to solve.

I think all valid efforts can be considered at the same time.  Is there
any reason that efforts in a tiering environment should be excluded?

	Byungchul

> Maybe there's a world in which we propose a ZONE_MEMDESC which is
> exclusively used for `struct page` for a node. 
> 
> At least then you could design CXL capacities *around* that.
> 
> ~Gregory