On Tue, Oct 29, 2024 at 9:39 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> Hi David,
>
> On Fri, Oct 25, 2024 at 11:07:27PM -0700, David Rientjes wrote:
> > On Wed, 16 Oct 2024, David Rientjes wrote:
> >
> > > ----->o-----
> > > My takeaway: based on the feedback that was provided in the discussion:
> > >
> > > - we need an allocator abstraction for persistent memory that can return
> > >   memory with various characteristics: 1GB or not, kernel direct map or
> > >   not, HVO or not, etc.
> > >
> > > - built on top of that, we need the ability to carve out very large
> > >   ranges of memory (cloud provider use case) with NUMA awareness on the
> > >   kernel command line
> > >
> >
> > Following up on this, I think this physical memory allocator would also be
> > possible to use as a backend for hugetlb. Hopefully this would be an
> > allocator that would be generally useful for multiple purposes, something
> > like a mm/phys_alloc.c.
>
> Can you elaborate on this? mm/page_alloc.c already allocates physical
> memory :)
>
> Or you mean an allocator that will deal with memory carved out from what
> page allocator manages?
>
> > Frank van der Linden may also have thoughts on the above?

Yeah, 'physical allocator' is a bit of a misnomer. You're right, an
allocator that deals with memory not under page allocator control is a
better description.

To elaborate a bit: there are various scenarios where allocating
contiguous stretches of physical memory is useful: hugetlb, or VM guest
memory, for example. Or where you are presented with an external range of
VM_PFNMAP memory and need to manage it in a simple way and hand it out
for guest memory support (see NVidia's github for nvgrace-egm).

However, all of these cases may come with slightly different
requirements: is the memory purely external? Does it have struct pages?
If so, is it in the direct map? Is the memmap for the memory optimized
(HVO-style)? Does it need to be persistent? When does it need to be
zeroed out?
So that's why it seems like a good idea to come up with a slightly more
generalized version of a pool allocator - something that manages,
usually larger, chunks of physically contiguous memory. A pool is
initialized with certain properties (persistence, etc.), and it has
methods to grow and shrink it if needed. It's in no way meant to be
anywhere near as sophisticated as the page allocator - that would not
be useful (and pointless code duplication). A simple fixed-size chunk
pool will satisfy a lot of these cases.

A number of the building blocks are already there: there's CMA, and
there's ZONE_DEVICE, which has tools to manipulate some of these
properties (by going through a hotremove / hotplug cycle).

I created a simple prototype that essentially uses CMA as a pool
provider, and uses some ZONE_DEVICE tools to initialize memory however
you want it when it's added to the pool. I also added some new init
code to avoid things like unneeded memmap allocation at boot for
hugetlbfs pages. I put hugetlbfs on top of it - but in a restricted way
for prototyping purposes (no reservations, no demotion).

Anyway, this is the basic idea.

- Frank
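
PS: to make the shape of this a bit more concrete, here's a rough
userspace sketch of the kind of fixed-size chunk pool interface I have
in mind. All names here (phys_pool, PHYS_POOL_*) are invented for
illustration, and plain malloc()/calloc() stands in for a real provider
like CMA handing over physically contiguous ranges:

```c
/*
 * Hypothetical sketch only: phys_pool and PHYS_POOL_* are made-up
 * names, and the heap stands in for provider-owned physical memory.
 */
#include <stdlib.h>
#include <string.h>

/* Creation-time properties, mirroring the questions above. */
#define PHYS_POOL_PERSISTENT   (1u << 0) /* must survive reboot */
#define PHYS_POOL_NO_DIRECTMAP (1u << 1) /* chunks not in direct map */
#define PHYS_POOL_HVO          (1u << 2) /* memmap optimized, HVO-style */

struct phys_pool {
	unsigned int flags;
	size_t chunk_size;	/* fixed chunk size; think 2M or 1G */
	size_t nr_chunks;
	void **chunks;		/* stand-ins for contiguous ranges */
	unsigned char *in_use;
};

/* Grow the pool: ask the provider (here: the heap) for 'more' chunks. */
static int phys_pool_grow(struct phys_pool *p, size_t more)
{
	size_t n = p->nr_chunks + more;
	void **c = realloc(p->chunks, n * sizeof(*c));
	unsigned char *u = c ? realloc(p->in_use, n) : NULL;

	if (!c || !u)
		return -1;
	p->chunks = c;
	p->in_use = u;
	for (size_t i = p->nr_chunks; i < n; i++) {
		/* Zeroing at add time is one policy; could be at alloc. */
		c[i] = calloc(1, p->chunk_size);
		if (!c[i])
			return -1;
		u[i] = 0;
	}
	p->nr_chunks = n;
	return 0;
}

static int phys_pool_init(struct phys_pool *p, size_t chunk_size,
			  size_t nr_chunks, unsigned int flags)
{
	memset(p, 0, sizeof(*p));
	p->flags = flags;
	p->chunk_size = chunk_size;
	return phys_pool_grow(p, nr_chunks);
}

/* Hand out one fixed-size chunk, or NULL if the pool is exhausted. */
static void *phys_pool_alloc(struct phys_pool *p)
{
	for (size_t i = 0; i < p->nr_chunks; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = 1;
			return p->chunks[i];
		}
	}
	return NULL;	/* caller may phys_pool_grow() and retry */
}

static void phys_pool_free(struct phys_pool *p, void *chunk)
{
	for (size_t i = 0; i < p->nr_chunks; i++)
		if (p->chunks[i] == chunk)
			p->in_use[i] = 0;
}
```

In the real thing, grow/shrink would negotiate with the provider (CMA),
a shrink path would only release fully free chunks, and the per-pool
flags would drive the ZONE_DEVICE-style property manipulation when
memory is added - all of that is omitted here.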