RE: Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

Dan Williams <dan.j.williams@xxxxxxxxx> · Tue, 4 Apr 2023 22:00:03 -0700

Kyungsan Kim wrote:
> Hi Frank, 
> Thank you for your interest in this topic and for sharing your opinion.
> 
> >On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >>
> >> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
> >> > Given our experiences/design and industry's viewpoints/inquiries,
> >> > I will prepare a few slides in the session to explain
> >> >   1. Usecase - user/kernelspace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
> >> >   2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intended/unintended)
> >> >   3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
> >>
> >> I think you'll find everybody else in the room understands these issues
> >> rather better than you do.  This is hardly the first time that we've
> >> talked about CXL, and CXL is not the first time that people have
> >> proposed disaggregated memory, nor heterogenous latency/bandwidth
> >> systems.  All the previous attempts have failed, and I expect this
> >> one to fail too.  Maybe there's something novel that means this time
> >> it really will work, so any slides you do should focus on that.
> >>
> >> A more profitable discussion might be:
> >>
> >> 1. Should we have the page allocator return pages from CXL or should
> >>    CXL memory be allocated another way?
> >> 2. Should there be a way for userspace to indicate that it prefers CXL
> >>    memory when it calls mmap(), or should it always be at the discretion
> >>    of the kernel?
> >> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
> >>    with something new?
> >>
> >>
> >
> >Point 2 is what I proposed talking about here:
> >https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@xxxxxxxxxx/T/
> >
> >With the current cxl-as-numa-node model, an application can express a
> >preference through mbind(). But that also means that mempolicy and
> >madvise (e.g. MADV_COLD) are starting to overlap if the intention is
> >to use cxl as a second tier for colder memory.  Are these the right
> >abstractions? Might it be more flexible to attach properties to memory
> >ranges, and have applications hint which properties they prefer?
> 
> We also think more userspace hints would be useful for the diverse needs of applications.
> The specific interfaces still need to be discussed, though.
> 
> FYI, we have in fact also extended mbind() and set_mempolicy() to explicitly bind to DDR or CXL.
>   - mbind(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM) 
>   - set_mempolicy(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM)
> madvise() is also a candidate to express tiering intention.

Be careful to explain why node numbers are not sufficient, because
the bar for adding new userspace ABI is high.

Recall that ZONE id bits and NUMA id bits are both coming from
page->flags:

#define NODES_PGSHIFT           (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT           (ZONES_PGOFF * (ZONES_WIDTH != 0))
#define ZONES_MASK              ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK              ((1UL << NODES_WIDTH) - 1)

So when people declare that they are on "team ZONE" or "team NUMA" for
this solution they are both on "team page->flags".
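
To make the shared-bitfield point concrete, here is a minimal userspace
sketch of how a node id and a zone id end up packed into one flags word.
It is not the kernel code: the DEMO_* names and helper functions are
hypothetical, the widths (10 node bits, 3 zone bits) are picked only for
illustration since the real ones depend on the kernel config, and the
layout ignores the optional section and other upper fields.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative widths only; the real values depend on the kernel config. */
#define DEMO_FLAGS_BITS    64
#define DEMO_NODES_WIDTH   10
#define DEMO_ZONES_WIDTH   3

/* Node bits sit at the top of the word, zone bits directly below them. */
#define DEMO_NODES_PGOFF   (DEMO_FLAGS_BITS - DEMO_NODES_WIDTH)
#define DEMO_ZONES_PGOFF   (DEMO_NODES_PGOFF - DEMO_ZONES_WIDTH)

/* Same shape as the macros quoted above. */
#define DEMO_NODES_PGSHIFT (DEMO_NODES_PGOFF * (DEMO_NODES_WIDTH != 0))
#define DEMO_ZONES_PGSHIFT (DEMO_ZONES_PGOFF * (DEMO_ZONES_WIDTH != 0))
#define DEMO_NODES_MASK    ((UINT64_C(1) << DEMO_NODES_WIDTH) - 1)
#define DEMO_ZONES_MASK    ((UINT64_C(1) << DEMO_ZONES_WIDTH) - 1)

/* Store node and zone ids in the upper bits, leaving the low flag bits alone. */
static uint64_t demo_set_node_zone(uint64_t flags, unsigned int node,
                                   unsigned int zone)
{
	assert(node <= DEMO_NODES_MASK);
	assert(zone <= DEMO_ZONES_MASK);

	flags &= ~(DEMO_NODES_MASK << DEMO_NODES_PGSHIFT);
	flags &= ~(DEMO_ZONES_MASK << DEMO_ZONES_PGSHIFT);
	flags |= ((uint64_t)node & DEMO_NODES_MASK) << DEMO_NODES_PGSHIFT;
	flags |= ((uint64_t)zone & DEMO_ZONES_MASK) << DEMO_ZONES_PGSHIFT;
	return flags;
}

/* Recover the node id from the flags word. */
static unsigned int demo_flags_to_nid(uint64_t flags)
{
	return (unsigned int)((flags >> DEMO_NODES_PGSHIFT) & DEMO_NODES_MASK);
}

/* Recover the zone id from the same flags word. */
static unsigned int demo_flags_to_zonenum(uint64_t flags)
{
	return (unsigned int)((flags >> DEMO_ZONES_PGSHIFT) & DEMO_ZONES_MASK);
}
```

Both ids are just small integers carved out of the same word, so in
this sketch a wider zone field would take bits from everything packed
below the node field, and vice versa.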

Also have a look at the HMEM_REPORTING [1] interface and how it
enumerates performance properties from initiator nodes to target nodes.
There's no similar existing ABI for enumerating the performance of a
ZONE. This is just to point out the momentum behind numbers in
NODES_MASK having more meaning for conveying policy and enumerating
performance than numbers in ZONES_MASK.
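
For reference, a rough sketch of what consuming that enumeration looks
like from userspace, assuming the access0/initiators attribute layout
described in [1] (read_latency, write_latency, read_bandwidth,
write_bandwidth under each target node). The demo_* helpers are
hypothetical names, and the base directory is a parameter only so the
code can be exercised on a machine without these attributes.

```c
#include <stdio.h>

/* Default location of the per-node attributes, per the numaperf document. */
#define DEMO_NODE_SYSFS "/sys/devices/system/node"

/*
 * Build the path of one access0 attribute of a target node.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int demo_hmem_attr_path(char *buf, size_t len, const char *base,
                               int target_node, const char *attr)
{
	int n = snprintf(buf, len, "%s/node%d/access0/initiators/%s",
			 base, target_node, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one attribute as an integer.
 * Returns -1 if the attribute is absent or cannot be parsed.
 */
static long long demo_hmem_read(const char *base, int target_node,
                                const char *attr)
{
	char path[256];
	long long val;
	FILE *f;

	if (demo_hmem_attr_path(path, sizeof(path), base, target_node, attr))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

A caller would do something like
demo_hmem_read(DEMO_NODE_SYSFS, 1, "read_latency") and treat -1 as
"this platform did not report it". There is nothing equivalent to key
by zone.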

[1]: https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html