Kyungsan Kim wrote:
> Hi Frank,
> Thank you for your interest on this topic and remaining your opinion.
>
> >On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >>
> >> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
> >> > Given our experiences/design and industry's viewpoints/inquiries,
> >> > I will prepare a few slides in the session to explain
> >> >  1. Usecase - user/kernelspace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
> >> >  2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intended/unintended)
> >> >  3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
> >>
> >> I think you'll find everybody else in the room understands these issues
> >> rather better than you do.  This is hardly the first time that we've
> >> talked about CXL, and CXL is not the first time that people have
> >> proposed disaggregated memory, nor heterogenous latency/bandwidth
> >> systems.  All the previous attempts have failed, and I expect this
> >> one to fail too.  Maybe there's something novel that means this time
> >> it really will work, so any slides you do should focus on that.
> >>
> >> A more profitable discussion might be:
> >>
> >> 1. Should we have the page allocator return pages from CXL or should
> >>    CXL memory be allocated another way?
> >> 2. Should there be a way for userspace to indicate that it prefers CXL
> >>    memory when it calls mmap(), or should it always be at the discretion
> >>    of the kernel?
> >> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
> >>    with something new?
> >>
> >
> >Point 2 is what I proposed talking about here:
> >https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@xxxxxxxxxx/T/
> >
> >With the current cxl-as-numa-node model, an application can express a
> >preference through mbind(). But that also means that mempolicy and
> >madvise (e.g. MADV_COLD) are starting to overlap if the intention is
> >to use cxl as a second tier for colder memory.  Are these the right
> >abstractions? Might it be more flexible to attach properties to memory
> >ranges, and have applications hint which properties they prefer?
>
> We also think more userspace hints would be meaningful for diverse purposes of application.
> Specific interfaces need to be discussed, though.
>
> FYI in fact, we expanded mbind() and set_mempolicy() as well to explicitly bind DDR/CXL.
> - mbind(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM)
> - set_mempolicy(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM)
> madvise() is also a candidate to express tiering intention.

Need to be careful to explain why node numbers are not sufficient,
because the need for new userspace ABI is a high bar.

Recall that ZONE id bits and NUMA id bits are both coming from
page->flags:

    #define NODES_PGSHIFT           (NODES_PGOFF * (NODES_WIDTH != 0))
    #define ZONES_PGSHIFT           (ZONES_PGOFF * (ZONES_WIDTH != 0))

    #define ZONES_MASK              ((1UL << ZONES_WIDTH) - 1)
    #define NODES_MASK              ((1UL << NODES_WIDTH) - 1)

So when people declare that they are on "team ZONE" or "team NUMA" for
this solution they are both on "team page->flags".

Also have a look at the HMEM_REPORTING [1] interface and how it
enumerates performance properties from initiator nodes to target nodes.
There's no similar existing ABI for enumerating the performance of a
ZONE. This is just to point out the momentum behind numbers in
NODES_MASK having more meaning for conveying policy and enumerating
performance than numbers in ZONES_MASK.

[1]: https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html
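
To make the "team page->flags" point concrete, here is a minimal userspace sketch of how a node id and a zone id can share one flags word. It is not kernel code: the field widths (10 node bits, 3 zone bits, no section bits), the 64-bit flags word, and every sketch_* helper are assumptions chosen for illustration. The real widths and offsets depend on the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical widths for illustration only; real kernels derive
 * these from the build configuration. */
#define SKETCH_FLAGS_BITS 64
#define SECTIONS_WIDTH    0
#define NODES_WIDTH       10
#define ZONES_WIDTH       3

/* Fields are carved from the top of the flags word downward. */
#define SECTIONS_PGOFF (SKETCH_FLAGS_BITS - SECTIONS_WIDTH)
#define NODES_PGOFF    (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF    (NODES_PGOFF - ZONES_WIDTH)

/* Same shape as the macros quoted above: shift is 0 if the field is absent. */
#define NODES_PGSHIFT  (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT  (ZONES_PGOFF * (ZONES_WIDTH != 0))

#define ZONES_MASK     ((UINT64_C(1) << ZONES_WIDTH) - 1)
#define NODES_MASK     ((UINT64_C(1) << NODES_WIDTH) - 1)

/* Hypothetical helper: build a flags word carrying both ids. */
static inline uint64_t sketch_pack_flags(unsigned int nid, unsigned int zone)
{
    assert(nid <= NODES_MASK);
    assert(zone <= ZONES_MASK);
    return ((uint64_t)nid << NODES_PGSHIFT) |
           ((uint64_t)zone << ZONES_PGSHIFT);
}

/* Hypothetical helper: recover the node id from the flags word. */
static inline unsigned int sketch_page_to_nid(uint64_t flags)
{
    return (unsigned int)((flags >> NODES_PGSHIFT) & NODES_MASK);
}

/* Hypothetical helper: recover the zone id from the same flags word. */
static inline unsigned int sketch_page_zonenum(uint64_t flags)
{
    return (unsigned int)((flags >> ZONES_PGSHIFT) & ZONES_MASK);
}
```

Packing node 3 and zone 2 with sketch_pack_flags() and reading them back with the two accessors returns the same pair. Both ids are just small integers stored in adjacent bit fields of one word, so the choice between them is about which number already has policy and performance-enumeration ABI attached to it.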