On Tue, Apr 04, 2023 at 09:48:41PM -0700, Dan Williams wrote:
> Kyungsan Kim wrote:
> > We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
> > a random allocation of a kernel object by calling kmalloc() siblings makes the entire CXL DRAM unremovable.
> > Also, not all kernel objects can be allocated from ZONE_MOVABLE.
> >
> > ZONE_EXMEM does not confine a movability attribute(movable or unmovable), rather it allows a calling context can decide it.
> > In that aspect, it is the same with ZONE_NORMAL but ZONE_EXMEM works for extended memory device.
> > It does not mean ZONE_EXMEM support both movability and kernel object allocation at the same time.
> > In case multiple CXL DRAM channels are connected, we think a memory consumer possibly dedicate a channel for movable or unmovable purpose.
> >
>
> I want to clarify that I expect the number of people doing physical CXL
> hotplug of whole devices to be small compared to dynamic capacity
> devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
> device maps 1 or more thinly provisioned memory regions that have
> individual extents get populated and depopulated by a fabric manager.
>
> In that scenario there is a semantic where the fabric manager hands out
> 100G to a host and asks for it back, it is within the protocol that the
> host can say "I can give 97GB back now, come back and ask again if you
> need that last 3GB".

Presumably it can't give back arbitrary chunks of that 100GB? There's
some granularity that's preferred; maybe on 1GB boundaries or something?

> In other words even pinned pages in ZONE_MOVABLE are not fatal to the
> flow. Alternatively, if a deployment needs 100% guarantees that the host
> will return all the memory it was assigned when asked there is always
> the option to keep that memory out of the page allocator and just access
> it via a device. That's the role device-dax plays for "dedicated" memory
> that needs to be set aside from kernel allocations.
>
> This is to say something like ZONE_PREFER_MOVABLE semantics can be
> handled within the DCD protocol, where 100% unpluggability is not
> necessary and 97% is good enough.

This certainly makes life better (and rather more like hypervisor
shrinking than like DIMM hotplug), but I think fragmentation may well
mean that "only 3GB of 100GB allocated" results in being able to return
less than 50% of the memory, depending on granule size and exactly how
the allocations got chunked.
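
To put rough numbers on the fragmentation concern, here is a toy model
(plain Python, not kernel code). It assumes a granule can only be handed
back if no unmovable allocation touches it. The granule sizes (1GB,
256MiB, 32MiB), the layout of the pinned memory, and the helper name
returnable_mib() are all made up for illustration; nothing here is taken
from the CXL specification or from the page allocator.

```python
GIB = 1024  # MiB per GB (treating 1GB as 1024MiB for this sketch)


def returnable_mib(total_mib, granule_mib, pinned):
    """Capacity (MiB) in granules that contain no pinned memory.

    pinned is a list of (start_mib, length_mib) ranges that cannot move.
    Assumption: a granule is returnable only if it is completely free of
    pinned ranges.
    """
    blocked = set()
    for start, length in pinned:
        first = start // granule_mib
        last = (start + length - 1) // granule_mib
        blocked.update(range(first, last + 1))
    free_granules = total_mib // granule_mib - len(blocked)
    return free_granules * granule_mib


TOTAL = 100 * GIB

# The same 3GB of unmovable memory, laid out two hypothetical ways:
packed = [(0, 3 * GIB)]                         # one contiguous 3GB run
scattered = [(i * GIB, 32) for i in range(96)]  # 96 x 32MiB, one per GB

for name, layout in (("packed", packed), ("scattered", scattered)):
    for granule in (GIB, 256, 32):
        pct = 100 * returnable_mib(TOTAL, granule, layout) // TOTAL
        print(f"{name:9s} granule={granule:4d}MiB returnable={pct}%")
```

In this model the packed layout can return 97GB at every granule size.
The scattered layout can return only 4GB with 1GB granules, 76GB with
256MiB granules, and 97GB once the granule shrinks to the 32MiB chunk
size.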