RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

Kyungsan Kim <ks0204.kim@xxxxxxxxxxx> · Fri, 7 Apr 2023 18:30:07 +0900

>On 05.04.23 21:42, Dan Williams wrote:
>> Matthew Wilcox wrote:
>>> On Tue, Apr 04, 2023 at 09:48:41PM -0700, Dan Williams wrote:
>>>> Kyungsan Kim wrote:
>>>>> We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
>>>>> a random kernel object allocation through kmalloc() and its siblings makes the entire CXL DRAM unremovable.
>>>>> Also, not all kernel objects can be allocated from ZONE_MOVABLE.
>>>>>
>>>>> ZONE_EXMEM does not impose a movability attribute (movable or unmovable); rather, it lets the calling context decide.
>>>>> In that respect it is the same as ZONE_NORMAL, except that ZONE_EXMEM is for extended memory devices.
>>>>> This does not mean that ZONE_EXMEM supports both movability and kernel object allocation at the same time.
>>>>> When multiple CXL DRAM channels are connected, we think a memory consumer could dedicate a channel to either movable or unmovable use.
>>>>>
>>>>
>>>> I want to clarify that I expect the number of people doing physical CXL
>>>> hotplug of whole devices to be small compared to dynamic capacity
>>>> devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
>>>> device maps 1 or more thinly provisioned memory regions whose
>>>> individual extents get populated and depopulated by a fabric manager.
>>>>
>>>> In that scenario there is a semantic where the fabric manager hands out
>>>> 100G to a host and asks for it back, it is within the protocol that the
>>>> host can say "I can give 97GB back now, come back and ask again if you
>>>> need that last 3GB".
>>>
>>> Presumably it can't give back arbitrary chunks of that 100GB?  There's
>>> some granularity that's preferred; maybe on 1GB boundaries or something?
>> 
>> The device picks a granularity that can be tiny per spec, but it makes
>> the hardware more expensive to track in small extents, so I expect
>> something reasonable like 1GB, but time will tell once actual devices
>> start showing up.
>
>It all sounds a lot like virtio-mem using real hardware [I know, there 
>are important differences, but for the dynamic aspect there are very 
>similar issues to solve]
>
>For virtio-mem, the current best way to support hotplugging of large 
>memory to a VM to eventually be able to unplug a big fraction again is 
>using a combination of ZONE_MOVABLE and ZONE_NORMAL -- "auto-movable" 
>memory onlining policy. What's online to ZONE_MOVABLE can get (fairly) 
>reliably unplugged again. What's onlined to ZONE_NORMAL is possibly lost 
>forever.
>
>Like (incrementally) hotplugging 1 TiB to a 4 GiB VM. Being able to 
>unplug 1 TiB reliably again is pretty much out of scope. But the more 
>memory we can reliably get back the better. And the more memory we can 
>get in the common case, the better. With a ZONE_NORMAL vs. ZONE_MOVABLE 
>ratio of 1:3 one could unplug ~768 GiB again reliably. The remainder 
>depends on fragmentation on the actual system and the unplug granularity.
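
To make the 1:3 arithmetic above concrete, here is a rough sketch (plain
Python, not kernel code). The function name and parameters are hypothetical,
and it ignores the VM's boot memory, so it only approximates what an
"auto-movable" policy would actually do.

```python
def split_hotplugged(total_gib, normal_parts, movable_parts):
    """Split hotplugged capacity by a ZONE_NORMAL:ZONE_MOVABLE ratio.

    Hypothetical helper for illustration only. Returns
    (normal_gib, movable_gib); the ZONE_MOVABLE share is the part that
    can be unplugged again fairly reliably.
    """
    parts = normal_parts + movable_parts
    movable_gib = total_gib * movable_parts // parts
    normal_gib = total_gib - movable_gib
    return normal_gib, movable_gib


# 1 TiB (1024 GiB) hotplugged with a 1:3 ratio.
normal, movable = split_hotplugged(1024, 1, 3)
print(normal, movable)  # 256 768
```

The remaining 256 GiB in ZONE_NORMAL is the part whose unplug depends on
fragmentation and on the unplug granularity.
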
>
>The original plan was to use ZONE_PREFER_MOVABLE as a safety buffer to 
>reduce ZONE_NORMAL memory without increasing ZONE_MOVABLE memory (and 
>possibly harming the system). The underlying idea was that in many 
>setups that memory in ZONE_PREFER_MOVABLE would not get used for 
>unmovable allocations and it could, therefore, get unplugged fairly 
>reliably in these setups. For all other setups, unmovable allocations 
>could leak into ZONE_PREFER_MOVABLE and reduce the amount of memory we 
>could unplug again. But the system would try to keep unmovable 
>allocations to ZONE_NORMAL, so in most cases with some 
>ZONE_PREFER_MOVABLE memory we would perform better than with only 
>ZONE_NORMAL.

The memory hotplug mechanism would probably be separated into two stages: physical memory add/remove and logical memory on/offline[1].
We think ZONE_PREFER_MOVABLE could help logical memory on/offline. However, there would be a trade-off between physical add/remove and device utilization.
Consider ZONE_PREFER_MOVABLE allocations on switched CXL DRAM devices.
When pages are allocated evenly across the physical CXL DRAM devices, it would not help physical memory add/remove.
When pages are instead allocated sequentially, one physical CXL DRAM device after another, the opposite holds.
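
The trade-off can be illustrated with a toy model (plain Python, not kernel code).
The function names, the policy names "even" and "sequential", and the device and page counts are all assumptions made up for this sketch.
It places pages on a set of equally sized devices and counts how many devices stay completely untouched, since only those are easy candidates for physical removal.

```python
def place_pages(num_devices, pages_per_device, num_pages, policy):
    """Return per-device page counts after placing num_pages pages.

    Hypothetical model: "even" spreads pages round-robin over all devices,
    "sequential" fills one device completely before using the next.
    """
    if num_pages > num_devices * pages_per_device:
        raise ValueError("not enough capacity")
    used = [0] * num_devices
    if policy == "even":
        for i in range(num_pages):
            used[i % num_devices] += 1
    elif policy == "sequential":
        remaining = num_pages
        for dev in range(num_devices):
            take = min(remaining, pages_per_device)
            used[dev] = take
            remaining -= take
    else:
        raise ValueError("unknown policy: " + policy)
    return used


def untouched_devices(used):
    """Count devices that hold no pages at all."""
    return sum(1 for count in used if count == 0)


even = place_pages(4, 1000, 1500, "even")
seq = place_pages(4, 1000, 1500, "sequential")
print(even, untouched_devices(even))  # [375, 375, 375, 375] 0
print(seq, untouched_devices(seq))    # [1000, 500, 0, 0] 2
```

With even placement every device is in use, so all of them contribute bandwidth but none can be removed without migration.
With sequential placement two devices stay empty and removable, at the cost of concentrating the load on the first two.
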

ZONE_EXMEM provides provisioning of CXL DRAM devices[2], and we think the ZONE_PREFER_MOVABLE idea can be applied on top of that.
For example, preferred-movable pages could be managed per CXL DRAM device within the zone.

[1] https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#phases-of-memory-hotplug
[2] https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
>
>-- 
>Thanks,
>
>David / dhildenb