>Hi Mike,
>
>On 4/3/23 03:44, Mike Rapoport wrote:
>> Hi Dragan,
>>
>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>> Hi,
>>>>
>>>> [..]
>>>>> One problem we experienced occurred in the combination of the
>>>>> hot-remove and kernel-space allocation use cases.
>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow
>>>>> hot-remove because the kernel resides there all the time.
>>>>> ZONE_MOVABLE allows hot-remove due to page migration, but it only
>>>>> allows userspace allocation.
>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by
>>>>> adding the GFP_MOVABLE flag.
>>>>> In that case, oopses and system hangs occasionally occurred because
>>>>> ZONE_MOVABLE can be swapped.
>>>>> We resolved the issue using ZONE_EXMEM by allowing a selective
>>>>> choice between the two use cases.
>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the
>>>>> first PCIe-based device, which allows hot-pluggability, different
>>>>> RAS, and extended connectivity.
>>>>> So we thought it could be a graceful approach to add a new zone and
>>>>> manage the new features separately.
>>>>
>>>> This still does not describe what the use cases are that require
>>>> having kernel allocations on CXL.mem.
>>>>
>>>> I believe it's important to start with an explanation of *why* it is
>>>> important to have kernel allocations on removable devices.
>>>
>>> Hi Mike,
>>>
>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>> clustering and VM migration over cxl.mem [1].
>>>
>>> And in my mind, at least one reason that I can think of for having
>>> kernel allocations from cxl.mem devices is where you have multiple VH
>>> connections sharing the memory [2]. Where for example you have a user
>>> space application stored in cxl.mem, and then you want the metadata
>>> about this process/application that the kernel keeps on one hypervisor
>>> to be "passed on" to another hypervisor. So basically the same way
>>> processors in a single hypervisor cooperate on memory, you extend that
>>> across processors that span over physical hypervisors. If that makes
>>> sense...
>>
>> Let me reiterate to make sure I understand your example.
>> If we focus on the VM use case, your suggestion is to store the VM's
>> memory and associated KVM structures on a CXL.mem device shared by
>> several nodes.
>
>Yes, correct. That is what I am exploring, two different approaches:
>
>Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>approach the VM and the metadata execute/reside on a traditional NUMA
>node (cpu+dram) and only use CXL.mem to transition between hypervisors.
>It's not kept permanently there. So basically on hypervisor A you would
>do something along the lines of migrate_pages into cxl.mem, and then on
>hypervisor B you would migrate_pages from cxl.mem onto the regular NUMA
>node (cpu+dram).
>
>Approach 2: Use CXL.mem to cluster hypervisors to improve high
>availability of VMs. In this approach the VM and metadata would be kept
>in CXL.mem permanently, and each hypervisor accessing this shared memory
>could have the potential to schedule/run the VM if the other hypervisor
>experienced a failure.
>
>> Even putting aside the aspect of keeping KVM structures on presumably
>> slower memory,
>
>Totally agree, presumption of memory speed duly noted. As far as I am
>aware, CXL.mem at this point has higher latency than DRAM, and switched
>CXL.mem has additional latency. That may or may not change in the
>future, but even with the actual CXL-induced latency I think there are
>benefits to both approaches.
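For concreteness, below is a minimal userspace sketch of what the
Approach 1 hand-off could look like through libnuma's migrate_pages
wrapper. The node numbers and the program name are purely illustrative
assumptions (local DRAM as node 0, the shared CXL.mem range as node 2 on
both hypervisors); nothing here depends on ZONE_EXMEM.

/*
 * Illustrative sketch only: push a VM's pages from local DRAM to the
 * CXL.mem node on hypervisor A; hypervisor B runs the same call with
 * the nodes swapped to pull them back.  Node numbers are assumptions.
 *
 * Build: gcc -o vm_park vm_park.c -lnuma
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid-of-VM>\n", argv[0]);
		return 1;
	}
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	int pid = atoi(argv[1]);
	struct bitmask *from = numa_parse_nodestring("0");  /* local DRAM */
	struct bitmask *to   = numa_parse_nodestring("2");  /* CXL.mem    */
	if (!from || !to) {
		fprintf(stderr, "bad node string\n");
		return 1;
	}

	/* Move the target process's pages that currently sit on node 0
	 * over to node 2; returns the number of pages it could not move,
	 * or -1 on error. */
	int left = numa_migrate_pages(pid, from, to);
	if (left < 0)
		perror("numa_migrate_pages");
	else
		printf("pages not moved: %d\n", left);

	numa_bitmask_free(from);
	numa_bitmask_free(to);
	return 0;
}

This only moves the guest's user-visible pages, of course; the
kernel-side KVM structures discussed above are the part that would need
kernel allocations on CXL.mem.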
>In the example #1 above, I think even if you had a very noisy VM that
>is dirtying pages at a high rate, once migrate_pages has occurred, it
>wouldn't have to be quiesced for the migration to happen. A migration
>could basically occur in between the CPU slices: once a VCPU is done
>with its slice on hypervisor A, the next slice could be on hypervisor B.
>
>And in the example #2 above, you are trading memory speed for high
>availability, where either hypervisor A or B could run the CPU load of
>the VM. You could even have a VM where some of the VCPUs are executing
>on hypervisor A and others on hypervisor B, to be able to shift CPU load
>across hypervisors in quasi real-time.
>
>
>> what will ZONE_EXMEM provide that cannot be accomplished
>> by having the cxl memory in a memoryless node and using that node to
>> allocate VM metadata?
>
>It has crossed my mind to perhaps use NUMA node distance for the two
>approaches above. But I think that is not sufficient, because we can
>have varying distances, and distance in itself doesn't indicate
>switched/shared CXL.mem versus non-switched/non-shared CXL.mem. Strictly
>speaking just for myself here, with the two approaches above, the
>crucial differentiator in order for #1 and #2 to work would be that
>switched/shared CXL.mem would have to be indicated as such in some way,
>because switched memory would have to be treated and formatted in some
>kind of ABI way that would allow hypervisors to cooperate and follow
>certain protocols when using this memory.
>
>
>I can't answer what ZONE_EXMEM will provide since we haven't seen
>Kyungsan's talk yet; that's why I myself was very curious to find out
>more about the ZONE_EXMEM proposal and whether it includes some
>provisions for CXL switched/shared memory.
>
>To me, I don't think it makes a difference if pages are coming from
>ZONE_NORMAL or ZONE_EXMEM, but the part that I was curious about was
>whether I could allocate from or migrate_pages to (ZONE_EXMEM | type
>"SWITCHED/SHARED"). So it's not the zone that is crucial for me, it's
>the typing. That's what I meant with my initial response, but I guess it
>wasn't clear enough: "_if_ ZONE_EXMEM had some typing mechanism, in my
>case, this is where you'd have kernel allocations on CXL.mem"

Hi Dragan,

I'm sorry for the late reply; we are trying to reply thoroughly, though.

ZONE_EXMEM can be movable. A calling context is able to determine the
movability (movable/unmovable).

I'm not sure if it is related to the provision you have in mind, but
ZONE_EXMEM allows capacity and bandwidth aggregation across multiple CXL
DRAM channels. Multiple CXL DRAM devices can be grouped into a
ZONE_EXMEM, which can then be exposed as a single memory node [1].
With the number of CXL DRAM channels increasing through (multi-level)
switches and enhanced CXL server systems, we thought the kernel should
manage this seamlessly. Otherwise, userspace would see many nodes, and a
3rd-party tool such as numactl or libnuma would always be needed.
Of course, a CXL switch can do that part as well, but the HW and SW
means each have pros and cons in many ways, so we thought they could
coexist.

Also, given the composability expectation of CXL, I think memory sharing
among VM/KVM instances fits well with CXL. This is just a gut feeling
for now, but security and permission matters could possibly be handled
in the zone dimension.
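Coming back to the single memory node point above, here is an
illustrative libnuma sketch of the userspace difference. This is not
SMDK or ZONE_EXMEM code, and the node numbers are assumptions: say the
CXL channels appear as nodes 2-4 today, versus one aggregated node under
the proposed grouping.

/*
 * Illustrative only, not SMDK/ZONE_EXMEM code.  Assumes three CXL DRAM
 * channels show up as NUMA nodes 2-4 today, versus one aggregated node
 * 2 under the proposed grouping.  Node numbers are assumptions.
 *
 * Build: gcc -o exmem_sketch exmem_sketch.c -lnuma
 */
#include <stdio.h>
#include <numa.h>

#define SZ (64UL << 20)		/* 64 MiB */

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* Today: each CXL channel is its own node, so the application (or
	 * numactl/libnuma on its behalf) has to spread the allocation
	 * across the nodes itself to aggregate capacity and bandwidth. */
	struct bitmask *cxl = numa_parse_nodestring("2-4");
	void *spread = cxl ? numa_alloc_interleaved_subset(SZ, cxl) : NULL;

	/* With the channels grouped and exposed as a single memory node,
	 * the same aggregation needs only one target node. */
	void *single = numa_alloc_onnode(SZ, 2);

	printf("interleaved=%p single-node=%p\n", spread, single);

	if (spread)
		numa_free(spread, SZ);
	if (single)
		numa_free(single, SZ);
	if (cxl)
		numa_bitmask_free(cxl);
	return 0;
}

Either way the application gets the combined capacity; the difference is
whether the spreading policy lives in userspace or behind a single node.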
In general, given the nature of CXL (PCIe based) and its topology
expansion (direct -> switches -> fabrics), we carefully expect that more
functionality and performance matters will be raised. We have proposed
ZONE_EXMEM as a separate logical management dimension for extended
memory types, which as of now means CXL DRAM. To help clarify, please
find the slides that explain our proposal [2].

[1] https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
[2] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL

>
>
>Sorry if it got long, hope that makes sense... :)
>
>
>>
>>> [1] A high-level explanation is at http://nil-migration.org/
>>> [2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51,
>>> figure 1-4, black color scheme circle(3) and bars.
>>>
>