Hi Mike,
On 4/3/23 03:44, Mike Rapoport wrote:
Hi Dragan,
On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
On 3/26/23 02:21, Mike Rapoport wrote:
Hi,
[..] >> One problem we experienced occurred in the combination of
hot-remove and kernelspace allocation usecases.
ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because the kernel resides there all the time.
ZONE_MOVABLE allows hot-remove due to page migration, but it only allows userspace allocation.
Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding the GFP_MOVABLE flag.
In that case, oopses and system hangs occasionally occurred because ZONE_MOVABLE can be swapped.
We resolved the issue using ZONE_EXMEM by allowing a selective choice between the two usecases.
As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe-based device, which allows hot-plugging, different RAS, and extended connectivity.
So, we thought it could be a graceful approach to add a new zone and manage the new features separately.
This still does not describe what the use cases are that require having
kernel allocations on CXL.mem.
I believe it's important to start with explanation *why* it is important to
have kernel allocations on removable devices.
Hi Mike,
not speaking for Kyungsan here, but I am starting to tackle hypervisor
clustering and VM migration over cxl.mem [1].
And in my mind, at least one reason that I can think of having kernel
allocations from cxl.mem devices is where you have multiple VH connections
sharing the memory [2]. Where for example you have a user space application
stored in cxl.mem, and then you want the metadata about this
process/application that the kernel keeps on one hypervisor be "passed on"
to another hypervisor. So basically, the same way processors in a single
hypervisor cooperate on memory, you extend that across processors that span
physical hypervisors. If that makes sense...
Let me reiterate to make sure I understand your example.
If we focus on the VM usecase, your suggestion is to store a VM's memory and
associated KVM structures on a CXL.mem device shared by several nodes.
Yes correct. That is what I am exploring, two different approaches:
Approach 1: Use CXL.mem for VM migration between hypervisors. In this
approach the VM and the metadata executes/resides on a traditional
NUMA node (cpu+dram) and only uses CXL.mem to transition between
hypervisors. It's not kept permanently there. So basically on
hypervisor A you would do something along the lines of migrate_pages
into cxl.mem and then on hypervisor B you would migrate_pages from
cxl.mem and onto the regular NUMA node (cpu+dram).
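For what it's worth, a toy userspace sketch of the hypervisor A side of Approach 1, using the migrate_pages(2) syscall. To be clear about the assumptions: the node IDs (0 = local DRAM, 1 = CXL.mem) are made up for illustration, the syscall number is x86-64 specific, and the nodemask handling is simplified to a single 64-bit word:

```python
# Sketch only: push all of a process's pages from a DRAM node to a
# (hypothetical) CXL.mem node via migrate_pages(2).
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
SYS_MIGRATE_PAGES = 256  # x86-64 syscall number; arch-specific assumption

DRAM_NODE, CXL_NODE = 0, 1  # made-up node IDs for illustration

def migrate_all_pages(pid: int, from_node: int, to_node: int) -> int:
    """Ask the kernel to move pid's pages from from_node onto to_node."""
    old_mask = ctypes.c_ulong(1 << from_node)
    new_mask = ctypes.c_ulong(1 << to_node)
    # maxnode = 64: each nodemask here is a single 64-bit word.
    rc = libc.syscall(SYS_MIGRATE_PAGES, pid, 64,
                      ctypes.byref(old_mask), ctypes.byref(new_mask))
    if rc < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return rc  # number of pages that could not be moved

if __name__ == "__main__":
    try:
        left = migrate_all_pages(os.getpid(), DRAM_NODE, CXL_NODE)
        print(f"pages not moved: {left}")
    except OSError as e:
        # Expected on hosts without a second NUMA node (or under seccomp).
        print(f"migrate_pages failed: {e.strerror}")
```

Hypervisor B would then do the reverse with the two masks swapped, pulling the pages off cxl.mem onto its own cpu+dram node.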
Approach 2: Use CXL.mem to cluster hypervisors to improve high
availability of VMs. In this approach the VM and metadata would be
kept in CXL.mem permanently and each hypervisor accessing this shared
memory could have the potential to schedule/run the VM if the other
hypervisor experienced a failure.
Even putting aside the aspect of keeping KVM structures on presumably
slower memory,
Totally agree, the presumption about memory speed is duly noted. As far as I am
aware, CXL.mem at this point has higher latency than DRAM, and
switched CXL.mem adds additional latency. That may or may not change
in the future, but even with the actual CXL-induced latency I think there
are benefits to both approaches.
In example #1 above, I think even if you had a very noisy VM that
is dirtying pages at a high rate, once migrate_pages has occurred, the VM
wouldn't have to be quiesced for the migration to happen. A migration
could basically occur in between CPU slices: once a VCPU is done
with its slice on hypervisor A, the next slice could be on hypervisor
B.
And in example #2 above, you are trading memory speed for
high availability, where either hypervisor A or B could run the CPU
load of the VM. You could even have a VM where some of the VCPUs
execute on hypervisor A and others on hypervisor B, to be able to
shift CPU load across hypervisors in quasi real-time.
what will ZONE_EXMEM provide that cannot be accomplished
with having the cxl memory in a memoryless node and using that node to
allocate VM metadata?
It has crossed my mind to perhaps use NUMA node distance for the two
approaches above. But I think that is not sufficient, because we can
have varying distances, and distance in itself doesn't indicate
switched/shared CXL.mem versus non-switched/non-shared CXL.mem. Strictly
speaking just for myself here, with the two approaches above, the
crucial differentiator for #1 and #2 to work would be that
switched/shared CXL.mem is indicated as such in some way.
That's because switched memory would have to be treated and formatted in some
kind of ABI way that allows hypervisors to cooperate and follow
certain protocols when using this memory.
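To illustrate what I mean by an ABI, here is a toy sketch of the sort of header two hypervisors might agree on for a shared region: a magic number, an owner id, and a heartbeat the standby side watches before taking over. Every field name, offset, and the takeover rule are invented for illustration; a real protocol would also need atomics, fencing, and coherence guarantees from the CXL fabric:

```python
# Sketch of a hypothetical shared-region header for hypervisor clustering.
import struct

HDR = struct.Struct("<4sII")  # magic, owner_id, heartbeat counter
MAGIC = b"CXLM"               # invented magic value

def format_region(buf: bytearray, owner: int) -> None:
    """Owner formats the shared window with the agreed-upon header."""
    HDR.pack_into(buf, 0, MAGIC, owner, 0)

def heartbeat(buf: bytearray, owner: int) -> None:
    """Live owner periodically bumps its heartbeat counter."""
    magic, cur, beat = HDR.unpack_from(buf, 0)
    assert magic == MAGIC and cur == owner
    HDR.pack_into(buf, 0, MAGIC, owner, beat + 1)

def try_takeover(buf: bytearray, new_owner: int, last_seen_beat: int) -> bool:
    """Standby claims the region only if the heartbeat has not advanced."""
    magic, cur, beat = HDR.unpack_from(buf, 0)
    if magic == MAGIC and beat == last_seen_beat:
        HDR.pack_into(buf, 0, MAGIC, new_owner, beat)
        return True
    return False

if __name__ == "__main__":
    region = bytearray(4096)        # stand-in for the shared CXL.mem window
    format_region(region, owner=1)  # hypervisor A formats and owns it
    heartbeat(region, 1)
    _, _, seen = HDR.unpack_from(region, 0)
    # Hypervisor B: A's heartbeat moved since the stale observation, stay back.
    assert not try_takeover(region, 2, last_seen_beat=seen - 1)
    # Hypervisor B: A went silent at `seen`, so claim the region.
    assert try_takeover(region, 2, last_seen_beat=seen)
    print("owner is now", HDR.unpack_from(region, 0)[1])
```

The point is only that both sides have to agree on this layout and on the takeover rule up front, which is exactly the kind of typing/formatting information plain NUMA distance can't carry.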
I can't answer what ZONE_EXMEM will provide since we haven't seen
Kyungsan's talk yet; that's why I myself was very curious to find out
more about the ZONE_EXMEM proposal and whether it includes some provisions for
CXL switched/shared memory.
To me, it doesn't make a difference whether pages come from
ZONE_NORMAL or ZONE_EXMEM, but the part I was curious about was
whether I could allocate from, or migrate_pages to, (ZONE_EXMEM | type
"SWITCHED/SHARED"). So it's not the zone that is crucial for me, it's
the typing. That's what I meant in my initial response, but I guess
it wasn't clear enough: "_if_ ZONE_EXMEM had some typing mechanism, in
my case, this is where you'd have kernel allocations on CXL.mem"