Hi Gregory-
On 4/7/23 19:05, Gregory Price wrote:
On Fri, Apr 07, 2023 at 04:05:31PM -0500, Dragan Stancevic wrote:
Hi folks-
if it's not too late for the schedule...
I am starting to tackle VM live migration and hypervisor clustering over
switched CXL memory[1][2], intended for cloud virtualization workloads.
I'd be interested in doing a small BoF session with some slides and
getting into a discussion/brainstorm with other people who deal with
VM/LM cloud workloads. Among other things to discuss would be page
migration over switched CXL memory, a shared in-memory ABI to allow VM
hand-off between hypervisors, etc.
A few of us discussed some of this under the ZONE_XMEM thread, but I figured
it might be better to start a separate thread.
If there is interest, thank you.
[1]. High-level overview available at http://nil-migration.org/
[2]. Based on CXL spec 3.0
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla
I've been chatting about this with folks offline, figure I'll toss my
thoughts on the issue here.
excellent brain dump, thank you
Some things to consider:
1. If secure-compute is being used, then this mechanism won't work, as
pages will be pinned, therefore not movable, and excluded from using
CXL memory at all.
This issue does not exist with traditional live migration, because
typically some kind of copy is used from one virtual space to another
(e.g. RDMA), so pages aren't really migrated in the kernel memory
block/NUMA-node sense.
Right, agreed... I don't think we can migrate in all scenarios, such as
pinning or forms of pass-through, etc. My opinion, just to start off,
would be that as a base requirement the pages be movable.
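As a minimal sketch of that base requirement, assuming a kernel-side
filter (nil_can_migrate_folio() is a hypothetical helper; the predicates
it calls are existing kernel ones, but the policy is an assumption):

/*
 * Hedged sketch: skip folios that can never move to CXL memory.
 * folio_maybe_dma_pinned() can false-positive on very high refcounts,
 * so this errs on the side of leaving pages where they are.
 */
#include <linux/mm.h>
#include <linux/page-flags.h>

static bool nil_can_migrate_folio(struct folio *folio)
{
	/* FOLL_PIN'd pages (secure compute, RDMA buffers) must stay put. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/* Only LRU-managed, evictable folios behave like ZONE_MOVABLE. */
	if (!folio_test_lru(folio) || folio_test_unevictable(folio))
		return false;

	return true;
}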
2. During the migration process, the memory needs to be forced not to be
migrated to another node by other means (tiering software, swap,
etc). The obvious way of doing this would be to migrate and
temporarily pin the page... but going back to problem #1 we see that
ZONE_MOVABLE and Pinning are mutually exclusive. So that's
troublesome.
Yeah, true. I'd have to check the code, but I wonder if perhaps we could
mapcount or refcount the pages upon migration onto CXL switched memory.
If my memory serves me right, wouldn't move_pages back off or stall?
I guess it's TBD how workable or useful that would be, but it's good to
be thinking of different ways of doing this.
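A hedged sketch of the refcount idea (nil_hold()/nil_release() are
hypothetical names): migration freezes the folio refcount against the
expected count, so holding an extra, unaccounted reference makes
move_pages/migrate_pages back off with -EAGAIN instead of moving the
page, without going through FOLL_PIN:

#include <linux/mm.h>

/* Take an extra reference for the hand-off window; migration's
 * refcount freeze then fails and the page stays put. */
static void nil_hold(struct folio *folio)
{
	folio_get(folio);
}

/* Drop the extra reference; migration can proceed again. */
static void nil_release(struct folio *folio)
{
	folio_put(folio);
}

Because this isn't FOLL_PIN, it wouldn't collide with the
ZONE_MOVABLE-vs-pinning exclusion from problem #1, though whether that
would be acceptable to the mm folks is an open question.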
3. This is changing the semantics of migration from a virtual memory
movement to a physical memory movement. Typically you would expect
the RDMA process for live migration to work something like...
a) migration request arrives
b) source host informs destination host of size requirements
c) destination host allocates memory and passes a Virtual Address
back to source host
d) source host initiates an RDMA from HostA-VA to HostB-VA
e) CPU task is migrated
Importantly, the allocation of memory by Host B handles the step of
creating HVA->HPA mappings, and the Extended/Nested Page Tables can
simply be flushed and re-created after the VM is fully migrated.
Too long; didn't read: live migration is a virtual address operation,
and node migration is a PHYSICAL address operation; the virtual
addresses remain the same.
This is problematic, as it's changing the underlying semantics of the
migration operation.
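To make the distinction concrete, a minimal userspace sketch with
move_pages(2): the buffer's virtual address never changes, only the
node backing it does (node 1 standing in for a CXL node is an
assumption; link with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	void *buf;

	if (posix_memalign(&buf, page_sz, page_sz))
		return 1;
	((char *)buf)[0] = 42;			/* fault the page in */

	void *pages[1]  = { buf };
	int   nodes[1]  = { 1 };		/* assumed CXL node id */
	int   status[1] = { -1 };

	/* Physical move; buf is the same virtual address afterwards. */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	else
		printf("page on node %d, still mapped at %p\n",
		       status[0], buf);
	return 0;
}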
Those are all valid points, but what if you don't need to recreate
HVA->HPA mappings? If I am understanding the CXL 3.0 spec correctly,
then neither the virtual nor the physical addresses would have to
change, because the fabric "virtualizes" host physical addresses and
the translation is done by the G-FAM/GFD, which has the capability to
translate multi-host HPAs to its internal DPAs. So if you have two
hypervisors seeing the device physical address as the same physical
address, that might work?
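Even if the HPAs don't end up identical on both hosts, the two sides
only need to agree on an offset into the shared region; a hedged sketch
of that "relative" address math (all names here are assumptions):

#include <stdint.h>

/* Hypothetical per-host view of one shared CXL region: each host may
 * map the same device memory at a different HPA base. */
struct nil_region_view {
	uint64_t hpa_base;	/* where this host's CFMW maps the region */
	uint64_t len;		/* region size in bytes */
};

/* This host's HPA -> device-relative offset both hosts agree on. */
static inline uint64_t nil_hpa_to_offset(const struct nil_region_view *v,
					 uint64_t hpa)
{
	return hpa - v->hpa_base;
}

/* ...and back, using the destination host's own mapping. */
static inline uint64_t nil_offset_to_hpa(const struct nil_region_view *v,
					 uint64_t off)
{
	return v->hpa_base + off;
}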
Problem #1 and #2 are head-scratchers, but maybe solvable.
Problem #3 is the meat and potatoes of the issue in my opinion. So
let's consider that a little more closely.
Generically: NIL Migration is basically a pass-by-reference operation.
Yup, agreed
The reference in this case is... the page tables. You need to know how
to interpret the data in the CXL memory region on the remote host, and
that's a "relative page table translation" (to coin a phrase? I'm not
sure how to best describe it).
Right, coining phrases... I have been thinking of a "super-page" (for
lack of a better word): a metadata region sitting on the switched
CXL.mem device that would allow hypervisors to synchronize on various
aspects, such as the "relative page table translation", host up/down
state, the list of peers, who owns what, etc. In a perfect scenario, I
would love to see the hypervisors cooperating on a switched CXL.mem
device the same way CPUs on different NUMA nodes cooperate on memory in
a single hypervisor. If either host can allocate and schedule from this
space, then the "NIL" aspect of migration is "free".
That's... complicated to say the least.
1) Pages on the physical hardware do not need to be contiguous
2) The CFMW on source and target host do not need to be mapped at the
same place
3) There's no pre-allocation in these charts, and migration isn't
targeted, so having the source host "expertly place" the data isn't
possible (right now; I suppose you could make kernel extensions).
4) Similar to problem #2 above, even with a pre-allocate added in, you
would need to ensure those mappings were pinned during migration,
lest the target host end up swapping a page or something.
An Option: Make pages physically contiguous on migration to CXL
In this case, you don't necessarily care about the Host Virtual
Addresses; what you actually care about is the structure of the pages
in memory (are they physically contiguous? or do you need to
reconstruct the contiguity by inspecting the page tables?).
If a migration API were capable of reserving large swaths of contiguous
CXL memory, you could discard individual page information and instead
send page range information, reconstructing the virtual-physical
mappings this way.
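A hedged sketch of what that range information might look like on the
wire (names and layout are illustration-only assumptions):

#include <stdint.h>

/* One physically contiguous run on the shared CXL device. */
struct nil_extent {
	uint64_t gva;		/* guest virtual start of the range */
	uint64_t dev_off;	/* offset into the shared CXL region */
	uint64_t len;		/* length in bytes */
};

/* The destination reconstructs virtual->physical mappings by walking
 * the extents instead of per-page records. */
struct nil_extent_table {
	uint64_t nr_extents;
	struct nil_extent extents[];	/* flexible array member */
};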
Yeah, good points, but this is all tricky... it seems this would
require quiescing the VM, and that is something I would like to avoid
if possible. I'd like to see the VM still executing while all of its
pages are migrated onto the CXL NUMA node on the source hypervisor, and
I would like to see the VM executing on the destination hypervisor
while migrate_pages is moving pages off of CXL. Of course, what you are
describing above would still be a very fast VM migration, but it would
require quiescing.
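For the source side of that, a hedged sketch of nudging a running VM's
pages toward the CXL node from the outside with libnuma (node ids 0 for
DRAM and 1 for CXL are assumptions; link with -lnuma; pinned or busy
pages are simply left behind while the VM keeps executing):

#include <numa.h>
#include <stdio.h>

static int nil_drain_to_cxl(int vm_pid)
{
	struct bitmask *from = numa_allocate_nodemask();
	struct bitmask *to   = numa_allocate_nodemask();
	int ret;

	numa_bitmask_setbit(from, 0);	/* assumed local DRAM node */
	numa_bitmask_setbit(to, 1);	/* assumed CXL.mem node */

	/* Returns the count of pages that could not move, or -1. */
	ret = numa_migrate_pages(vm_pid, from, to);
	if (ret < 0)
		perror("numa_migrate_pages");

	numa_bitmask_free(from);
	numa_bitmask_free(to);
	return ret;
}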
That's about as far as I've thought about it so far. Feel free to rip
it apart! :]
Those are all great thoughts and I appreciate you sharing them. I don't
have all the answers either :)
~Gregory
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla