Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory

Gregory Price <gregory.price@xxxxxxxxxxxx> · Fri, 7 Apr 2023 20:05:50 -0400

On Fri, Apr 07, 2023 at 04:05:31PM -0500, Dragan Stancevic wrote:
> Hi folks-
> 
> if it's not too late for the schedule...
> 
> I am starting to tackle VM live migration and hypervisor clustering over
> switched CXL memory[1][2], intended for cloud virtualization types of loads.
> 
> I'd be interested in doing a small BoF session with some slides and get into
> a discussion/brainstorming with other people that deal with VM/LM cloud
> loads. Among other things to discuss would be page migrations over switched
> CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors,
> etc...
> 
> A few of us discussed some of this under the ZONE_XMEM thread, but I figured
> it might be better to start a separate thread.
> 
> If there is interest, thank you.
> 
> 
> [1]. High-level overview available at http://nil-migration.org/
> [2]. Based on CXL spec 3.0
> 
> --
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla

I've been chatting about this with folks offline, so I figured I'd toss
my thoughts on the issue here.

Some things to consider:

1. If secure-compute is being used, then this mechanism won't work as
   pages will be pinned, and therefore not movable and excluded from
   using CXL memory at all.

   This issue does not exist with traditional live migration, because
   typically some kind of copy is used from one virtual space to another
   (e.g. RDMA), so pages aren't really migrated in the kernel memory
   block/numa node sense.

2. During the migration process, the memory must be prevented from
   being migrated to another node by other means (tiering software,
   swap, etc).  The obvious way of doing this would be to migrate and
   temporarily pin the page... but going back to problem #1 we see that
   ZONE_MOVABLE and pinning are mutually exclusive.  So that's
   troublesome.

3. This is changing the semantics of migration from a virtual memory
   movement to a physical memory movement.  Typically you would expect
   the RDMA process for live migration to work something like...

   a) migration request arrives
   b) source host informs destination host of size requirements
   c) destination host allocates memory and passes a Virtual Address
      back to source host
   d) source host initiates an RDMA from HostA-VA to HostB-VA
   e) CPU task is migrated

   Importantly, the allocation of memory by Host B handles the important
   step of creating HVA->HPA mappings, and the Extended/Nested Page
   Tables can simply be flushed and re-created after the VM is fully
   migrated.

   tl;dr: live migration is a virtual address operation, while
   node-migration is a PHYSICAL address operation where the virtual
   addresses remain the same.

   This is problematic, as it's changing the underlying semantics of the
   migration operation.
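
A rough sketch of steps (a)-(e), as a toy Python model rather than
anything resembling real migration code.  Host, rdma_copy and
live_migrate are made-up names, the addresses are arbitrary, and the
"RDMA" is just a per-page byte copy.  It only illustrates that the
copy is addressed VA to VA and that each host resolves its own
physical pages:

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096


@dataclass
class Host:
    """Toy host with its own HVA -> HPA page table (all names hypothetical)."""
    name: str
    next_hva: int
    next_hpa: int
    page_table: dict = field(default_factory=dict)  # HVA -> HPA
    memory: dict = field(default_factory=dict)      # HPA -> page bytes

    def allocate(self, npages: int) -> int:
        """Step (c): allocating memory is what creates the HVA -> HPA mappings."""
        base = self.next_hva
        for i in range(npages):
            self.page_table[base + i * PAGE_SIZE] = self.next_hpa
            self.memory[self.next_hpa] = bytes(PAGE_SIZE)
            self.next_hpa += PAGE_SIZE
        self.next_hva += npages * PAGE_SIZE
        return base

    def store(self, hva: int, data: bytes) -> None:
        self.memory[self.page_table[hva]] = data.ljust(PAGE_SIZE, b"\0")

    def load(self, hva: int) -> bytes:
        return self.memory[self.page_table[hva]]


def rdma_copy(src: Host, src_va: int, dst: Host, dst_va: int, npages: int) -> None:
    """Step (d): copy VA to VA; each side looks up its own physical pages."""
    for i in range(npages):
        off = i * PAGE_SIZE
        dst.store(dst_va + off, src.load(src_va + off))


def live_migrate(src: Host, src_va: int, dst: Host, npages: int) -> int:
    """Steps (a)-(d); step (e), moving the CPU task, is not modelled."""
    dst_va = dst.allocate(npages)                 # (b) + (c)
    rdma_copy(src, src_va, dst, dst_va, npages)   # (d)
    return dst_va


host_a = Host("A", next_hva=0x7F0000000000, next_hpa=0x100000)
host_b = Host("B", next_hva=0x7E0000000000, next_hpa=0x900000)

guest_va = host_a.allocate(2)
host_a.store(guest_va, b"guest page 0")
host_a.store(guest_va + PAGE_SIZE, b"guest page 1")

new_va = live_migrate(host_a, guest_va, host_b, 2)
```

Neither host ever learns the other's physical addresses here, which
is exactly the property a physical, node-level migration gives up.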

Problems #1 and #2 are head-scratchers, but maybe solvable.

Problem #3 is the meat and potatoes of the issue in my opinion. So let's
consider that a little more closely.

Generically: NIL Migration is basically a pass by reference operation.

The reference in this case is... the page tables.  You need to know how
to interpret the data in the CXL memory region on the remote host, and
that's a "relative page table translation" (to coin a phrase? I'm not
sure how to best describe it).

That's... complicated to say the least.
1) Pages on the physical hardware do not need to be contiguous
2) The CFMW on source and target host do not need to be mapped at the
   same place
3) There's no pre-allocation in these charts, and migration isn't
   targeted, so having the source-host "expertly place" the data isn't
   possible (right now, I suppose you could make kernel extensions).
4) Similar to problem #2 above, even with a pre-allocate added in, you
   would need to ensure those mappings were pinned during migration,
   lest the target host end up swapping a page or something.
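
One way to picture the "relative" translation, given points 1) and 2):
exchange offsets into the shared CXL window instead of host physical
addresses, and let each host add its own window base.  This is a toy
sketch under a big assumption, namely that a given offset into the
window refers to the same device memory on both hosts (interleaving
and decoder setup are ignored).  The bases, the size and both function
names are made up:

```python
PAGE_SIZE = 4096


def hpa_to_offset(hpa: int, cfmw_base: int, cfmw_size: int) -> int:
    """Convert a host physical address into an offset within that host's window."""
    if not cfmw_base <= hpa < cfmw_base + cfmw_size:
        raise ValueError(f"HPA {hpa:#x} is outside the CXL window")
    return hpa - cfmw_base


def rebase_mappings(src_table: dict, src_base: int, dst_base: int,
                    cfmw_size: int) -> dict:
    """Rewrite a {guest page address: source HPA} table into target-host HPAs."""
    return {
        gpa: dst_base + hpa_to_offset(hpa, src_base, cfmw_size)
        for gpa, hpa in src_table.items()
    }


# Hypothetical windows: same device memory, mapped at different bases.
SRC_CFMW_BASE = 0x20_0000_0000
DST_CFMW_BASE = 0x48_0000_0000
CFMW_SIZE = 1 << 30

# The backing pages are deliberately not physically contiguous.
src_table = {
    0x0000: SRC_CFMW_BASE + 0x5000,
    0x1000: SRC_CFMW_BASE + 0x2000,
    0x2000: SRC_CFMW_BASE + 0x9000,
}

dst_table = rebase_mappings(src_table, SRC_CFMW_BASE, DST_CFMW_BASE, CFMW_SIZE)
```

Even in this simplified form the target still needs one entry per
page, and nothing stops either host from moving a page while the table
is in flight.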

An Option:  Make pages physically contiguous on migration to CXL

In this case, you don't necessarily care about the Host Virtual
Addresses, what you actually care about is the structure of the pages
in memory (are they physically contiguous? or do you need to
reconstruct the contiguity by inspecting the page tables?).

If a migration API were capable of reserving large swaths of contiguous
CXL memory, you could discard individual page information and instead
send page range information, reconstructing the virtual-physical
mappings this way.
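
A small sketch of what "send page ranges instead of pages" could look
like.  No such migration API exists today as far as this thread goes;
coalesce and expand are hypothetical helpers, and the mapping is
modelled as a plain dict of virtual page address to CXL offset:

```python
PAGE_SIZE = 4096


def coalesce(page_map: dict) -> list:
    """Collapse {virtual page addr: CXL offset} into (va, offset, npages) extents.

    Adjacent entries merge only when both the virtual address and the CXL
    offset advance by exactly one page.
    """
    extents = []
    for va in sorted(page_map):
        off = page_map[va]
        if extents:
            ext_va, ext_off, n = extents[-1]
            if va == ext_va + n * PAGE_SIZE and off == ext_off + n * PAGE_SIZE:
                extents[-1] = (ext_va, ext_off, n + 1)
                continue
        extents.append((va, off, 1))
    return extents


def expand(extents: list) -> dict:
    """Rebuild the per-page mapping on the receiving side from the extents."""
    return {
        va + i * PAGE_SIZE: off + i * PAGE_SIZE
        for va, off, n in extents
        for i in range(n)
    }


# Hypothetical layout: a 4-page contiguous reservation plus one stray page.
page_map = {0x10000 + i * PAGE_SIZE: 0x400000 + i * PAGE_SIZE for i in range(4)}
page_map[0x20000] = 0x7000

extents = coalesce(page_map)
```

With a fully contiguous reservation the whole guest collapses to a
single extent, so what has to cross between hosts is a base, an offset
and a length rather than a page table.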

That's about as far as I've thought about it so far.  Feel free to rip
it apart! :]

~Gregory
