Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory

On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:
> Hi Gregory-
> 
> On 4/7/23 19:05, Gregory Price wrote:
> > 3. This is changing the semantics of migration from a virtual memory
> >     movement to a physical memory movement.  Typically you would expect
> >     the RDMA process for live migration to work something like...
> > 
> >     a) migration request arrives
> >     b) source host informs destination host of size requirements
> >     c) destination host allocates memory and passes a Virtual Address
> >        back to source host
> >     d) source host initiates an RDMA from HostA-VA to HostB-VA
> >     e) CPU task is migrated
> > 
> >     Importantly, the allocation of memory by Host B handles the critical
> >     step of creating HVA->HPA mappings, and the Extended/Nested Page
> >     Tables can simply be flushed and re-created after the VM is fully
> >     migrated.
> > 
> >     too long; didn't read: live migration is a virtual address operation,
> >     and node-migration is a PHYSICAL address operation; the virtual
> >     addresses remain the same.
> > 
> >     This is problematic, as it's changing the underlying semantics of the
> >     migration operation.
> 
> Those are all valid points, but what if you don't need to recreate HVA->HPA
> mappings? If I am understanding the CXL 3.0 spec correctly, then neither the
> virtual addresses nor the physical addresses would have to change, because
> the fabric "virtualizes" host physical addresses and the translation is done
> by the G-FAM/GFD, which has the capability to translate multi-host HPAs to
> its internal DPAs. So if you have two hypervisors seeing the device physical
> addresses as the same host physical addresses, that might work?
> 
> 

Hm.  I hadn't considered the device-side translation (decoders), though
that's obviously a tool in the toolbox.  You still have to know how to
slide ranges of data (which you mention below).
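
Just to make the "sliding" concrete, here's roughly what I mean - the
cxl_region_view/slide_hpa names, bases, and sizes are all made up for
illustration, and if both hosts happen to map the shared region at the same
HPA this degenerates to a no-op:

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical description of one shared CXL region as each host sees it.
 * Both hosts back the region with the same device (same DPA range), but
 * may map it at different host physical addresses in their CFMW.
 */
struct cxl_region_view {
	uint64_t hpa_base;	/* where this host's CFMW maps the region */
	uint64_t size;		/* region size in bytes */
};

/*
 * Translate a source-host HPA into the destination-host HPA for the same
 * byte of device memory: strip the source base, add the destination base.
 */
static int slide_hpa(const struct cxl_region_view *src,
		     const struct cxl_region_view *dst,
		     uint64_t src_hpa, uint64_t *dst_hpa)
{
	uint64_t off;

	if (src_hpa < src->hpa_base || src_hpa - src->hpa_base >= src->size)
		return -1;		/* not inside the shared region */

	off = src_hpa - src->hpa_base;	/* device-relative offset */
	if (off >= dst->size)
		return -1;

	*dst_hpa = dst->hpa_base + off;
	return 0;
}

int main(void)
{
	/* Example bases are made up for illustration only. */
	struct cxl_region_view hostA = { .hpa_base = 0x2080000000ULL, .size = 1ULL << 34 };
	struct cxl_region_view hostB = { .hpa_base = 0x3040000000ULL, .size = 1ULL << 34 };
	uint64_t dst;

	if (!slide_hpa(&hostA, &hostB, 0x2080123000ULL, &dst))
		printf("host B HPA: 0x%llx\n", (unsigned long long)dst);
	return 0;
}

The point being: if the bases differ, every page table entry (or page-range
descriptor) has to be rebased by that delta when it's interpreted on the
destination host.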

> 
> > The reference in this case is... the page tables.  You need to know how
> > to interpret the data in the CXL memory region on the remote host, and
> > that's a "relative page table translation" (to coin a phrase? I'm not
> > sure how to best describe it).
> 
> right, coining phrases... I have been thinking of a "super-page" (for
> lack of a better word): a metadata region sitting on the switched CXL.mem
> device that would allow hypervisors to synchronize on various aspects, such
> as "relative page table translation", host is up, host is down, list of
> peers, who owns what, etc... In a perfect scenario, I would love to see the
> hypervisors cooperating on the switched CXL.mem device the same way CPUs on
> different NUMA nodes cooperate on memory within a single hypervisor. If
> either host can allocate and schedule from this space, then the "NIL" aspect
> of migration is "free".
> 
> 

The core of the problem is still that each of the hosts has to agree on
the location (physically) of this region of memory, which could be
problematic unless you have very strong BIOS and/or kernel driver
controls to ensure certain devices are guaranteed to be mapped into
certain spots in the CFMW.

After that it's a matter of treating this memory as incoherent shared
memory and handling ownership in a safe way.  If the memory is only used
for migrations, then you don't have to worry about performance.

So I agree, as long as shared memory mapped into the same CFMW area is
used, this mechanism is totally sound.
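
To sketch what I mean by handling ownership safely: something like the
following, where the header layout, the 64-byte flush granularity, and the
assumption that there is no cross-host hardware coherence are all mine for
illustration (real hardware/firmware may offer better primitives, e.g. the
CXL 3.0 back-invalidate flows):

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>	/* _mm_clflush / _mm_sfence (x86) */

/* Hypothetical header at the start of the shared migration region. */
struct migration_region_hdr {
	uint64_t owner_host_id;		/* which host may write the payload */
	uint64_t payload_valid;		/* set by source when data is complete */
};

/* Push dirty lines out to the device so the peer host can see them. */
void writeback(volatile void *p, size_t len)
{
	for (size_t i = 0; i < len; i += 64)
		_mm_clflush((const void *)((const char *)p + i));
	_mm_sfence();
}

/* Source host: publish the payload, then hand ownership to the peer. */
void handoff(volatile struct migration_region_hdr *hdr, uint64_t peer_id)
{
	hdr->payload_valid = 1;
	writeback((volatile void *)&hdr->payload_valid, sizeof(hdr->payload_valid));

	hdr->owner_host_id = peer_id;
	writeback((volatile void *)&hdr->owner_host_id, sizeof(hdr->owner_host_id));
}

/* Destination host: drop any stale cached copy, then poll for ownership. */
int take_ownership(volatile struct migration_region_hdr *hdr, uint64_t my_id)
{
	_mm_clflush((const void *)&hdr->owner_host_id);
	return hdr->owner_host_id == my_id && hdr->payload_valid;
}

Flushing on every update is obviously slow, which is exactly why the "only
used for migrations, so performance doesn't matter" point above makes this
tolerable.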

My main concern is that I don't know of a mechanism to ensure that.  I
suppose for those interested, and with special BIOS/EFI, you could do
that - but I think that's going to be a tall ask in a heterogeneous cloud
environment.
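
The closest thing I know of to a sanity check (not a guarantee) is comparing
what each host reports for its fixed memory windows.  A rough sketch is
below; note that the "CXL Window" resource name is my assumption about how
current kernels label CFMWS ranges in /proc/iomem, and the addresses are
only visible to root:

#include <stdio.h>
#include <string.h>

/*
 * Print the host physical ranges of CXL fixed memory windows as reported
 * in /proc/iomem, e.g. "2080000000-307fffffff : CXL Window 0".
 * Comparing this output across hosts only tells you the windows line up,
 * not that both map the same device at the same HPA.
 */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/iomem", "r");

	if (!f) {
		perror("fopen /proc/iomem");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "CXL Window"))
			fputs(line, stdout);
	}

	fclose(f);
	return 0;
}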

> > That's... complicated to say the least.
> > 
> > <... snip ...>
> > 
> > An Option:  Make pages physically contiguous on migration to CXL
> > 
> > In this case, you don't necessarily care about the Host Virtual
> > Addresses; what you actually care about is the structure of the pages
> > in memory (are they physically contiguous, or do you need to
> > reconstruct the contiguity by inspecting the page tables?).
> > 
> > If a migration API were capable of reserving large swaths of contiguous
> > CXL memory, you could discard individual page information and instead
> > send page range information, reconstructing the virtual-physical
> > mappings this way.
> 
> yeah, good points, but this is all tricky... it seems this would
> require quiescing the VM, and that is something I would like to avoid if
> possible. I'd like to see the VM still executing while all of its pages are
> migrated onto the CXL NUMA node on the source hypervisor. And I would like
> to see the VM executing on the destination hypervisor while migrate_pages is
> moving pages off of CXL. Of course, what you are describing above would
> still be a very fast VM migration, but it would require quiescing.
> 
>

Possibly.  If you're going to quiesce, you're probably better off just
snapshotting to shared memory and migrating the snapshot.

Maybe that's the better option for a first-pass migration mechanism.  I
don't know.
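
For reference, the source-side migrate_pages step you describe above looks
roughly like this from userspace via libnuma (link with -lnuma); the node
numbers are placeholders and the CXL memory is assumed to show up as an
ordinary NUMA node:

#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

/*
 * Rough sketch of the source-side step: while the VM keeps running, push
 * its pages from the local DRAM node(s) onto the CXL-backed NUMA node.
 * The destination host would run the inverse to pull pages off of CXL.
 */
int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <from-nodes> <to-cxl-node>\n", argv[0]);
		return 1;
	}

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	int pid = atoi(argv[1]);
	struct bitmask *from = numa_parse_nodestring(argv[2]);	/* e.g. "0-1" */
	struct bitmask *to = numa_parse_nodestring(argv[3]);	/* e.g. "2" (CXL) */

	if (!from || !to) {
		fprintf(stderr, "bad node string\n");
		return 1;
	}

	/* Wraps the migrate_pages(2) syscall; moves pages while the VM runs. */
	if (numa_migrate_pages(pid, from, to) < 0) {
		perror("numa_migrate_pages");
		return 1;
	}

	numa_bitmask_free(from);
	numa_bitmask_free(to);
	return 0;
}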

Anyway, would love to attend this session.

~Gregory



