Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory

Dragan Stancevic <dragan@xxxxxxxxxxxxx> · Thu, 13 Apr 2023 22:32:48 -0500

Hi Gregory-

On 4/10/23 20:48, Gregory Price wrote:
On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:
Hi Gregory-

On 4/7/23 19:05, Gregory Price wrote:
3. This is changing the semantics of migration from a virtual memory
     movement to a physical memory movement.  Typically you would expect
     the RDMA process for live migration to work something like...

     a) migration request arrives
     b) source host informs destination host of size requirements
     c) destination host allocates memory and passes a Virtual Address
        back to source host
     d) source host initiates an RDMA from HostA-VA to HostB-VA
     e) CPU task is migrated

     Importantly, the allocation of memory by Host B handles the important
     step of creating HVA->HPA mappings, and the Extended/Nested Page
     Tables can simply be flushed and re-created after the VM is fully
     migrated.

     too long; didn't read: live migration is a virtual address operation,
     and node-migration is a PHYSICAL address operation, the virtual
     addresses remain the same.

     This is problematic, as it's changing the underlying semantics of the
     migration operation.
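
A toy model may make the distinction concrete. This is only a sketch: `Host`, `live_migrate`, and `node_migrate` are hypothetical names, dictionaries stand in for the HVA->HPA page tables and for physical memory, and nothing here reflects a real RDMA or kernel API.

```python
class Host:
    """Toy host: page_table maps HVA -> HPA, memory maps HPA -> bytes."""

    def __init__(self, name, first_free_hpa):
        self.name = name
        self.page_table = {}
        self.memory = {}
        self.next_hpa = first_free_hpa

    def alloc(self, hva):
        # Allocation is what creates the HVA->HPA mapping on this host.
        hpa = self.next_hpa
        self.next_hpa += 1
        self.page_table[hva] = hpa
        return hpa

    def write(self, hva, data):
        self.memory[self.page_table[hva]] = data

    def read(self, hva):
        return self.memory[self.page_table[hva]]


def live_migrate(src, dst):
    """Virtual-address operation: copy by HVA; dst builds its own mappings."""
    for hva in src.page_table:
        dst.alloc(hva)                 # step (c): destination allocates
        dst.write(hva, src.read(hva))  # step (d): copy HostA-VA -> HostB-VA


def node_migrate(host, hva, new_hpa):
    """Physical-address operation: the HVA stays, the backing HPA changes."""
    old_hpa = host.page_table[hva]
    host.memory[new_hpa] = host.memory.pop(old_hpa)
    host.page_table[hva] = new_hpa
```

In `live_migrate` the destination ends up with whatever HPAs its own allocator chose, while in `node_migrate` the virtual address is untouched and only the physical backing moves.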

Those are all valid points, but what if you don't need to recreate HVA->HPA
mappings? If I am understanding the CXL 3.0 spec correctly, then both
virtual addresses and physical addresses wouldn't have to change. Because
the fabric "virtualizes" host physical addresses and the translation is done
by the G-FAM/GFD that has the capability to translate multi-host HPAs to
its internal DPAs. So if two hypervisors see the same device physical
address at the same host physical address, that might work?
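
To illustrate the idea, here is a minimal sketch of a per-host window that maps host physical addresses to device physical addresses. It assumes a simple linear mapping; `Decoder`, `to_dpa`, and `same_view` are hypothetical names and do not model the actual CXL 3.0 G-FAM/GFD decode logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoder:
    """Hypothetical per-host window: HPAs [hpa_base, hpa_base+size) -> DPAs."""
    hpa_base: int
    dpa_base: int
    size: int

    def to_dpa(self, hpa):
        offset = hpa - self.hpa_base
        if not 0 <= offset < self.size:
            raise ValueError(f"HPA {hpa:#x} is outside this window")
        return self.dpa_base + offset


def same_view(a, b):
    """True if both hosts see the device range at identical HPAs."""
    return (a.hpa_base, a.dpa_base, a.size) == (b.hpa_base, b.dpa_base, b.size)
```

If `same_view` holds for two hypervisors, an HPA recorded on one host refers to the same device byte on the other; if it does not, the addresses have to be adjusted before they can be reused.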

Hm.  I hadn't considered the device side translation (decoders), though
that's obviously a tool in the toolbox.  You still have to know how to
slide ranges of data (which you mention below).

Hmm, do you have any quick thoughts on that?

The reference in this case is... the page tables.  You need to know how
to interpret the data in the CXL memory region on the remote host, and
that's a "relative page table translation" (to coin a phrase? I'm not
sure how to best describe it).
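
One possible reading of "relative page table translation" is rebasing each physical address by the difference between the two hosts' window bases. The sketch below assumes exactly that; `rebase_hpa` and `rebase_mappings` are hypothetical helpers, and a plain dictionary stands in for real page tables.

```python
def rebase_hpa(hpa, src_base, dst_base, size):
    """Translate an HPA inside the source host's window to the destination's."""
    offset = hpa - src_base
    if not 0 <= offset < size:
        raise ValueError(f"HPA {hpa:#x} is not inside the shared window")
    return dst_base + offset


def rebase_mappings(mappings, src_base, dst_base, size):
    """Rewrite a {HVA: HPA} table so it is valid on the destination host."""
    return {hva: rebase_hpa(hpa, src_base, dst_base, size)
            for hva, hpa in mappings.items()}
```

The virtual addresses are left alone; only the physical side is slid by a constant, which is why both hosts need to know each other's base for the region.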

Right, coining phrases... I have been thinking of a "super-page" (for
lack of a better word), a metadata region sitting on the switched CXL.mem
device that would allow hypervisors to synchronize on various aspects, such
as "relative page table translation", host is up, host is down, list of
peers, who owns what, etc. In a perfect scenario, I would love to see the
hypervisors cooperating on a switched CXL.mem device the same way CPUs on
different NUMA nodes cooperate on memory in a single hypervisor. If either
host can allocate and schedule from this space, then the "NIL" aspect of
migration is "free".
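
As a rough illustration of what such a metadata region might hold, here is a sketch of a fixed binary header. The layout is entirely an assumption made for this example (the magic value, field order, and `MAX_PEERS` are all hypothetical), and it deliberately ignores the hard part, which is synchronizing updates between hosts.

```python
import struct

# Hypothetical layout, little-endian: magic, version, owner host id, peer count,
# then one state byte per peer slot (0 = down, 1 = up).
MAX_PEERS = 8
HEADER = struct.Struct(f"<4sHHH{MAX_PEERS}B")
MAGIC = b"SPG0"


def pack_super_page(version, owner, peer_states):
    """Serialize the header; unused peer slots are zero-filled."""
    if len(peer_states) > MAX_PEERS:
        raise ValueError("too many peers")
    padded = list(peer_states) + [0] * (MAX_PEERS - len(peer_states))
    return HEADER.pack(MAGIC, version, owner, len(peer_states), *padded)


def unpack_super_page(buf):
    """Parse the header and return only the peer slots that are in use."""
    magic, version, owner, count, *states = HEADER.unpack_from(buf)
    if magic != MAGIC:
        raise ValueError("not a super-page")
    return {"version": version, "owner": owner, "peers": states[:count]}
```

A fixed-size, explicitly little-endian header is chosen here so that every host would parse the same bytes the same way.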

The core of the problem is still that each of the hosts has to agree on
the location (physically) of this region of memory, which could be
problematic unless you have very strong BIOS and/or kernel driver
controls to ensure certain devices are guaranteed to be mapped into
certain spots in the CFMW.

Right, true. The way I am thinking of it is that this would be part of
data-center ops setup, which at first pass would be a somewhat manual
setup, the same way as other pre-OS setup. Later on down the road,
perhaps this could be automated, either through some pre-agreed
auto-range detection or similar. It's not unusual for DC ops to name
hypervisors depending on where in the DC/rack/etc. they sit.

After that it's a matter of treating this memory as incoherent shared
memory and handling ownership in a safe way.  If the memory is only used
for migrations, then you don't have to worry about performance.

So I agree, as long as shared memory mapped into the same CFMW area is
used, this mechanism is totally sound.

My main concern is that I don't know of a mechanism to ensure that.  I
suppose for those interested, and with special BIOS/EFI, you could do
that - but I think that's going to be a tall ask in a heterogeneous cloud
environment.

Yeah, I get that. But in my experience even heterogeneous setups have
some level of homogeneity, whether it's per rack or per pod. As old
things are sunset and new things are brought in, it gives you these
segments of homogeneity with more or less advanced features. So at the
end of the day, if someone wants a feature X they will need to
understand the feature requirements or limitations. I feel like I deal
with hardware/feature fragmentation all the time, but that doesn't
preclude bringing newer things in. You just have to plan for it
appropriately.

That's... complicated to say the least.

<... snip ...>

An Option:  Make pages physically contiguous on migration to CXL

In this case, you don't necessarily care about the Host Virtual
Addresses, what you actually care about is the structure of the pages
in memory (are they physically contiguous? or do you need to
reconstruct the contiguity by inspecting the page tables?).

If a migration API were capable of reserving large swaths of contiguous
CXL memory, you could discard individual page information and instead
send page range information, reconstructing the virtual-physical
mappings this way.
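
The range idea can be sketched as run-length encoding of the page mappings. This is an illustration only: `mappings_to_ranges` and `ranges_to_mappings` are hypothetical helpers, and page numbers in a dictionary stand in for real page table entries.

```python
def mappings_to_ranges(mappings):
    """Collapse {virtual page: physical page} into (vpn, pfn, count) runs."""
    ranges = []
    for vpn in sorted(mappings):
        pfn = mappings[vpn]
        if ranges:
            start_vpn, start_pfn, count = ranges[-1]
            # Extend the current run only if both sides stay contiguous.
            if vpn == start_vpn + count and pfn == start_pfn + count:
                ranges[-1] = (start_vpn, start_pfn, count + 1)
                continue
        ranges.append((vpn, pfn, 1))
    return ranges


def ranges_to_mappings(ranges):
    """Rebuild the per-page table on the receiving side."""
    return {vpn + i: pfn + i for vpn, pfn, count in ranges for i in range(count)}
```

The more physically contiguous the pages are on the CXL device, the fewer runs need to be sent; in the best case a whole guest region collapses into a single tuple.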

yeah, good points, but this is all tricky though... it seems this would
require quiescing the VM and that is something I would like to avoid if
possible. I'd like to see the VM still executing while all of its pages are
migrated onto CXL NUMA on the source hypervisor. And I would like to see the
VM executing on the destination hypervisor while migrate_pages is moving
pages off of CXL. Of course, what you are describing above would still be a
very fast VM migration, but would require quiescing.

Possibly.  If you're going to quiesce you're probably better off just
snapshotting to shared memory and migrating the snapshot.

That is exactly my thought too.

Maybe that's the better option for a first-pass migration mechanism.  I
don't know.

I definitely see your point about a "canning" and "re-hydration" approach
as a first pass. I'd be happy with even just a "Hello World" page
migration as a first pass :)

Anyway, would love to attend this session.

~Gregory

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla
