Hi Gregory-
On 4/7/23 19:05, Gregory Price wrote:
On Fri, Apr 07, 2023 at 04:05:31PM -0500, Dragan Stancevic wrote:
Hi folks-
if it's not too late for the schedule...
I am starting to tackle VM live migration and hypervisor clustering over
switched CXL memory[1][2], intended for cloud virtualization workloads.
I'd be interested in doing a small BoF session with some slides and
getting into a discussion/brainstorm with other people who deal with
VM/LM cloud workloads. Among other things to discuss would be page
migration over switched CXL memory, a shared in-memory ABI to allow VM
hand-off between hypervisors, etc.
A few of us discussed some of this under the ZONE_XMEM thread, but I figured
it might be better to start a separate thread.
If there is interest, thank you.
[1]. High-level overview available at http://nil-migration.org/
[2]. Based on CXL spec 3.0
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla
I've been chatting about this with folks offline, figure I'll toss my
thoughts on the issue here.
excellent brain dump, thank you
Some things to consider:
1. If secure-compute is being used, then this mechanism won't work, as
pages will be pinned, therefore not movable, and excluded from using
CXL memory at all.
This issue does not exist with traditional live migration, because
typically some kind of copy is used from one virtual space to another
(e.g. RDMA), so pages aren't really migrated in the kernel memory
block/NUMA-node sense.
Right, agreed... I don't think we can migrate in all scenarios, such as
pinning or forms of pass-through, etc. My opinion, just to start off,
would be that as a base requirement the pages be movable.
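As a minimal sketch of that base requirement, assuming a kernel-side
filter (nil_can_migrate_folio() is a hypothetical helper; the predicates
it calls are existing kernel ones, but the policy is an assumption):

/*
 * Hedged sketch: skip folios that can never move to CXL memory.
 * folio_maybe_dma_pinned() can false-positive on very high refcounts,
 * so this errs on the side of leaving pages where they are.
 */
#include <linux/mm.h>
#include <linux/page-flags.h>

static bool nil_can_migrate_folio(struct folio *folio)
{
	/* FOLL_PIN'd pages (secure compute, RDMA buffers) must stay put. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/* Only LRU-managed, evictable folios behave like ZONE_MOVABLE. */
	if (!folio_test_lru(folio) || folio_test_unevictable(folio))
		return false;

	return true;
}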
2. During the migration process, the memory needs to be forced not to be
migrated to another node by other means (tiering software, swap,
etc). The obvious way of doing this would be to migrate and
temporarily pin the page... but going back to problem #1 we see that
ZONE_MOVABLE and Pinning are mutually exclusive. So that's
troublesome.
Yeah, true. I'd have to check the code, but I wonder if perhaps we could
mapcount or refcount the pages upon migration onto CXL switched memory.
If my memory serves me right, wouldn't move_pages back off or stall?
I guess it's TBD how workable or useful that would be, but it's good to
be thinking of different ways of doing this.
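A hedged sketch of the refcount idea (nil_hold()/nil_release() are
hypothetical names): migration freezes the folio refcount against the
expected count, so holding an extra, unaccounted reference makes
move_pages/migrate_pages back off with -EAGAIN instead of moving the
page, without going through FOLL_PIN:

#include <linux/mm.h>

/* Take an extra reference for the hand-off window; migration's
 * refcount freeze then fails and the page stays put. */
static void nil_hold(struct folio *folio)
{
	folio_get(folio);
}

/* Drop the extra reference; migration can proceed again. */
static void nil_release(struct folio *folio)
{
	folio_put(folio);
}

Because this isn't FOLL_PIN, it wouldn't collide with the
ZONE_MOVABLE-vs-pinning exclusion from problem #1, though whether that
would be acceptable to the mm folks is an open question.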
3. This is changing the semantics of migration from a virtual memory
movement to a physical memory movement. Typically you would expect
the RDMA process for live migration to work something like...
a) migration request arrives
b) source host informs destination host of size requirements
c) destination host allocates memory and passes a Virtual Address
back to source host
d) source host initiates an RDMA from HostA-VA to HostB-VA
e) CPU task is migrated
Importantly, the allocation of memory by Host B handles the step of
creating HVA->HPA mappings, and the Extended/Nested Page Tables can
simply be flushed and re-created after the VM is fully migrated.
Too long; didn't read: live migration is a virtual address operation,
and node migration is a PHYSICAL address operation; the virtual
addresses remain the same.
This is problematic, as it's changing the underlying semantics of the
migration operation.
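To make the distinction concrete, a minimal userspace sketch with
move_pages(2): the buffer's virtual address never changes, only the
node backing it does (node 1 standing in for a CXL node is an
assumption; link with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	void *buf;

	if (posix_memalign(&buf, page_sz, page_sz))
		return 1;
	((char *)buf)[0] = 42;			/* fault the page in */

	void *pages[1]  = { buf };
	int   nodes[1]  = { 1 };		/* assumed CXL node id */
	int   status[1] = { -1 };

	/* Physical move; buf is the same virtual address afterwards. */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	else
		printf("page on node %d, still mapped at %p\n",
		       status[0], buf);
	return 0;
}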
Those are all valid points, but what if you don't need to recreate
HVA->HPA mappings? If I am understanding the CXL 3.0 spec correctly,
then neither the virtual nor the physical addresses would have to
change, because the fabric "virtualizes" host physical addresses and
the translation is done by the G-FAM/GFD, which has the capability to
translate multi-host HPAs to its internal DPAs. So if you have two
hypervisors seeing the device physical address as the same physical
address, that might work?
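Even if the HPAs don't end up identical on both hosts, the two sides
only need to agree on an offset into the shared region; a hedged sketch
of that "relative" address math (all names here are assumptions):

#include <stdint.h>

/* Hypothetical per-host view of one shared CXL region: each host may
 * map the same device memory at a different HPA base. */
struct nil_region_view {
	uint64_t hpa_base;	/* where this host's CFMW maps the region */
	uint64_t len;		/* region size in bytes */
};

/* This host's HPA -> device-relative offset both hosts agree on. */
static inline uint64_t nil_hpa_to_offset(const struct nil_region_view *v,
					 uint64_t hpa)
{
	return hpa - v->hpa_base;
}

/* ...and back, using the destination host's own mapping. */
static inline uint64_t nil_offset_to_hpa(const struct nil_region_view *v,
					 uint64_t off)
{
	return v->hpa_base + off;
}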
Problem #1 and #2 are head-scratchers, but maybe solvable.
Problem #3 is the meat and potatoes of the issue in my opinion. So
let's consider that a little more closely.
Generically: NIL Migration is basically a pass-by-reference operation.
Yup, agreed
The reference in this case is... the page tables. You need to know how
to interpret the data in the CXL memory region on the remote host, and
that's a "relative page table translation" (to coin a phrase? I'm not
sure how to best describe it).
Right, coining phrases... I have been thinking of a "super-page" (for
lack of a better word): a metadata region sitting on the switched
CXL.mem device that would allow hypervisors to synchronize on various
aspects, such as the "relative page table translation", host up/down
state, the list of peers, who owns what, etc. In a perfect scenario, I
would love to see the hypervisors cooperating on a switched CXL.mem
device the same way CPUs on different NUMA nodes cooperate on memory in
a single hypervisor. If either host can allocate and schedule from this
space, then the "NIL" aspect of migration is "free".
That's... complicated to say the least.
1) Pages on the physical hardware do not need to be contiguous
2) The CFMW on source and target host do not need to be mapped at the
same place
3) There's no pre-allocation in these charts, and migration isn't
targeted, so having the source host "expertly place" the data isn't
possible (right now; I suppose you could make kernel extensions).
4) Similar to problem #2 above, even with a pre-allocate added in, you
would need to ensure those mappings were pinned during migration,
lest the target host end up swapping a page or something.
An Option: Make pages physically contiguous on migration to CXL
In this case, you don't necessarily care about the Host Virtual
Addresses; what you actually care about is the structure of the pages
in memory (are they physically contiguous? or do you need to
reconstruct the contiguity by inspecting the page tables?).
If a migration API were capable of reserving large swaths of contiguous
CXL memory, you could discard individual page information and instead
send page range information, reconstructing the virtual-physical
mappings this way.
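A hedged sketch of what that range information might look like on the
wire (names and layout are illustration-only assumptions):

#include <stdint.h>

/* One physically contiguous run on the shared CXL device. */
struct nil_extent {
	uint64_t gva;		/* guest virtual start of the range */
	uint64_t dev_off;	/* offset into the shared CXL region */
	uint64_t len;		/* length in bytes */
};

/* The destination reconstructs virtual->physical mappings by walking
 * the extents instead of per-page records. */
struct nil_extent_table {
	uint64_t nr_extents;
	struct nil_extent extents[];	/* flexible array member */
};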
Yeah, good points, but this is all tricky... it seems this would
require quiescing the VM, and that is something I would like to avoid
if possible. I'd like to see the VM still executing while all of its
pages are migrated onto the CXL NUMA node on the source hypervisor, and
I would like to see the VM executing on the destination hypervisor
while migrate_pages is moving pages off of CXL. Of course, what you are
describing above would still be a very fast VM migration, but it would
require quiescing.
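For the source side of that, a hedged sketch of nudging a running VM's
pages toward the CXL node from the outside with libnuma (node ids 0 for
DRAM and 1 for CXL are assumptions; link with -lnuma; pinned or busy
pages are simply left behind while the VM keeps executing):

#include <numa.h>
#include <stdio.h>

static int nil_drain_to_cxl(int vm_pid)
{
	struct bitmask *from = numa_allocate_nodemask();
	struct bitmask *to   = numa_allocate_nodemask();
	int ret;

	numa_bitmask_setbit(from, 0);	/* assumed local DRAM node */
	numa_bitmask_setbit(to, 1);	/* assumed CXL.mem node */

	/* Returns the count of pages that could not move, or -1. */
	ret = numa_migrate_pages(vm_pid, from, to);
	if (ret < 0)
		perror("numa_migrate_pages");

	numa_bitmask_free(from);
	numa_bitmask_free(to);
	return ret;
}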
That's about as far as I've thought about it so far. Feel free to rip
it apart! :]
Those are all great thoughts and I appreciate you sharing them. I don't
have all the answers either :)
~Gregory
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla