Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> Fundamentally, the entity that should be deciding what memory should be present
> and where it should located is the kernel. I'm fundamentally opposed to trying
> to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
>
> From what I can tell about ms_mbind(), it just uses process knowledge to bind
> specific areas of memory to a memsched group and let's the kernel decide what to
> do with that knowledge. This is exactly the type of interface that QEMU should
> be using.
>
> QEMU should tell the kernel enough information such that the kernel can make
> good decisions. QEMU should not be the one making the decisions.

True, QEMU won't have to decide where the memory and vcpus should be located (though it wouldn't need to decide that even with cpusets: you can use relative mbind with cpusets, and the admin or a cpuset job scheduler could decide), but it is still QEMU making the decision of which memory and which vcpu threads to ms_mbind/ms_tbind. Think about how you're going to create the input of those syscalls... If it weren't QEMU deciding that, QEMU wouldn't be required to scan the whole host physical NUMA topology (cpu/memory) in order to create the "input" arguments of ms_mbind/ms_tbind.

And when you migrate the VM to another host, the whole vtopology may become counter-productive, because the kernel isn't automatically detecting the NUMA affinity between threads, and the guest vtopology will stick to whatever _physical_ NUMA topology was seen on the first node where the VM was created. I doubt the assumption that all cloud nodes will have the same physical NUMA topology is a reasonable one.

Furthermore, for the guest to get the same benefits that QEMU gets on the host from ms_mbind/ms_tbind, every single guest application would have to be modified to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the hard bindings, which is what we are trying to avoid). I think it's unreasonable to expect all applications to use ms_mbind/ms_tbind in the guest; at best guest apps will use cpusets or wrappers, and few apps will be modified for sys_ms_tbind/mbind. You can always have the supercomputer case of a single optimized app in a single VM spanning the whole host, but in that scenario hard bindings would work perfectly too.

In my view the trouble with the NUMA hard bindings is not that they're hard and QEMU also has to decide the location (in fact it doesn't need to decide the location if you use cpusets and relative mbinds). The bigger problem is that either the admin or the app developer has to explicitly scan the physical NUMA topology (both cpus and memory) and tell the kernel how much memory to bind to each thread. ms_mbind/ms_tbind only partially solve that problem. They're similar to mbind MPOL_F_RELATIVE_NODES with cpusets, except you don't need an admin or a cpuset job scheduler (or a perl script) to redistribute the hardware resources. Now, dealing with bindings isn't a big deal for QEMU, and in fact this API is pretty much ideal for QEMU, but it won't make life substantially easier compared to hard bindings: the management work that is now done with a perl script simply moves into the kernel. It looks like an incremental improvement over relative mbind+cpusets, but I'm unsure it's the best we could aim for and what we really need in virt, considering we also deal with VM migration.
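Just to make the comparison concrete, this is roughly what the existing "relative mbind with cpusets" alternative looks like from the process side. Illustrative sketch only, not QEMU code: the 256M size and the single relative node are arbitrary example values, it needs a kernel >= 2.6.26 plus a numaif.h that defines MPOL_F_RELATIVE_NODES, and it links with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;		/* 256M, arbitrary example size */
	unsigned long nodemask = 1UL << 0;	/* "my 0th allowed node" */
	void *ram;

	ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * MPOL_F_RELATIVE_NODES makes the nodemask relative to the
	 * cpuset this process runs in, so the code never has to scan
	 * the host physical NUMA topology: the admin (or a cpuset job
	 * scheduler) decides which physical node "relative node 0"
	 * really is.
	 */
	if (mbind(ram, len, MPOL_BIND | MPOL_F_RELATIVE_NODES,
		  &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	memset(ram, 0, len);	/* touch it so pages are allocated under the policy */
	return 0;
}

The point is only that the placement knowledge stays with whoever set up the cpuset, not with the process calling mbind.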
The real long term design to me is not to add more syscalls, but to initially handle the case of a process/VM spanning not more than one node in both thread count and amount of memory. That's not too hard, and in fact I already have scheduler benchmarks showing it to work pretty well (it creates a too-strict affinity, but it can be relaxed to be more useful). Then later add some mechanism (the simplest is page faults at low frequency) to create a guest_vcpu_thread<->host_memory affinity, plus a paravirtualized interface that tells the guest scheduler to group CPUs. If the guest scheduler runs free and is allowed to move threads randomly, without any paravirtualized interface that controls the CPU thread migration in the guest scheduler, the thread<->memory affinity on the host will be hopeless. But a paravirtualized interface that makes a guest thread stick to vcpu0/1/2/3 and not wander into vcpu4/5/6/7 will allow creating a more meaningful guest_thread<->physical_ram affinity on the host through KVM page faults. And then this will also work with VM migration, and without having to create a vtopology in the guest. And for apps running in the guest no paravirt will be needed, of course.

The reason paravirt would be needed for qemu-kvm with a fully automatic thread<->memory affinity is that the vcpu threads are magic. What runs in a vcpu thread are guest threads, and those can be moved by the guest CPU scheduler from vcpu0 to vcpu7. If that happens and we have 4 physical cpus per physical node, any affinity we measure on the host will be meaningless. Normal threads using NPTL won't behave like that. Maybe some other thread library could have a "scheduler" inside that would make it behave like a vcpu thread (one thread really with several threads inside), but those existed mostly to simulate multiple threads in a single thread, so they don't matter. And in this respect sys_tbind also requires the tid to have meaningful memory affinity. sys_tbind/mbind gets away with it by creating a vtopology in the guest, so the guest scheduler would then follow the vtopology (but the vtopology breaks across VM migration, and to really be followed well with sys_mbind/tbind it'd require all apps to be modified). Grouping guest threads to stick to some vcpus sounds immensely simpler than changing the whole guest vtopology at runtime, which would involve changing the memory layout too.

NOTE: the paravirt cpu grouping interface would also handle the case of 3 guests of 2.5G on an 8G host (4G per node). One of the three guests will have memory spanning two nodes (each 4G node can hold at most one whole 2.5G guest), and the guest vtopology created by sys_mbind/tbind can't handle that, while paravirt cpu grouping and automatic thread<->memory affinity on the host will handle it, just as it will handle VM migration across nodes with different physical topologies.

The problem is that to create a thread<->memory affinity we'll have to issue some page faults in KVM in the background. How harmful that is I don't know at this point. So the fully automatic thread<->memory affinity is a bit of a vapourware concept at this point (process<->memory affinity seems to work already, though). But Peter's migration code was driven by page faults already (not included in the patch he posted), and the other existing patch, called migrate-on-fault, also depended on page faults. So I am optimistic we could have thread<->memory affinity working too in the longer term. The plan would be to run the page faults at low frequency and only if we can't fit a process into one node (in terms of both number of threads and memory).
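To show what the grouping above amounts to mechanically (illustration only: the paravirt interface doesn't exist, and today this could only be done by hand inside the guest), sticking a guest thread to vcpu0/1/2/3 is, from the guest's point of view, just an affinity mask over the virtual cpus; a paravirt hook could set such a mask automatically instead of an admin hardcoding it:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

static int stick_to_vcpus_0_to_3(pid_t tid)
{
	cpu_set_t set;
	int vcpu;

	CPU_ZERO(&set);
	for (vcpu = 0; vcpu < 4; vcpu++)	/* vcpu0..vcpu3 only */
		CPU_SET(vcpu, &set);

	/* tid == 0 means "the calling thread" */
	return sched_setaffinity(tid, sizeof(set), &set);
}

int main(void)
{
	if (stick_to_vcpus_0_to_3(0)) {
		perror("sched_setaffinity");
		return 1;
	}
	/*
	 * From here on, the KVM page faults for this guest thread's
	 * memory only ever come from vcpu threads 0-3 on the host, so
	 * a host-side thread<->memory affinity measurement stays
	 * meaningful.
	 */
	return 0;
}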
If the process fits in one node, we wouldn't even need any page faults, and the information in the pagetables would be enough to make a good decision. The downside is that the thread<->memory affinity is significantly more difficult to implement, and that's why I'm focusing initially on the simpler case of considering only the process<->memory affinity. That's fairly easy.

So for the time being this incremental improvement may be justified; it moves the logic from a perl script into the kernel, but I'm just skeptical it provides a big advantage compared to the NUMA bindings we already have in the kernel, especially if in the long term we can get rid of a vtopology completely. The vtopology in the guest may seem appealing: it solves the problem when you use bindings everywhere (be they hard bindings, cpuset relative bindings, or the dynamic sys_mbind/tbind). But there is not much hope of altering the vtopology at runtime, so when a guest must be split across two nodes (3 VMs of 2.5G ram running on an 8G host with two 4G nodes) or migrated across different cloud nodes, I think the vtopology is trouble and is best avoided. The memory side of the vtopology is absolute trouble if it doesn't match the host physical topology exactly.
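As an aside on the "information in the pagetables" point: a rough userspace approximation of a process's per-node memory footprint is already visible through /proc/<pid>/numa_maps. Sketch only; the in-kernel process<->memory affinity logic described above would read the page tables directly rather than this file, and this just sums the N<node>=<pages> fields:

#include <stdio.h>
#include <string.h>

#define MAX_NODES 64

int main(int argc, char **argv)
{
	const char *pid = argc > 1 ? argv[1] : "self";
	char path[64], line[4096];
	long pages[MAX_NODES] = { 0 };
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/numa_maps", pid);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	/* each mapping line carries N<node>=<pages> fields; sum them up */
	while (fgets(line, sizeof(line), f)) {
		char *tok = strtok(line, " \n");
		while (tok) {
			int node;
			long n;

			if (sscanf(tok, "N%d=%ld", &node, &n) == 2 &&
			    node >= 0 && node < MAX_NODES)
				pages[node] += n;
			tok = strtok(NULL, " \n");
		}
	}
	fclose(f);

	for (int node = 0; node < MAX_NODES; node++)
		if (pages[node])
			printf("node %d: %ld pages\n", node, pages[node]);
	return 0;
}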