Re: [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA

Sam Bobroff <sam.bobroff@xxxxxxxxxxx> · Fri, 14 Oct 2016 11:52:22 +1100

On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
> On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
> >On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
> >>On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
> >>>At the moment, guests that are backed by hugepages in the host are
> >>>only able to use policy to control the placement of those hugepages
> >>>on a per-(guest-)CPU basis. Policy applied globally is ignored.
> >>>
> >>>Such guests would use <memoryBacking><hugepages/></memoryBacking> and
> >>>a <numatune> block with <memory mode=... nodeset=.../> but no <memnode
> >>>.../> elements.
> >>>
> >>>This patch corrects this by, in this specific case, changing the QEMU
> >>>command line from "-mem-prealloc -mem-path=..." (which cannot
> >>>specify NUMA policy) to "-object memory-backend-file ..." (which can).
> >>>
> >>>Note: This is not visible to the guest and does not appear to create
> >>>a migration incompatibility.
> >>>
> >>
> >>It could make sense, I haven't tried yet, though.  However, I still
> >>don't see the point in using memory-backend-file.  Is it that when you
> >>don't have cpuset cgroup the allocation doesn't work well?  Because it
> >>certainly does work for me.
> >
> >Thanks for taking a look at this :-)
> >
> >The point of using a memory-backend-file is that with it, the NUMA policy can
> >be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell
> >QEMU to apply NUMA policy in the right place. It does seem odd to me to use
> >memory-backend-file without attaching the backend to a guest NUMA node, but it
> >seems to do the right thing in this case. (If there are guest NUMA nodes, or if
> >hugepages aren't being used, policy is correctly applied.)
> >
> >I'll describe my test case in detail, perhaps there's something I don't understand
> >happening.
> >
> >* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages
> > on node 1, and none on node 0.
> >
> >* I create a 2G guest using virt-install:
> >
> >virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
> >
> >* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
> >
> > <numatune>
> >    <memory mode='strict' nodeset='0'/>
> > </numatune>
> >
> >* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satisfying the policy).
> >* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
> >
> >Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
> >                          Node 0          Node 1           Total
> >                 --------------- --------------- ---------------
> >Huge                         0.00         2048.00         2048.00
> >Heap                         8.12            0.00            8.12
> >Stack                        0.03            0.00            0.03
> >Private                     35.80            6.10           41.90
> >----------------  --------------- --------------- ---------------
> >Total                       43.95         2054.10         2098.05
> >
> >So it looks like it's allocated hugepages from node 1, isn't this violating the
> >policy I set via numatune?
> >
> 
> Oh, now I get it.  We are doing our best to apply that policy to qemu
> even when we don't have this option.  However, using this works even
> better (which is probably* what we want).  And that's the reasoning
> behind this.
> 
> * I'm saying probably because when I was adding numactl binding to be
>   used together with cgroups, I was told that we couldn't change the
>   binding afterwards and it's bad.  I feel like we could do something
>   with that and it would help us in the future, but there needs to be a
>   discussion, I guess.  Because I might be one of the few =)
> 
> So to recapitulate, there are three ways to affect the
> allocation of qemu's memory:
> 
> 1) numactl (libnuma): it works as expected, but cannot be changed later
> 
> 2) cgroups: so strict it has to be applied after qemu has started;
>    because of that it doesn't work right, especially for stuff that
>    gets all pre-allocated (like hugepages).  It can be changed later,
>    but it won't always mean the memory will migrate, so upon change
>    there is no guarantee.  If it's unavailable, we fall back to (1) anyway
> 
> 3) memory-backend-file's host-nodes=: this works as expected, but
>    cannot be used with older QEMUs, cannot be changed later and in some
>    cases (not your particular one) it might screw up migration if it
>    wasn't used before.
> 
> Selecting the best option from these, plus making the code work with
> every possibility (erroring out when you want to change the memory node
> and we had to use (1) for example) is a pain.  We should really think
> about that and reorganize these things for the better of the future.
> Otherwise we're going to overwhelm ourselves.  Cc'ing Peter to get
> his thoughts too, as he worked on some parts of this as well.
> 
> Martin

Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)

What do you mean by "changed later"? Do you mean, if the domain XML is changed
while the machine is running?

I did look at the libnuma and cgroups approaches, but I was concerned they
wouldn't work in this case, because of the way QEMU allocates memory when
mem-prealloc is used: the memory is allocated in the main process, before the
CPU threads are created. (This is based only on a bit of hacking and debugging
in QEMU, but it does seem to explain the behaviour I've seen so far.)

If this is the case, it would seem to be a significant problem: if policy is
set on the main thread, it will affect all allocations, not just the VCPU
memory, and if it's set on the VCPU threads it won't catch the pre-allocation at
all. (Is this what you were referring to by "it doesn't work right"?)
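Either way, the symptom is easy to check from the numastat output quoted
above. The following is only a rough Python sketch of such a check: the
function names are made up for illustration, and it assumes the
"numastat -p" layout shown in my test (a header row of "Node N" columns and
a row starting with "Huge").

```python
# Sample taken from the numastat output quoted earlier in this thread.
SAMPLE = """\
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                         0.00         2048.00         2048.00
Heap                         8.12            0.00            8.12
"""


def huge_mb_per_node(numastat_text):
    """Return {node_id: MB} taken from the 'Huge' row (hypothetical helper)."""
    nodes = []
    for line in numastat_text.splitlines():
        fields = line.split()
        if not nodes and "Node" in fields:
            # Header row: each "Node" token is followed by the node number.
            nodes = [int(fields[i + 1]) for i, f in enumerate(fields) if f == "Node"]
        elif fields and fields[0] == "Huge":
            # One value per node column; the trailing "Total" column is ignored.
            values = [float(v) for v in fields[1:1 + len(nodes)]]
            return dict(zip(nodes, values))
    raise ValueError("no 'Huge' row found")


def nodes_violating(numastat_text, allowed_nodes):
    """List nodes holding hugepage memory outside the allowed nodeset."""
    usage = huge_mb_per_node(numastat_text)
    return sorted(n for n, mb in usage.items() if mb > 0 and n not in allowed_nodes)


if __name__ == "__main__":
    # With nodeset='0' from the <numatune> block, node 1 is reported.
    print(nodes_violating(SAMPLE, {0}))
```

With the sample above and nodeset='0', node 1 is the only node reported,
which matches what I saw.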

That was my reasoning for trying to use the backend object in this case; it was
the only method that worked and did not require changes to QEMU. I'd prefer
the other approaches if they could be made to work.
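To make the backend-object idea concrete, here is a small Python sketch of
how the "-object memory-backend-file ..." argument might be assembled from
the <numatune> settings. It is an illustration only, not the code in the
patch: the function name, the backend id "ram-node0" and the hugepage mount
path are hypothetical, and the mode-to-policy mapping (strict -> bind) is my
understanding of how libvirt names map onto QEMU's policy values. It also
takes the nodeset as a single node or range string and does not show how
the backend is then attached to the guest.

```python
# Assumed mapping from libvirt <memory mode=...> to QEMU's policy= values.
MODE_TO_POLICY = {
    "strict": "bind",
    "preferred": "preferred",
    "interleave": "interleave",
}


def backend_object_arg(mem_path, size_bytes, nodeset, mode, backend_id="ram-node0"):
    """Build the two argv entries for a memory-backend-file object.

    backend_id and mem_path are placeholders chosen for this example.
    """
    policy = MODE_TO_POLICY[mode]  # KeyError for modes this sketch doesn't cover
    props = [
        "memory-backend-file",
        f"id={backend_id}",
        "prealloc=yes",
        f"mem-path={mem_path}",
        f"size={size_bytes}",
        f"host-nodes={nodeset}",
        f"policy={policy}",
    ]
    return ["-object", ",".join(props)]


if __name__ == "__main__":
    # 2G guest, hugepages restricted to host node 0, as in my test case.
    print(" ".join(backend_object_arg("/dev/hugepages", 2 * 1024**3, "0", "strict")))
```

The point is just that host-nodes= and policy= travel with the backend
itself, which is what -mem-path has no way to express.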

I think QEMU could be altered to move the preallocations into the VCPU
threads but it didn't seem trivial and I suspected the QEMU community would
point out that there was already a way to do it using backend objects.  Another
option would be to add a -host-nodes parameter to QEMU so that the policy can
be given without adding a memory backend object. (That seems like a more
reasonable change to QEMU.)

Cheers,
Sam.

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list