On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
> On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
> >On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
> >>On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
> >>>At the moment, guests that are backed by hugepages in the host are
> >>>only able to use policy to control the placement of those hugepages
> >>>on a per-(guest-)CPU basis. Policy applied globally is ignored.
> >>>
> >>>Such guests would use <memoryBacking><hugepages/></memoryBacking> and
> >>>a <numatune> block with <memory mode=... nodeset=.../> but no <memnode
> >>>.../> elements.
> >>>
> >>>This patch corrects this by, in this specific case, changing the QEMU
> >>>command line from "-mem-prealloc -mem-path=..." (which cannot
> >>>specify NUMA policy) to "-object memory-backend-file ..." (which can).
> >>>
> >>>Note: This is not visible to the guest and does not appear to create
> >>>a migration incompatibility.
> >>>
> >>
> >>It could make sense, I haven't tried yet, though. However, I still
> >>don't see the point in using memory-backend-file. Is it that when you
> >>don't have cpuset cgroup the allocation doesn't work well? Because it
> >>certainly does work for me.
> >
> >Thanks for taking a look at this :-)
> >
> >The point of using a memory-backend-file is that with it, the NUMA policy can
> >be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell
> >QEMU to apply NUMA policy in the right place. It does seem odd to me to use
> >memory-backend-file without attaching the backend to a guest NUMA node, but it
> >seems to do the right thing in this case. (If there are guest NUMA nodes, or if
> >hugepages aren't being used, policy is correctly applied.)
> >
> >I'll describe my test case in detail, perhaps there's something I don't understand
> >happening.
> >
> >* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages
> >  on node 1, and none on node 0.
> >
> >* I create a 2G guest using virt-install:
> >
> >virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
> >
> >* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
> >
> >  <numatune>
> >    <memory mode='strict' nodeset='0'/>
> >  </numatune>
> >
> >* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satisfying the policy).
> >* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
> >
> >Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
> >                           Node 0          Node 1           Total
> >                  --------------- --------------- ---------------
> >Huge                         0.00         2048.00         2048.00
> >Heap                         8.12            0.00            8.12
> >Stack                        0.03            0.00            0.03
> >Private                     35.80            6.10           41.90
> >----------------  --------------- --------------- ---------------
> >Total                       43.95         2054.10         2098.05
> >
> >So it looks like it's allocated hugepages from node 1, isn't this violating the
> >policy I set via numatune?
> >
>
> Oh, now I get it. We are doing our best to apply that policy to qemu
> even when we don't have this option. However, using this works even
> better (which is probably* what we want). And that's the reasoning
> behind this.
>
> * I'm saying probably because when I was adding numactl binding to be
>   used together with cgroups, I was told that we couldn't change the
>   binding afterwards and it's bad. I feel like we could do something
>   with that and it would help us in the future, but there needs to be a
>   discussion, I guess.
>   Because I might be one of the few =)
>
> So to recapitulate that, there are three options how to affect the
> allocation of qemu's memory:
>
> 1) numactl (libnuma): it works as expected, but cannot be changed later
>
> 2) cgroups: so strict it has to be applied after qemu started, due to
>    that it doesn't work right, especially for stuff that gets all
>    pre-allocated (like hugepages). It can be changed later, but it
>    won't always mean the memory will migrate, so upon change there is
>    no guarantee. If it's unavailable, we fall back to (1) anyway
>
> 3) memory-backend-file's host-nodes=: this works as expected, but
>    cannot be used with older QEMUs, cannot be changed later and in some
>    cases (not your particular one) it might screw up migration if it
>    wasn't used before.
>
> Selecting the best option from these, plus making the code work with
> every possibility (erroring out when you want to change the memory node
> and we had to use (1) for example) is a pain. We should really think
> about that and reorganize these things for the better of the future.
> Otherwise we're going to overwhelm ourselves. Cc'ing Peter to get
> his thoughts as well as he worked on some parts of this as well.
>
> Martin

Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)

What do you mean by "changed later"? Do you mean, if the domain XML is changed
while the machine is running?

I did look at the libnuma and cgroups approaches, but I was concerned they
wouldn't work in this case, because of the way QEMU allocates memory when
mem-prealloc is used: the memory is allocated in the main process, before the
CPU threads are created. (This is based only on a bit of hacking and debugging
in QEMU, but it does seem to explain the behaviour I've seen so far.)
If this is the case, it would seem to be a significant problem: if policy is
set on the main thread, it will affect all allocations, not just the VCPU
memory, and if it's set on the VCPU threads it won't catch the pre-allocation
at all. (Is this what you were referring to by "it doesn't work right"?)

That was my reasoning for trying to use the backend object in this case; it was
the only method that worked and did not require changes to QEMU. I'd prefer the
other approaches if they could be made to work.

I think QEMU could be altered to move the preallocations into the VCPU threads,
but it didn't seem trivial and I suspected the QEMU community would point out
that there was already a way to do it using backend objects. Another option
would be to add a -host-nodes parameter to QEMU so that the policy can be given
without adding a memory backend object. (That seems like a more reasonable
change to QEMU.)

Cheers,
Sam.
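
For reference, the difference between the two command-line forms discussed
in this thread can be sketched roughly as below. This is only an
illustration, not the code from the patch: the function name, the object id
"ram-hugepages" and the /dev/hugepages mount point are made up for the
example, and mapping numatune mode='strict' to policy=bind is an assumption
here. How the backend gets attached to the guest (if at all) is left out.

```python
def qemu_hugepage_args(mem_path, size_mb, host_nodes=None, policy="bind"):
    """Build the QEMU arguments for hugepage-backed guest memory.

    Hypothetical helper for illustration only.

    Without host_nodes, return the legacy "-mem-prealloc -mem-path" form,
    which has no field that could carry a NUMA policy. With host_nodes,
    return a memory-backend-file object, whose host-nodes= and policy=
    properties let QEMU apply the policy to the backing memory itself.
    """
    if not host_nodes:
        return ["-mem-prealloc", "-mem-path", mem_path]

    props = [
        "memory-backend-file",
        "id=ram-hugepages",          # hypothetical object id
        "prealloc=yes",
        f"mem-path={mem_path}",
        f"size={size_mb}M",
        f"host-nodes={host_nodes}",  # e.g. "0" for nodeset='0'
        f"policy={policy}",          # assumed: mode='strict' -> bind
    ]
    return ["-object", ",".join(props)]


# The 2G guest from the test case, without and with a global nodeset.
print(" ".join(qemu_hugepage_args("/dev/hugepages", 2048)))
print(" ".join(qemu_hugepage_args("/dev/hugepages", 2048, host_nodes="0")))
```

In the test case above (nodeset='0', all hugepages on node 1), the second
form is the one that would give QEMU enough information to refuse the
allocation instead of silently taking the pages from node 1.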