At the moment, guests that are backed by hugepages in the host are only
able to use policy to control the placement of those hugepages on a
per-(guest-)CPU basis; policy applied globally is ignored. Such guests
would use <memoryBacking><hugepages/></memoryBacking> and a <numatune>
block with <memory mode=... nodeset=.../> but no <memnode .../>
elements.

This patch corrects this by, in this specific case, changing the QEMU
command line from "-mem-prealloc -mem-path=..." (which cannot specify
NUMA policy) to "-object memory-backend-file ..." (which can).

Note: this is not visible to the guest and does not appear to create a
migration incompatibility.

Signed-off-by: Sam Bobroff <sam.bobroff@xxxxxxxxxxx>
---
Hello libvirt community,

This patch is an RFC of an attempt to understand and fix a problem with
NUMA memory placement policy being ignored in a specific situation.

Previous discussion on this issue:

https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
https://www.redhat.com/archives/libvir-list/2016-October/msg00514.html

(Sorry, this is a long description: the situation is complicated, there
are several ways of approaching it, and I need to explain my reasoning
for choosing this specific method.)

The issue I'm trying to fix:

The problem is apparent when using this combination of features:

* The guest is backed by hugepages in the host.
* The backing memory is constrained by NUMA memory placement policy.
* The guest either does not use NUMA nodes at all, or does but does not
  specify a per-node policy.

The guest XML would therefore contain blocks similar to the following:

  <memoryBacking><hugepages/></memoryBacking>
  <numatune><memory mode='strict' nodeset='0'/></numatune>

(And does not use <numatune><memnode .../> elements. Policy works
correctly without hugepages, or when memnode is used.)

What happens is that the guest runs successfully, but the NUMA memory
placement policy is ignored, allowing memory to be allocated on any
(host) node.

Attempted solutions:

The most obvious approach to me is a trivial patch to
qemuBuildMemoryBackendStr(). The test that checks whether a
memory-backend is necessary (around line 3332 in qemu_command.c) is
this:

    /* If none of the following is requested... */
    if (!needHugepage && !userNodeset && !memAccess && !nodeSpecified && !force) {
        /* report back that using the new backend is not necessary
         * to achieve the desired configuration */
        ret = 1;
    } else {

This test does not consider the nodemask, so when no other reason
exists to use a memory-backend (which is the case in point) the policy
is lost. Adding nodemask to this test appears to fix the issue easily.
However:

* It only works when the guest uses NUMA nodes (because otherwise the
  whole function is skipped at a higher level).
* When the guest does use NUMA nodes, it causes a memory-backend object
  to be added to each one (when they did not previously have one), and
  this causes migration incompatibility.

So it seems worthwhile to look for other options.

Other ways of controlling NUMA memory placement are numactl (libnuma)
or cgroups, but in practice neither can work in this situation due to
the interaction of several features:

* numactl and cgroups can only apply policy to all memory allocated by
  that thread (see the sketch after this list).
* When a guest is backed by hugepages, memory needs to be preallocated.
* QEMU performs memory pre-allocation from its main thread, not the
  VCPU threads.
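(To illustrate the first point with a minimal, self-contained sketch --
this is not code from libvirt or from this patch, just the mechanism
numactl relies on: set_mempolicy(2) changes the calling thread's
default policy for everything it subsequently allocates, so it cannot
be scoped to guest RAM only. The file name and compile line are only
examples.)

    /* Illustrative only; compile with: gcc -o bindall bindall.c -lnuma */
    #include <stdio.h>
    #include <numaif.h>                     /* set_mempolicy(), MPOL_BIND */

    int main(void)
    {
        unsigned long nodemask = 1UL << 0;  /* host node 0 only */

        /* From this point on, every page this thread faults in for the
         * first time -- guest RAM, heap, everything -- is restricted to
         * node 0; the policy cannot be limited to one mapping. */
        if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) < 0) {
            perror("set_mempolicy");
            return 1;
        }
        return 0;
    }

(mbind(2) can scope a policy to a particular mapping, but only within
the calling process, so as far as I can tell libvirt cannot use it on
QEMU's guest RAM from the outside.)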
This seems to mean that policy applied to the QEMU VCPU threads does
not apply to pre-allocated memory, and policy applied to the QEMU main
thread incorrectly applies to all of QEMU's allocations, not just the
VCPU ones. My testing seems to confirm this. If that is true, then
neither of those approaches can work (at least, not without changes to
QEMU).

At this point it appeared that the only ways to proceed would all
require a migration-breaking change, so I started to look at QEMU to
see if that could be worked around somehow.

... another option:

While investigating QEMU I discovered that it is possible to use a QEMU
memory-backend object without attaching it to any guest NUMA node, and
when it is used that way it does not break migration. This is (to me at
least) a new way of using a memory-backend object, but it appears to
directly replace -mem-path and -mem-prealloc and additionally allow
NUMA memory placement policy to be specified. (The ID field is still
required even though it is not used.)

For example, this...

  -m 4G -mem-path /dev/hugepages -mem-prealloc

... seems to be equivalent to this:

  -m 4G -object memory-backend-file,id=xxx,prealloc=yes,path=/dev/hugepages,size=4G

But this is now possible as well:

  -m 4G -object memory-backend-file,id=xxx,prealloc=yes,path=/dev/hugepages,host-nodes=0-1,size=4G

... so it seems to be a way of solving this issue without breaking
migration. Additionally, there seems to be code specifically in QEMU to
prevent the migration data from changing in this case, which leads me
to believe that this is an intended use of the option, but I haven't
directly consulted with the QEMU community.

Implementation:

My patch is based on this last case. It is intended to work by
switching the "-mem-path" style argument to an un-attached
memory-backend-file, which is why the change is done within
qemuBuildMemPathStr().

In the case where QEMU doesn't support the needed option, I've opted to
raise a warning rather than an error, because existing guests may have
this configuration, and although their policy is being ignored they do
at least run, and I assume it's important to maintain this.

I realize that this policy can't be changed at run-time by libvirt,
unlike policy set by cgroups or libnuma, but memory-backend objects are
already in use in more common situations and already have this
limitation, so I don't think this is adding significantly to that
problem.

Questions:

So does this line of reasoning make sense? Is this a good way to fix
it? Are there better solutions?

Are there concerns with using memory-backend in this way: do I need to
consult with the QEMU community before a libvirt patch that uses it
could be accepted?

Should the warning be an error?

It seemed safest to me to only use the backend object when necessary,
but there doesn't seem to be a reason to avoid using it whenever it is
available. Is there one? Should I have done that instead?

Thanks for reading!
Sam.

v2: Incorporated review feedback from Peter Krempa <pkrempa@xxxxxxxxxx>
and Martin Kletzander <mkletzan@xxxxxxxxxx>.
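(For anyone who wants to reproduce this: an illustrative configuration
that takes the new code path -- the size, nodeset and hugepage path
below are only examples, not values taken from the patch -- is a
hugepage-backed guest with a global placement policy, e.g.:

  <memory unit='KiB'>4194304</memory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
  </numatune>

and either no guest NUMA cells at all, or cells without any <memnode>
elements. With this patch, and a QEMU that supports memory-backend-file,
I would expect libvirt to emit something along the lines of

  -object memory-backend-file,id=mem,prealloc=yes,path=<hugepage path>,host-nodes=0-1,policy=bind,size=4294967296

instead of "-mem-prealloc -mem-path <hugepage path>".)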
 src/qemu/qemu_command.c | 56 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 405e118..0dea9fd 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -7105,12 +7105,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
 
 static int
 qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg,
-                    const virDomainDef *def,
+                    virDomainDefPtr def,
                     virQEMUCapsPtr qemuCaps,
-                    virCommandPtr cmd)
+                    virCommandPtr cmd,
+                    virBitmapPtr auto_nodeset)
 {
     const long system_page_size = virGetSystemPageSizeKB();
     char *mem_path = NULL;
+    virBitmapPtr nodemask = NULL;
+    const char *backendType = NULL;
+    char *backendStr = NULL;
+    virJSONValuePtr props = NULL;
+    int rv = -1;
 
     /*
      * No-op if hugepages were not requested.
@@ -7135,18 +7141,50 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg,
     if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0)
         return -1;
 
-    virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL);
+    if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset,
+                                         &nodemask, -1) < 0)
+        return -1;
+    if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) {
+        if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def),
+                                      0, -1, NULL, auto_nodeset,
+                                      def, qemuCaps, cfg, &backendType,
+                                      &props, false) < 0)
+            goto cleanup;
+        /* The memory-backend object created here is not going to be
+         * attached anywhere and there is only one so the ID can be hard coded.
+         * It acts like -mem-prealloc -mem-path ... but allows policy to be
+         * set. */
+        if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType,
+                                                                 "mem",
+                                                                 props)))
+            goto cleanup;
+        virCommandAddArgList(cmd, "-object", backendStr, NULL);
+        rv = 0;
+ cleanup:
+        virJSONValueFree(props);
+        VIR_FREE(backendStr);
+    } else {
+        if (nodemask)
+            /* Ideally this would be an error, but if it was it would
+             * break existing guests already using this configuration. */
+            VIR_WARN("Memory file backend objects are "
+                     "unsupported by QEMU binary. Global NUMA "
+                     "hugepage policy will be ignored.");
+        virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL);
+        rv = 0;
+    }
 
     VIR_FREE(mem_path);
-    return 0;
+    return rv;
 }
 
 
 static int
 qemuBuildMemCommandLine(virCommandPtr cmd,
                         virQEMUDriverConfigPtr cfg,
-                        const virDomainDef *def,
-                        virQEMUCapsPtr qemuCaps)
+                        virDomainDefPtr def,
+                        virQEMUCapsPtr qemuCaps,
+                        virBitmapPtr auto_nodeset)
 {
     if (qemuDomainDefValidateMemoryHotplug(def, qemuCaps, NULL) < 0)
         return -1;
@@ -7170,7 +7208,7 @@ qemuBuildMemCommandLine(virCommandPtr cmd,
      * there is no numa node specified.
      */
     if (!virDomainNumaGetNodeCount(def->numa) &&
-        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd) < 0)
+        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd, auto_nodeset) < 0)
         return -1;
 
     if (def->mem.locked && !virQEMUCapsGet(qemuCaps, QEMU_CAPS_REALTIME_MLOCK)) {
@@ -7307,7 +7345,7 @@ qemuBuildNumaArgStr(virQEMUDriverConfigPtr cfg,
     }
 
     if (!needBackend &&
-        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd) < 0)
+        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd, auto_nodeset) < 0)
         goto cleanup;
 
     for (i = 0; i < ncells; i++) {
@@ -9385,7 +9423,7 @@ qemuBuildCommandLine(virQEMUDriverPtr driver,
     if (!migrateURI && !snapshot &&
         qemuDomainAlignMemorySizes(def) < 0)
         goto error;
-    if (qemuBuildMemCommandLine(cmd, cfg, def, qemuCaps) < 0)
+    if (qemuBuildMemCommandLine(cmd, cfg, def, qemuCaps, nodeset) < 0)
         goto error;
 
     if (qemuBuildSmpCommandLine(cmd, def) < 0)
-- 
2.10.0.297.gf6727b0

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list