Once upon a time, there was a little domain. And the domain was pinned onto a NUMA node and hasn't fully allocated its memory: <memory unit='KiB'>2355200</memory> <currentMemory unit='KiB'>1048576</currentMemory> <numatune> <memory mode='strict' nodeset='0'/> </numatune> Oh little me, said the domain, what will I do with so few memory. If I only had a few megabytes more. But the old admin noticed whimpering, barely audible to untrained human ear. And good admin he was, he gave the domain yet more memory. But the old NUMA topology witch forbidden to allocate more memory on the node zero. So he decided to allocate it on a different node: virsh # numatune little_domain --nodeset 0-1 virsh # setmem little_domain 2355200 The little domain was happy. For a while. Until bad, sharp teeth shaped creature came. Every process in the system was afraid of him. The OOM Killer they called him. Oh no, he's after the little domain. There's no escape. Do you kids know why? Because when the little domain was born, her father, Libvirt, called numa_set_membind(). So even if the admin allowed her to allocate memory from other nodes in the cgroups, the membind() forbid it. So what's the lesson? Libvirt should rely on cgroups, whenever possible and use numa_set_membind() as the last ditch effort. Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx> --- src/qemu/qemu_process.c | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 79f763e..cba042d 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3154,6 +3154,7 @@ static int qemuProcessHook(void *data) int fd; virBitmapPtr nodeset = NULL; virDomainNumatuneMemMode mode; + bool doNuma = true; /* This method cannot use any mutexes, which are not * protected across fork() @@ -3185,11 +3186,34 @@ static int qemuProcessHook(void *data) goto cleanup; mode = virDomainNumatuneGetMode(h->vm->def->numa, -1); - nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa, - priv->autoNodeset, -1); - if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) - goto cleanup; + if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + virCgroupPtr cgroup = NULL; + + /* Create dummy cgroup ... */ + if (virCgroupNewSelf(&cgroup) < 0) + goto cleanup; + + /* ... just to detect if cpuset cgroup is present */ + if (virCgroupHasController(cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) { + /* Because if it's not, we must use virNuma* APIs to bind + * memory onto desired nodes. CGroup way is preferred, as + * it allows runtime tuning, while virNuma - well, once + * set and child (qemu) is exec()-ed, we can't do + * anything about the settings. virNuma* does not take + * any PID argument after all. */ + doNuma = false; + } + virCgroupFree(&cgroup); + } + + if (doNuma) { + nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa, + priv->autoNodeset, -1); + + if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) + goto cleanup; + } ret = 0; -- 2.0.5 -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list