On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote: > Reposting my cgroup fixes series: > > http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html > > partially because I originally forgot to CC the author (Henning Schild) > of the original series for which these patch fix a couple of issues > discovered during regression testing (virt-test memtune failures in > Red Hat regression environment), but also to bring them up to date > with the top of libvirt git. > > NB: I did send Henning the changes after the fact, but my resend using > the same message-id skills so that replies are left in the onlist series > are lacking. Henning has looked at the first patch - with a response here: > > http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html > > Finally, I think these changes should go into 1.3.1 since that's when the > regression was introduced. Since this has been puzzelling us for a while, let me recap on the cgroup setup in general. First, I'll describe how it used to work *before* Henning's patches were merged, on a systemd based host. - The QEMU driver forks a child process, but does *not* exec QEMU yet The cgroup placement at this point is inherited from libvirtd. It may look like this: 10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/system.slice 5:memory:/system.slice 4:net_cls,net_prio:/ 3:devices:/system.slice/libvirtd.service 2:cpu,cpuacct:/system.slice 1:name=systemd:/system.slice/libvirtd.service - The QEMU driver calls virCgroupNewMachine() - We calll virSystemdCreateMachine with pidleader=$child - Systemd creates the initial machine scope unit under the machine slice unit, for the "systemd" controller. It may also add the PID to *zero* or more other resource controllers. So at this point the cgroup placement may look like this: 10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/ 5:memory:/ 4:net_cls,net_prio:/ 3:devices:/ 2:cpu,cpuacct:/ 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Or may look like this: 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Or anywhere in between. We have *ZERO* guarantee about what other resource controllers we may have been placed in by systemd. There is some fairly complex logic that determines this, based on what other tasks current exist in sibling cgroups, and what tasks have *previously* existed in the cgroups. IOW, you should consider the list of etra resource controllers essentially non-deterministic - We call virCgroupAddTask with pid=$child This places the pid in any resource controllers we need, which systemd has not already setup. IOW, it guarantees that we now have placement that should look like this, regardless of what systemd has done: 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope - The QEMU driver now lets the child process exec QEMU. QEMU creates its vCPU threads at this point. All QEMU threads (emulator, vcpu and I/O threads) now have the cgroup placement shown above. - We create the emulator cgroup for the cpuset, cpu, cpuacct controllers move all threads into this new cgroup. All threads (emulator, vcpu and I/O threads) thus now have placement of: 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Yes, we really did move the vcpu threads into the emulator group... - We now ask QEMU which are the vCPU & I/O threads. - Foreach CPU thread we new vCPU cgroups and move them into this place 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope - Foreach I/O thread we new vCPU cgroups and move them into this place 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Now, lets Henning's three patches commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 Author: Henning Schild <henning.schild@xxxxxxxxxxx> Date: Wed Dec 9 17:58:10 2015 +0100 util: cgroups do not implicitly add task to new machine cgroup This alters virCgroupNewMachine so that it no longer calls virCgroupAddTask. instead the callers are responsible for doing it. This is a functional no-opp change, so ostensibly harmless at this point. The second patch commit a41c00b472efaa192d2deae51ab732e65903238f Author: Henning Schild <henning.schild@xxxxxxxxxxx> Date: Mon Dec 14 15:48:05 2015 -0500 qemu: do not put a task into machine cgroup This stops qemuInitCgroup from calling virCgroupAddTask. Instead it has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(), which only affects the cpu, cpuset & cpuacct controllers. This means we no longer have any guarantee about which resource controllers the QEMU processes in general are in. All we can say is that they are in the 'systemd' controller, and the cpu, cpuset & cpuacct controllers. Systemd may have also placed it into *all* the other resource controllers, or it may have placed it into none of the them. As such, after this change, we are potentially not in the correct cgroup for the memory, blkio, netcls, devices controllers. For added fun, the qemuSetupCgroupForEmulator() still has a call to virCgroupMoveTask, so the addition of virCgroupAddTask to this method was useless. The final patch commit 90b721e43ec9232b5b218e891437bed04548e841 Author: Henning Schild <henning.schild@xxxxxxxxxxx> Date: Mon Dec 14 15:58:05 2015 -0500 qemu cgroups: move new threads to new cgroup after cpuset is set up This delays the point at which we call virCgroupAddTask for the vCPU and I/O thread cgroups, until after we have configured properties on those cgroups. This change is fine So I think we need to revert the first 2 of Hennings patches - the 2nd one is the real serious problem, but once we revert that, the 1st change becomes pointless and so should also be reverted. Going back to the key problem Henning was trying to address.... The issue is that we only setup the emulator cgroup /after/ QEMU has already started running, so there's a period of time where threads are not confined by the cgroup affinity. I think we can solve that more simply by just moving the call to qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we place the emulator thread into the correct cgroup *before* we even exec QEMU. So we never have the window where we run in the unconfined cpuset controller. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list