On Thu, 14 Jan 2016 11:57:44 +0000 "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote: > On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote: > > Reposting my cgroup fixes series: > > > > http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html > > > > partially because I originally forgot to CC the author (Henning > > Schild) of the original series for which these patch fix a couple > > of issues discovered during regression testing (virt-test memtune > > failures in Red Hat regression environment), but also to bring them > > up to date with the top of libvirt git. > > > > NB: I did send Henning the changes after the fact, but my resend > > using the same message-id skills so that replies are left in the > > onlist series are lacking. Henning has looked at the first patch - > > with a response here: > > > > http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html > > > > Finally, I think these changes should go into 1.3.1 since that's > > when the regression was introduced. > > Since this has been puzzelling us for a while, let me recap on the > cgroup setup in general. > > First, I'll describe how it used to work *before* Henning's patches > were merged, on a systemd based host. > > - The QEMU driver forks a child process, but does *not* exec QEMU yet > The cgroup placement at this point is inherited from libvirtd. It > may look like this: > > 10:freezer:/ > 9:cpuset:/ > 8:perf_event:/ > 7:hugetlb:/ > 6:blkio:/system.slice > 5:memory:/system.slice > 4:net_cls,net_prio:/ > 3:devices:/system.slice/libvirtd.service > 2:cpu,cpuacct:/system.slice > 1:name=systemd:/system.slice/libvirtd.service > > - The QEMU driver calls virCgroupNewMachine() > > - We calll virSystemdCreateMachine with pidleader=$child > > - Systemd creates the initial machine scope unit under > the machine slice unit, for the "systemd" controller. > It may also add the PID to *zero* or more other > resource controllers. So at this point the cgroup > placement may look like this: > > 10:freezer:/ > 9:cpuset:/ > 8:perf_event:/ > 7:hugetlb:/ > 6:blkio:/ > 5:memory:/ > 4:net_cls,net_prio:/ > 3:devices:/ > 2:cpu,cpuacct:/ > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope > > Or may look like this: > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope > 5:memory:/machine.slice/machine-qemu\x2dserial.scope > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope > 3:devices:/machine.slice/machine-qemu\x2dserial.scope > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope > > Or anywhere in between. We have *ZERO* guarantee about > what other resource controllers we may have been placed in by > systemd. There is some fairly complex logic that > determines this, based on what other tasks current exist in sibling > cgroups, and what tasks have *previously* existed in the > cgroups. IOW, you should consider the list of etra > resource controllers essentially non-deterministic > > - We call virCgroupAddTask with pid=$child > > This places the pid in any resource controllers we need, which > systemd has not already setup. IOW, it guarantees that we now > have placement that should look like this, regardless of what > systemd has done: > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope > 5:memory:/machine.slice/machine-qemu\x2dserial.scope > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope > 3:devices:/machine.slice/machine-qemu\x2dserial.scope > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope > > - The QEMU driver now lets the child process exec QEMU. QEMU creates > its vCPU threads at this point. All QEMU threads (emulator, vcpu > and I/O threads) now have the cgroup placement shown above. > > - We create the emulator cgroup for the cpuset, cpu, cpuacct > controllers move all threads into this new cgroup. All threads > (emulator, vcpu and I/O threads) thus now have placement of: > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope > 5:memory:/machine.slice/machine-qemu\x2dserial.scope > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope > 3:devices:/machine.slice/machine-qemu\x2dserial.scope > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope > > Yes, we really did move the vcpu threads into the emulator group... > > - We now ask QEMU which are the vCPU & I/O threads. > > - Foreach CPU thread we new vCPU cgroups and move them into this > place > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope > 5:memory:/machine.slice/machine-qemu\x2dserial.scope > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope > 3:devices:/machine.slice/machine-qemu\x2dserial.scope > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope > > - Foreach I/O thread we new vCPU cgroups and move them into this > place > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope > 5:memory:/machine.slice/machine-qemu\x2dserial.scope > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope > 3:devices:/machine.slice/machine-qemu\x2dserial.scope > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope > > > > Now, lets Henning's three patches > > > commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 > Author: Henning Schild <henning.schild@xxxxxxxxxxx> > Date: Wed Dec 9 17:58:10 2015 +0100 > > util: cgroups do not implicitly add task to new machine cgroup > > This alters virCgroupNewMachine so that it no longer calls > virCgroupAddTask. instead the callers are responsible for doing it. > This is a functional no-opp change, so ostensibly harmless at this > point. > > The second patch > > commit a41c00b472efaa192d2deae51ab732e65903238f > Author: Henning Schild <henning.schild@xxxxxxxxxxx> > Date: Mon Dec 14 15:48:05 2015 -0500 > > qemu: do not put a task into machine cgroup > > This stops qemuInitCgroup from calling virCgroupAddTask. Instead it > has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(), > which only affects the cpu, cpuset & cpuacct controllers. > > This means we no longer have any guarantee about which resource > controllers the QEMU processes in general are in. All we can say is > that they are in the 'systemd' controller, and the cpu, cpuset & > cpuacct controllers. Systemd may have also placed it into *all* the > other resource controllers, or it may have placed it into none of the > them. > > As such, after this change, we are potentially not in the correct > cgroup for the memory, blkio, netcls, devices controllers. > > For added fun, the qemuSetupCgroupForEmulator() still has a call to > virCgroupMoveTask, so the addition of virCgroupAddTask to this method > was useless. > > > The final patch > > commit 90b721e43ec9232b5b218e891437bed04548e841 > Author: Henning Schild <henning.schild@xxxxxxxxxxx> > Date: Mon Dec 14 15:58:05 2015 -0500 > > qemu cgroups: move new threads to new cgroup after cpuset is set > up > > This delays the point at which we call virCgroupAddTask for the > vCPU and I/O thread cgroups, until after we have configured properties > on those cgroups. This change is fine > > > So I think we need to revert the first 2 of Hennings patches - the 2nd > one is the real serious problem, but once we revert that, the 1st > change becomes pointless and so should also be reverted. > > > Going back to the key problem Henning was trying to address.... > > The issue is that we only setup the emulator cgroup /after/ QEMU has > already started running, so there's a period of time where threads are > not confined by the cgroup affinity. > > I think we can solve that more simply by just moving the call to > qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we > place the emulator thread into the correct cgroup *before* we even > exec QEMU. So we never have the window where we run in the unconfined > cpuset controller. Thanks for that very thorough analysis and description, very helpful indeed! I fully agree with your conclusion and suggestion on how the code should be changed. Henning -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list