On Thu, 14 Jan 2016 12:37:18 +0000
"Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote:

> On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> > Since this has been puzzling us for a while, let me recap on the
> > cgroup setup in general.
> >
> > First, I'll describe how it used to work *before* Henning's patches
> > were merged, on a systemd based host.
> >
> >  - The QEMU driver forks a child process, but does *not* exec QEMU
> >    yet. The cgroup placement at this point is inherited from libvirtd.
> >    It may look like this:
> >
> >      10:freezer:/
> >      9:cpuset:/
> >      8:perf_event:/
> >      7:hugetlb:/
> >      6:blkio:/system.slice
> >      5:memory:/system.slice
> >      4:net_cls,net_prio:/
> >      3:devices:/system.slice/libvirtd.service
> >      2:cpu,cpuacct:/system.slice
> >      1:name=systemd:/system.slice/libvirtd.service
> >
> >  - The QEMU driver calls virCgroupNewMachine()
> >
> >      - We call virSystemdCreateMachine with pidleader=$child
> >
> >      - Systemd creates the initial machine scope unit under
> >        the machine slice unit, for the "systemd" controller.
> >        It may also add the PID to *zero* or more other
> >        resource controllers.
> >        So at this point the cgroup
> >        placement may look like this:
> >
> >          10:freezer:/
> >          9:cpuset:/
> >          8:perf_event:/
> >          7:hugetlb:/
> >          6:blkio:/
> >          5:memory:/
> >          4:net_cls,net_prio:/
> >          3:devices:/
> >          2:cpu,cpuacct:/
> >          1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> >        Or may look like this:
> >
> >          10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >          9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> >          8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >          7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >          6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >          5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >          4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >          3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >          2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> >          1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> >        Or anywhere in between. We have *ZERO* guarantee about
> >        what other resource controllers we may have been placed in by
> >        systemd. There is some fairly complex logic that
> >        determines this, based on what other tasks currently exist
> >        in sibling cgroups, and what tasks have *previously* existed
> >        in the cgroups. IOW, you should consider the list of extra
> >        resource controllers essentially non-deterministic.
> >
> >      - We call virCgroupAddTask with pid=$child
> >
> >        This places the pid in any resource controllers we need,
> >        which systemd has not already set up.
> >        IOW, it guarantees that we now
> >        have placement that should look like this, regardless of
> >        what systemd has done:
> >
> >          10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >          9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> >          8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >          7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >          6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >          5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >          4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >          3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >          2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> >          1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> >  - The QEMU driver now lets the child process exec QEMU. QEMU
> >    creates its vCPU threads at this point. All QEMU threads (emulator,
> >    vcpu and I/O threads) now have the cgroup placement shown above.
> >
> >  - We create the emulator cgroup for the cpuset, cpu, cpuacct
> >    controllers and move all threads into this new cgroup. All threads
> >    (emulator, vcpu and I/O threads) thus now have placement of:
> >
> >      10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >      9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> >      8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >      7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >      6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >      5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >      4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >      3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >      2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> >      1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> >    Yes, we really did move the vcpu threads into the emulator
> >    group...
> >
> >  - We now ask QEMU which are the vCPU & I/O threads.
> >
> >  - For each vCPU thread we create a new vCPU cgroup and move the
> >    thread into this place:
> >
> >      10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >      9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> >      8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >      7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >      6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >      5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >      4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >      3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >      2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> >      1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> >  - For each I/O thread we create a new iothread cgroup and move the
> >    thread into this place:
> >
> >      10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >      9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> >      8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >      7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >      6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >      5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >      4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >      3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >      2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> >      1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> BTW, on a slight tangent, the kernel is throwing a spanner in the
> works in the near future. They have just accepted cgroupv2 into
> mainline. Broadly speaking this is very nice because they got rid
> of the idea of a separate mount point for each controller, and instead
> have a single filesystem tree. The problem is that they decided the
> granularity of placement is at a *process* level, not a *thread*
> level. So it will no longer be possible for us to have the cgroups
> for emulator, vcpus & i/o threads.
> Everything will have to live in
> the same cgroup :-( For cpu accounting and cpu affinity I think we
> can still achieve what we need by using a combination of cgroups
> and sched_setaffinity and /proc. I'm not sure what we'll do about
> per-thread scheduler policies for period + quota though - not sure
> if there's an API for setting those or not ?!?!
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt

Good to know. Do you have that on the agenda for libvirt? I guess
eventually v1 will get deprecated...

> Regards,
> Daniel

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
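
For anyone who wants to check which of the placement states quoted above a
given process or thread is actually in, the listings are what
/proc/<pid>/cgroup (or /proc/<pid>/task/<tid>/cgroup for a single thread)
prints. Below is only a minimal Python sketch, not libvirt code: the helper
names parse_proc_cgroup and controllers_outside are made up for illustration,
and it assumes the cgroup v1 "id:controllers:path" line format shown in the
listings.

```python
from typing import Dict, Iterable, List

# Sample taken from the "only the systemd controller was set up" listing
# quoted above. Raw string so that "\x2d" stays a literal backslash escape.
SAMPLE = r"""
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/
5:memory:/
4:net_cls,net_prio:/
3:devices:/
2:cpu,cpuacct:/
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
"""

SCOPE = r"/machine.slice/machine-qemu\x2dserial.scope"


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each controller name to the cgroup path it is placed in."""
    placement: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so that a ':' inside the path is preserved.
        _hierarchy_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers such as "cpu,cpuacct" share one entry.
        for controller in controllers.split(","):
            placement[controller] = path
    return placement


def controllers_outside(placement: Dict[str, str], scope: str,
                        wanted: Iterable[str]) -> List[str]:
    """Return the wanted controllers that are not placed at or below scope."""
    outside = []
    for controller in wanted:
        path = placement.get(controller)
        if path is None or not (path == scope or path.startswith(scope + "/")):
            outside.append(controller)
    return outside


placement = parse_proc_cgroup(SAMPLE)
# Controllers that still need the PID added explicitly in this sample.
todo = controllers_outside(placement, SCOPE, ["cpuset", "cpu", "cpuacct"])
```

On a live host the same function can be fed the contents of
/proc/<pid>/task/<tid>/cgroup to see whether a thread ended up in the
emulator, vcpuN or iothreadN sub-group.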