On Thu, Jan 14, 2016 at 02:09:52PM +0100, Henning Schild wrote:
> On Thu, 14 Jan 2016 12:37:18 +0000
> "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote:
>
> > On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> > > Since this has been puzzling us for a while, let me recap on the
> > > cgroup setup in general.
> > >
> > > First, I'll describe how it used to work *before* Henning's patches
> > > were merged, on a systemd based host.
> > >
> > >  - The QEMU driver forks a child process, but does *not* exec QEMU
> > >    yet. The cgroup placement at this point is inherited from
> > >    libvirtd. It may look like this:
> > >
> > >      10:freezer:/
> > >      9:cpuset:/
> > >      8:perf_event:/
> > >      7:hugetlb:/
> > >      6:blkio:/system.slice
> > >      5:memory:/system.slice
> > >      4:net_cls,net_prio:/
> > >      3:devices:/system.slice/libvirtd.service
> > >      2:cpu,cpuacct:/system.slice
> > >      1:name=systemd:/system.slice/libvirtd.service
> > >
> > >  - The QEMU driver calls virCgroupNewMachine()
> > >
> > >     - We call virSystemdCreateMachine with pidleader=$child
> > >
> > >     - Systemd creates the initial machine scope unit under
> > >       the machine slice unit, for the "systemd" controller.
> > >       It may also add the PID to *zero* or more other
> > >       resource controllers.
> > >       So at this point the cgroup placement may look like this:
> > >
> > >         10:freezer:/
> > >         9:cpuset:/
> > >         8:perf_event:/
> > >         7:hugetlb:/
> > >         6:blkio:/
> > >         5:memory:/
> > >         4:net_cls,net_prio:/
> > >         3:devices:/
> > >         2:cpu,cpuacct:/
> > >         1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > >       Or may look like this:
> > >
> > >         10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >         9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > >         8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >         7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >         6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >         5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >         4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >         3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >         2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > >         1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > >       Or anywhere in between. We have *ZERO* guarantee about
> > >       what other resource controllers we may have been placed in
> > >       by systemd. There is some fairly complex logic that
> > >       determines this, based on what other tasks currently exist
> > >       in sibling cgroups, and what tasks have *previously* existed
> > >       in the cgroups. IOW, you should consider the list of extra
> > >       resource controllers essentially non-deterministic.
> > >
> > >     - We call virCgroupAddTask with pid=$child
> > >
> > >       This places the pid in any resource controllers we need,
> > >       which systemd has not already set up.
> > >       IOW, it guarantees that we now have placement that should
> > >       look like this, regardless of what systemd has done:
> > >
> > >         10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >         9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > >         8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >         7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >         6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >         5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >         4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >         3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >         2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > >         1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > >  - The QEMU driver now lets the child process exec QEMU. QEMU
> > >    creates its vCPU threads at this point. All QEMU threads
> > >    (emulator, vcpu and I/O threads) now have the cgroup placement
> > >    shown above.
> > >
> > >  - We create the emulator cgroup for the cpuset, cpu, cpuacct
> > >    controllers and move all threads into this new cgroup. All
> > >    threads (emulator, vcpu and I/O threads) thus now have
> > >    placement of:
> > >
> > >      10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >      9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > >      8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >      7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >      6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >      5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >      4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >      3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >      2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > >      1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > >    Yes, we really did move the vcpu threads into the emulator
> > >    group...
> > >
> > >  - We now ask QEMU which are the vCPU & I/O threads.
> > >
> > >  - For each vCPU thread we create a new vCPU cgroup and move the
> > >    thread into it:
> > >
> > >      10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >      9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > >      8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >      7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >      6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >      5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >      4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >      3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >      2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > >      1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > >  - For each I/O thread we create a new iothread cgroup and move
> > >    the thread into it:
> > >
> > >      10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >      9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > >      8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >      7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >      6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >      5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >      4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >      3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >      2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > >      1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> > BTW, on a slight tangent, the kernel is throwing a spanner in the
> > works in the near future. They have just accepted cgroupv2 into
> > mainline. Broadly speaking this is very nice because they got rid
> > of the idea of a separate mount point for each controller, and
> > instead have a single filesystem tree. The problem is that they
> > decided the granularity of placement is at a *process* level, not
> > a *thread* level. So it will no longer be possible for us to have
> > separate cgroups for emulator, vcpus & i/o threads.
> > Everything will have to live in the same cgroup :-(
> > For cpu accounting and cpu affinity I think we can still achieve
> > what we need by using a combination of cgroups and
> > sched_setaffinity and /proc. I'm not sure what we'll do about
> > per-thread scheduler policies for period + quota though - not sure
> > if there's an API for setting those or not ?!?!
> >
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
>
> Good to know. Do you have that on the agenda for libvirt? I guess
> eventually v1 will get deprecated...

We'll have no choice but to use cgroupv2 as soon as systemd starts
using it....

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
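
The placement listings quoted in this thread use the cgroup v1 format of
/proc/<pid>/cgroup, one "hierarchy-id:controllers:path" entry per line.
For anyone wanting to check which controllers systemd actually placed a
PID into, here is a minimal Python sketch. It is not libvirt code: the
function names parse_proc_cgroup and controllers_outside are
hypothetical, and the sample data is copied from one of the listings in
the thread.

```python
# Sketch only: the helper names here are made up for illustration and
# are not part of libvirt or systemd.

# Scope path and placement copied from the thread (the case where systemd
# only set up its own "name=systemd" hierarchy).
SCOPE = r"/machine.slice/machine-qemu\x2dserial.scope"

SAMPLE = r"""
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/
5:memory:/
4:net_cls,net_prio:/
3:devices:/
2:cpu,cpuacct:/
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
"""


def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path."""
    placement = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, so the path stays intact.
        _hierarchy_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers share one entry, e.g. "cpu,cpuacct".
        for controller in controllers.split(","):
            placement[controller] = path
    return placement


def controllers_outside(placement, scope):
    """Return the controllers whose path is neither the scope nor below it."""
    return sorted(
        name
        for name, path in placement.items()
        if path != scope and not path.startswith(scope + "/")
    )
```

Calling controllers_outside(parse_proc_cgroup(SAMPLE), SCOPE) lists every
controller that would still need an explicit placement step for this
sample. To inspect a live process, pass in the contents of
/proc/<pid>/cgroup instead of SAMPLE.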