On Fri, 26 Feb 2016 13:00:04 +0000 "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote: > On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote: > > On Fri, 26 Feb 2016 12:21:02 +0000 > > "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote: > > > > > On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote: > > > > On Fri, 26 Feb 2016 11:13:13 +0000 > > > > "Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote: > > > > > > > > > On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild > > > > > wrote: > > > > > > virCgroupNewMachine used to add the pidleader to the newly > > > > > > created machine cgroup. Do not do this implicit anymore. > > > > > > > > > > > > Signed-off-by: Henning Schild <henning.schild@xxxxxxxxxxx> > > > > > > --- > > > > > > src/lxc/lxc_cgroup.c | 11 +++++++++++ > > > > > > src/qemu/qemu_cgroup.c | 11 +++++++++++ > > > > > > src/util/vircgroup.c | 22 ---------------------- > > > > > > 3 files changed, 22 insertions(+), 22 deletions(-) > > > > > > > > > > NACK to this patch once again. > > > > > > > > > > This does not actually work as you think it does. > > > > > > > > > > > diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c > > > > > > index 11f33ab..aef8e8c 100644 > > > > > > --- a/src/util/vircgroup.c > > > > > > +++ b/src/util/vircgroup.c > > > > > > @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char > > > > > > *name, } > > > > > > } > > > > > > > > > > > > - if (virCgroupAddTask(*group, pidleader) < 0) { > > > > > > - virErrorPtr saved = virSaveLastError(); > > > > > > - virCgroupRemove(*group); > > > > > > - virCgroupFree(group); > > > > > > - if (saved) { > > > > > > - virSetError(saved); > > > > > > - virFreeError(saved); > > > > > > - } > > > > > > - } > > > > > > > > > > Just above this we called virSystemdCreateMachine. Systemd > > > > > will create the cgroup and add the pidleader to those > > > > > cgroups. Systemd may add the pidleader to just the 'systemd' > > > > > controller, or it may add the pidleader to *ALL* controllers. > > > > > We have no way of knowing. > > > > > > > > > > This virCgroupAddTask call deals with whatever systemd chose > > > > > not todo, so we can guarantee consistent behaviour with the > > > > > pidleader in all cgroups. > > > > > > > > > > By removing this you make this method non-deterministic - the > > > > > pid may or may not be in the cpu controller now. THis is bad > > > > > because it can lead to QEMU/LXC driver code working in some > > > > > cases but failing in other cases. > > > > > > > > > > Furthermore, this existing does not cause any problems for the > > > > > scenario you care about. THis cgroup placement is being set > > > > > in between the time libvirtd calls fork() and exec(). With > > > > > your later patch 5, we ensure that the PID is moved across > > > > > into the emulator cgroup, before we call exec(). When we call > > > > > exec all memory mappings will be replaced, so QEMU will stil > > > > > start with the correct vCPU placement and memory allocation > > > > > placement. > > > > > > > > I agree having the task in the wrong cgroup before the exec() > > > > seems harmless. But i am not sure all the fiddling with cgroups > > > > is indeed harmless and does not cause i.e. kernel work on cores > > > > that should be left alone. I have the feeling allowing the task > > > > in the parent cgroup is a bad idea, no matter how short the > > > > window seems to be. > > > > > > > > Right now the parent cgroup contains all cpus found in > > > > machine.slice, which for pinned VMs is too much. How about we > > > > calculate the size of the child cgroups before and make the > > > > parent the union of them. Or give the parent the emulator > > > > pinning and extend it for the vcpus later. But that might turn > > > > out pretty complicated as well, getting the order right with > > > > the mix of cpusets and sched_setaffinity(). > > > > > Just just drop this patch please. > > > > > > The point is though that we have *no* choice. Systemd can put the > > > task in the cpu controller and we've no way to prevent that. So > > > the code *has* to be able to cope with that happening. Therefore > > > this patch is wrong it just makes behaviour non-deterministic > > > increasing the chances that we don't correctly handle the case > > > where systemd adds the task to the cpu controllers > > > > Understood! I was suggesting a "growing on demand" policy instead of > > "shrinking after inheriting all". > > > > If we can not control what systemd does we have to give it harmless > > cpus to mess around with. That assumes we can control the size of > > the cpuset before systemd puts anything in. Once back in control we > > grow the parent group before deriving more child groups. > > > > Would that be possible? > > > > I have no objections to keep using the shrinking approach. > > Especially since the controlled growing is harder to implement in > > the given codebase. It just feels like it should be the other way > > around. > > I don't see what actual problem you are trying to solve here. > > > IIUC, the original problem you wanted to address was that vCPU pids > could run the wrong CPU for a short time. ie The original code did > > 1. Libvirtd forks child pid > ... this pid runs on whatever pCPUs that libvirt is permitted to > use... 2. Libvirtd creates root cgroup for VM > ... this pid runs on whatever pCPUs the root cgroup inherited... > 3. Child pid execs QEMU > ... QEMU pid runs on whatever pCPUs the root cgroup inherited... > 4. QEMU creates vCPU pids > ... vCPU pids run on whatever pCPUs the root cgroup inherited... > 4. Libvirtd moves emulator PIDs and vCPU PIDs > ... emulator PIDs are running on assigned pCPUs for emulator... > ... vCPU PIDs are running on assigned pCPUs for vCPUs.... > > With the important change in patch 5 this now looks like > > 1. Libvirtd forks child pid > ... this pid runs on whatever pCPUs that libvirt is permitted to > use... 2. Libvirtd creates root cgroup for VM > ... this pid runs on whatever pCPUs the root cgroup inherited... I am trying to come up with a solution that eliminates the above line from the whole bringup. I.e never allow a pid belonging to the VM outside the pinnings of libvirtd and the VM configuration. But until step 4 it should probably be "... this pid *just sits* on whatever pCPUs the root cgroup inherited..." If we are sure that it does not "run" before 4. patch 5 does the trick already > 3. Libvirtd moves pid into emulator group > ... this pid runs on assigned pCPUs for emulator... > 4. Child pid execs QEMU > ... QEMU pid runs on assigned pCPUs for emulator... > 5. QEMU creates vCPU pids > ... vCPU pids are running on assigned pCPUS for emulator... > 6. Libvirtd moves vCPU PIDs > ... emulator PIDs are running on assigned pCPUs for emulator... > ... vCPU PIDs are running on assigned pCPUs for vCPUs.... > > Which is good, because vCPU pids don't ever run on un-restricted > pCPUs. > > > Your patch 4 here is attempting to change step 2 only so that it > looks like > > > 1. Libvirtd forks child pid > ... this pid runs on whatever pCPUs that libvirt is permitted to > use... 2. Libvirtd creates root cgroup for VM > ... this pid runs on whatever pCPUs that libvirt is permitted to > use... or depending on what controller system added > ... this pid runs on whatever pCPUs the root cgroup inherited... > 3. Libvirtd adds pid into emulator group > ... this pid runs on assigned pCPUs for emulator... > 4. Child pid execs QEMU > ... QEMU pid runs on assigned pCPUs for emulator... > 5. QEMU creates vCPU pids > ... vCPU pids are running on assigned pCPUS for emulator... > 6. Libvirtd moves vCPU PIDs > ... emulator PIDs are running on assigned pCPUs for emulator... > ... vCPU PIDs are running on assigned pCPUs for vCPUs.... > > At the time we exec QEMU in step 4 the situation is exactly the same > as before. The vCPU pids are still created in the right place > straight away. > > So this patch 4 doesn't achieve anything useful > > Regards, > Daniel -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list