On Wed, Sep 05, 2012 at 01:32:12PM +0800, Tang Chen wrote:
> Hi Srivatsa, Daniel,
>
> Thank you very much for all the comments. :)
>
> On 09/05/2012 04:57 AM, Srivatsa S. Bhat wrote:
> >I had posted a Linux kernel patchset[1] some time ago to expose another
> >file so that we can distinguish between the user specified settings vs the
> >actual scenario underneath. But the conclusion in the ensuing discussion
> >was that the existing kernel behaviour is good as is, and trying to "fix"
> >it would break kernel semantics. (However, note that the suspend/resume
> >case has been fixed in the kernel by commit d35be8bab).
> >
> >[1]. http://thread.gmane.org/gmane.linux.documentation/4805
> >
>
> The reason why I made this patch set is that if libvirt doesn't
> recover the cpuset.cpus, all the domains with vcpus pinned to
> a *re-plugged* cpu in xml will fail to start. Which means all these
> domains will be unusable, or we have to modify the configuration.
>
> If the cpu is really removed, it is normal for a domain to fail to start.
> We can simply print an error message.
> But if the cpu is added again, and it is active and usable, the domain
> should be able to start normally. (am I right here ?)
> This is the key problem I want to solve.
>
> So first, I improved the netlink related code in libvirt, and now
> libvirt can be notified when a cpu hotplug event occurs.

Your patch appears to work in some limited scenarios, but more
generally it will fail to work, and will result in undesirable
behaviour. Consider, for example, libvirtd being configured thus:

  cd /sys/fs/cgroup/cpuset
  mkdir demo
  cd demo
  echo 2-3 > cpuset.cpus
  echo 0 > cpuset.mems
  echo $$ > tasks
  /usr/sbin/libvirtd

ie, libvirtd is now running on cpus 2-3, in group 'demo'. VMs
will be created in /sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME

Your patch attempts to set the cpuset.cpus on 'libvirt/qemu/$VMNAME'
but ignores the fact that there could be many higher directories
(eg demo here) that need setting.
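
As a rough illustration of that chain of directories, here is a small
shell sketch that walks from a VM's group up to the cpuset mount point
and prints cpuset.cpus at each level. The name cpuset_chain is
hypothetical (it is not a libvirt or kernel tool), and the sketch
assumes the cpuset layout from the example above.

```shell
# Sketch only: cpuset_chain is a made-up helper, not part of libvirt.
# $1 = cpuset mount point, $2 = the VM's cgroup directory beneath it.
# Pass both paths without trailing slashes.
# Prints "<dir>: <cpuset.cpus contents>" for the VM's group and for
# every parent directory up to and including the mount point.
cpuset_chain() {
    root=$1
    dir=$2
    while :; do
        if [ -r "$dir/cpuset.cpus" ]; then
            printf '%s: %s\n' "$dir" "$(cat "$dir/cpuset.cpus")"
        else
            printf '%s: (unreadable)\n' "$dir"
        fi
        # Stop once the mount point itself has been printed.
        [ "$dir" = "$root" ] && break
        case $dir in
            "$root"/*) dir=$(dirname "$dir") ;;  # step up one level
            *) return 1 ;;                       # $2 was not under $1
        esac
    done
}
```

With the layout above it could be run as:

  cpuset_chain /sys/fs/cgroup/cpuset /sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME

Every directory it lists, including 'demo', carries its own mask.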

libvirtd, however, should not be responsible for / allowed to change
settings in the parent cgroups of the one it was started in. ie in this
example, libvirtd should *not* touch the 'demo' cgroup.

So consider systemd starting tasks, giving them custom cgroups. Now
systemd also has to listen for netlink events and reset the cpuset masks.

Things are even worse if the admin has temporarily offlined all the
cpus that are associated with the current cpuset. When this happens
the kernel throws libvirtd and all its VMs out of their current
cgroups and dumps them in a parent cgroup (potentially even the
root group). This is really awful.

> I read the emails posted above. In summary, you discussed the
> following problems:
>
> 1) Make cgroup be able to distinguish actual configuration and user's.
>    - ( Srivatsa's idea: mask = (actual config) & (user config) )
>      Seems that it is hard to apply for some cgroup design reasons.
>
> 2) Kill all the tasks on the cpu when hot-unplugging it.
>    - I don't think this is a good idea. And, this won't solve the
>      problem.
>      For example, a task is bound to cpu 3. Suppose cpu 3 is unplugged,
>      * if the task is killed, it's just too rude, and users
>        running important tasks will suffer.
>      * if the task is migrated to other cpus, what if cpu 3 is active
>        again ? Are we going to see the added cpu 3 is not the original
>        cpu 3 ?
>      Whatever, the domain will still fail to start.

IMHO, execution of those tasks should simply be paused (same way
that the 'freezer' cgroup pauses tasks). The admin can then either
move the tasks to an alternate cgroup, or change the cpuset mask
to allow them to continue running.

The kernel's current behaviour of pushing all tasks up into a parent
cgroup is just crazy - it is just throwing away the user's requested
cpu mask forever :-(

> 3) Make cpu hot unplug fail when there are tasks on it.
>    - This may be unacceptable for hotplug users. And this won't solve
>      the problem either.
>      If the domain is not running when the hot unplug happens, the hot
>      unplug will succeed. And when we start the domain, it will fail
>      anyway, right ?
>
> 4) Make libvirt not use cpuset cgroup.
>    - For now, seems impossible.
>      sched_setaffinity() behaves properly, which assumes the re-plugged
>      cpu is the same one unplugged before. (am I right ?)
>      But with cgroup's control, we cannot resolve this problem using
>      sched_setaffinity().
>
>
> If I want to solve the start failure problem, what should I do ?

I maintain that the problems we see with the cpuset controller cannot
be reasonably solved by libvirtd, or userspace in general. The kernel
behaviour is just flawed. If the kernel won't fix it, then we should
recommend people not to use the cpuset cgroup at all, and just rely
on our sched_setaffinity support instead.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|