On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:
> On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:
>> > On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>> >> Hi Daniel,
>> >>
>> >> would it be possible to adopt an optional tunable for the virCgroup
>> >> mechanism that disables nested (per-thread) cgroup creation? Those
>> >> groups bring visible overhead for many-threaded guest workloads,
>> >> almost 5% even when the host CPUs are not congested, primarily
>> >> because the host scheduler has to make many more decisions with
>> >> those cgroups than without them. We also experienced a lot of host
>> >> lockups with the cgroup placement in use at the time and disabled
>> >> the nested behaviour a couple of years ago. Though the current
>> >> patch simply carves out the mentioned behaviour, leaving only
>> >> top-level per-machine cgroups, it could serve upstream after some
>> >> adaptation; that's why I'm asking about the chance of its
>> >> acceptance. This message is a kind of feature request: the patch
>> >> can either be accepted or dropped on our side, or someone may give
>> >> a hand and redo it from scratch. The detailed benchmarks were done
>> >> against a 3.10.y host; if anyone is interested in the numbers for
>> >> the latest stable, I can update them.
>> >
>> > When you say nested cgroup creation, are you referring to the modern
>> > libvirt hierarchy, or the legacy hierarchy - as described here:
>> >
>> > http://libvirt.org/cgroups.html
>> >
>> > The current libvirt setup used for a year or so now is much shallower
>> > than previously, to the extent that we'd consider performance problems
>> > with it to be the job of the kernel to fix.
>>
>> Thanks, I'm referring to the 'new nested' hierarchy for the overhead
>> mentioned above. The host crashes I mentioned happened with the old
>> hierarchy back then; I forgot to mention this. Despite the flattening
>> of the topology in the current scheme, it should be possible to
>> disable fine-grained group creation for the VM threads for users who
>> don't need per-vCPU CPU pinning/accounting (the overhead is caused by
>> placement in the cpu cgroup, not by the accounting/pinning ones; I'm
>> assuming such a disablement would apply equally to all nested-aware
>> cgroup types); that's the point for now.
>
> Ok, so the per-vCPU cgroups are used for a couple of things:
>
>  - Setting scheduler tunables - period/quota/shares/etc
>  - Setting CPU pinning
>  - Setting NUMA memory pinning
>
> In addition to the per-vCPU cgroup, we have one cgroup for each
> I/O thread, and also one more for general QEMU emulator threads.
>
> In the case of CPU pinning we already have automatic fallback to
> sched_setaffinity if the CPUSET controller isn't available.
>
> We could in theory start off without the per-vCPU/emulator/I/O
> cgroups and only create them as & when the feature is actually
> used. The concern I would have though is that changing the cgroup
> layout on the fly may cause unexpected side effects in the behaviour
> of the VM. More critically, there would be a lot of places in the
> code where we would need to deal with this, which could hurt
> maintainability.
>
> How confident are you that the performance problems you see are
> inherent to the actual use of the cgroups, and not instead the result
> of some particular bad choice of default parameters we might have
> left in the cgroups? In general I'd prefer to try to eliminate the
> perf impact before we consider the complexity of disabling this
> feature.
>
> Regards,
> Daniel
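For context, the per-vCPU placement under discussion boils down to
roughly the sketch below. This is a minimal illustration against the
cgroup v1 'cpu' controller; the mount point, group names and tunable
values are assumptions for the example, not the exact paths libvirt
creates.

/* Create a per-vCPU group, move a thread into it and set a CFS quota.
 * Group path and values are illustrative only. */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    /* assumed layout, not libvirt's literal naming */
    const char *grp = "/sys/fs/cgroup/cpu/machine/qemu-guest/vcpu0";
    char path[256], tid[32];

    if (mkdir(grp, 0755) < 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* Move the calling thread (stand-in for a vCPU thread) into the group. */
    snprintf(tid, sizeof(tid), "%ld\n", syscall(SYS_gettid));
    snprintf(path, sizeof(path), "%s/tasks", grp);
    write_str(path, tid);

    /* Example scheduler tunables: 50% of one CPU per 100ms period. */
    snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", grp);
    write_str(path, "100000\n");
    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", grp);
    write_str(path, "50000\n");
    return 0;
}

Every vCPU, every I/O thread and the general emulator group gets such
a directory, so this is roughly the set of extra scheduling entities
the host scheduler has to walk in addition to the tasks themselves.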
Hm, what are you proposing to begin with in terms of testing? By my
understanding, the excessive cgroup usage along with small scheduler
quanta *will* lead to some overhead anyway. Let's look at the numbers,
which I will bring tomorrow: the five percent mentioned above was
caught with a guest 'perf numa xxx' run for different kinds of
mappings and host behaviour (post-3.8): memory automigration on/off, a
kind of 'NUMA passthrough', i.e. grouping vCPU threads according to
the host and emulated guest NUMA topologies, and totally scattered,
unpinned threads within a single NUMA node and across multiple NUMA
nodes. As the result for 3.10.y, there was a five-percent difference
between the best-performing case with thread-level cpu cgroups and the
'totally scattered' case on a simple mid-range two-socket node. If you
think that the choice of an emulated workload is wrong, please let me
know; I was afraid that a non-synthetic workload in the guest might
suffer from a range of side factors, and therefore chose perf for this
task.
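For completeness, the sched_setaffinity() fallback mentioned above for
CPU pinning needs no cpuset cgroup at all; a minimal sketch follows,
where the target CPU number and the use of the calling thread are just
placeholders.

/* Pin the calling thread to one host CPU without any cpuset cgroup. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int host_cpu = 2;   /* placeholder target host CPU */

    CPU_ZERO(&mask);
    CPU_SET(host_cpu, &mask);

    /* pid 0 means the calling thread; pass a TID to pin another thread. */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

So pinning would presumably remain available even if the per-thread
cgroups were made optional for users who do not need the per-vCPU
accounting or quota.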