On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:
> On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:
>> > On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>> >> Hi Daniel,
>> >>
>> >> would it be possible to adopt an optional tunable for the virCgroup
>> >> mechanism that disables nested (per-thread) cgroup creation? Those
>> >> groups bring visible overhead for many-threaded guest workloads,
>> >> almost 5% even when the host CPUs are not congested, primarily
>> >> because the host scheduler has to make many more decisions with
>> >> those cgroups than without them. We also experienced a lot of host
>> >> lockups with the cgroup placement in use at the time and disabled
>> >> the nested behaviour a couple of years ago. Though the current
>> >> patch simply carves out the mentioned behaviour, leaving only
>> >> top-level per-machine cgroups, it could serve upstream after some
>> >> adaptation; that's why I'm asking about the chance of its
>> >> acceptance. This message is a kind of feature request: the patch
>> >> can either be accepted or dropped on our side, or someone may give
>> >> a hand and redo it from scratch. The detailed benchmarks were done
>> >> against a 3.10.y host; if anyone is interested in the numbers for
>> >> the latest stable, I can update them.
>> >
>> > When you say nested cgroup creation, are you referring to the modern
>> > libvirt hierarchy, or the legacy hierarchy - as described here:
>> >
>> > http://libvirt.org/cgroups.html
>> >
>> > The current libvirt setup used for a year or so now is much shallower
>> > than previously, to the extent that we'd consider performance problems
>> > with it to be the job of the kernel to fix.
>>
>> Thanks, I'm referring to the 'new nested' hierarchy for the overhead
>> mentioned above. The host crashes I mentioned happened with the old
>> hierarchy back then; I forgot to mention this. Despite the flattening
>> of the topology in the current scheme, it should be possible to
>> disable fine-grained group creation for the VM threads for users who
>> don't need per-vCPU CPU pinning/accounting (the overhead is caused by
>> placement in the cpu cgroup, not by the accounting/pinning ones; I'm
>> assuming such a disablement would apply equally to all nested-aware
>> cgroup types); that's the point for now.
>
> Ok, so the per-vCPU cgroups are used for a couple of things:
>
>  - Setting scheduler tunables - period/quota/shares/etc
>  - Setting CPU pinning
>  - Setting NUMA memory pinning
>
> In addition to the per-vCPU cgroup, we have one cgroup for each
> I/O thread, and also one more for general QEMU emulator threads.
>
> In the case of CPU pinning we already have automatic fallback to
> sched_setaffinity if the CPUSET controller isn't available.
>
> We could in theory start off without the per-vCPU/emulator/I/O
> cgroups and only create them as & when the feature is actually
> used. The concern I would have though is that changing the cgroup
> layout on the fly may cause unexpected side effects in the behaviour
> of the VM. More critically, there would be a lot of places in the
> code where we would need to deal with this, which could hurt
> maintainability.
>
> How confident are you that the performance problems you see are
> inherent to the actual use of the cgroups, and not instead the result
> of some particular bad choice of default parameters we might have
> left in the cgroups? In general I'd prefer to try to eliminate the
> perf impact before we consider the complexity of disabling this
> feature.
>
> Regards,
> Daniel
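For context, the per-vCPU placement under discussion boils down to
roughly the sketch below. This is a minimal illustration against the
cgroup v1 'cpu' controller; the mount point, group names and tunable
values are assumptions for the example, not the exact paths libvirt
creates.

/* Create a per-vCPU group, move a thread into it and set a CFS quota.
 * Group path and values are illustrative only. */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    /* assumed layout, not libvirt's literal naming */
    const char *grp = "/sys/fs/cgroup/cpu/machine/qemu-guest/vcpu0";
    char path[256], tid[32];

    if (mkdir(grp, 0755) < 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* Move the calling thread (stand-in for a vCPU thread) into the group. */
    snprintf(tid, sizeof(tid), "%ld\n", syscall(SYS_gettid));
    snprintf(path, sizeof(path), "%s/tasks", grp);
    write_str(path, tid);

    /* Example scheduler tunables: 50% of one CPU per 100ms period. */
    snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", grp);
    write_str(path, "100000\n");
    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", grp);
    write_str(path, "50000\n");
    return 0;
}

Every vCPU, every I/O thread and the general emulator group gets such
a directory, so this is roughly the set of extra scheduling entities
the host scheduler has to walk in addition to the tasks themselves.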
Hm, what are you proposing to begin with in terms of testing? By my
understanding, the excessive cgroup usage along with small scheduler
quanta *will* lead to some overhead anyway. Let's look at the numbers,
which I will bring tomorrow: the five percent mentioned above was
caught with a guest 'perf numa xxx' run for different kinds of
mappings and host behaviour (post-3.8): memory automigration on/off, a
kind of 'NUMA passthrough', i.e. grouping vCPU threads according to
the host and emulated guest NUMA topologies, and totally scattered,
unpinned threads within a single NUMA node and across multiple NUMA
nodes. As the result for 3.10.y, there was a five-percent difference
between the best-performing case with thread-level cpu cgroups and the
'totally scattered' case on a simple mid-range two-socket node. If you
think that the choice of an emulated workload is wrong, please let me
know; I was afraid that a non-synthetic workload in the guest might
suffer from a range of side factors, and therefore chose perf for this
task.
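For completeness, the sched_setaffinity() fallback mentioned above for
CPU pinning needs no cpuset cgroup at all; a minimal sketch follows,
where the target CPU number and the use of the calling thread are just
placeholders.

/* Pin the calling thread to one host CPU without any cpuset cgroup. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int host_cpu = 2;   /* placeholder target host CPU */

    CPU_ZERO(&mask);
    CPU_SET(host_cpu, &mask);

    /* pid 0 means the calling thread; pass a TID to pin another thread. */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

So pinning would presumably remain available even if the per-thread
cgroups were made optional for users who do not need the per-vCPU
accounting or quota.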