Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@xxxxxxxxxx> wrote:
>
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@xxxxxxxxxx> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad.  The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods.  The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms.  If a bursty task goes from underutilizing quota to
> using its entire quota, it will not be able to burst in the
> subsequent periods.  Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU.  I think this worst case 5ms tradeoff is
> entirely worth it.
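(For reference, a minimal sketch of the knobs discussed above, assuming cgroup v1 mounted at /sys/fs/cgroup/cpu and a hypothetical group named "bursty":

  $ cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us        # per-cpu slice size, 5000us by default
  $ cat /sys/fs/cgroup/cpu/bursty/cpu.cfs_period_us          # enforcement period, 100000us by default
  $ cat /sys/fs/cgroup/cpu/bursty/cpu.cfs_quota_us           # runtime allowed per period, -1 = unlimited
)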
>
> This does mean that, theoretically, a poorly written massively
> threaded application on an 80 core box, one that spreads itself onto 80
> cpu run queues, can overutilize its quota by at most 5ms
> * 80 CPUs in a single period (slice * number of runqueues the
> application is running on).  But that means that each of those threads
> would have had to not use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally, if cpu-bound threads underutilize a slice of their quota
> in a period due to CFS choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".

OK, so it is indeed possible that CPU bound threads will underutilize a slice
of their quota in a period as a result of this patch. This should probably
be clearly stated in the code comments and in the commit message.

In addition, I believe that although many workloads will indeed be
indifferent to getting their fair share "later", some latency-sensitive
workloads will definitely be negatively affected by this temporary
CPU quota stealing by bursty antagonists. So there should probably be
a way to limit this behavior; for example, by making it tunable
per cgroup.

>
> Please be careful here: quota and slices are being treated differently.
> Quota does not roll over between periods; only slices of quota that
> have already been allocated to per-cpu run queues do.  If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues, then in the worst case it can use 3 x slice
> size = 15ms extra in periods following underutilization.
>
> So why does this matter?  Well, applications that use thread pools
> (*cough* java *cough*) with lots of tiny little worker threads tend
> to spread themselves out onto a lot of run queues.  These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms).  This results in those worker
> threads starving the main application threads of quota, with the
> remainder of each slice then expiring on the per-cpu run queue.  Going
> back to my earlier 100ms quota / 80 cpu example: only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period, so only 20 out of these 80 cpus ever get a slice allocated to
> them.  By allowing these per-cpu run queues to use their remaining
> slice in following periods, these worker threads do not need to be
> allocated additional slices, and the main threads are actually
> able to use the allocated cpu quota.
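(A hedged aside: the starvation described here is visible from userspace in the group's throttling statistics; assuming cgroup v1 and a hypothetical group name:

  $ cat /sys/fs/cgroup/cpu/mygroup/cpu.stat      # reports nr_periods, nr_throttled, throttled_time

A group hitting this problem shows nr_throttled climbing nearly every period even though its total cpu usage stays well below the configured quota.)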
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7.  This run is expected to show fewer iterations, but the
> bigger problem is that the cpu usage is far less than the 500ms that it
> should have received.
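(A minimal reproduction sketch, assuming cgroup v1, a hypothetical group name, and hypothetical quota values; the runfibtest script itself may handle cgroup setup differently:

  $ mkdir -p /sys/fs/cgroup/cpu/fibtest
  $ echo 100000 > /sys/fs/cgroup/cpu/fibtest/cpu.cfs_period_us   # 100ms period
  $ echo 50000  > /sys/fs/cgroup/cpu/fibtest/cpu.cfs_quota_us    # 50ms quota per period
  $ echo $$ > /sys/fs/cgroup/cpu/fibtest/cgroup.procs            # move this shell into the group
  $ ./runfibtest 1
  $ ./runfibtest 8

Compare the cpu usage reported by the two runs against the quota the group was actually granted.)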
>
> Thanks for the engagement on this,
> Dave Chiluk


