On Tue, Jul 23, 2019 at 01:13:09PM -0400, Phil Auld wrote: > Hi Dave, > > On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote: > > It has been observed, that highly-threaded, non-cpu-bound applications > > running under cpu.cfs_quota_us constraints can hit a high percentage of > > periods throttled while simultaneously not consuming the allocated > > amount of quota. This use case is typical of user-interactive non-cpu > > bound applications, such as those running in kubernetes or mesos when > > run on multiple cpu cores. > > > > This has been root caused to cpu-local run queue being allocated per cpu > > bandwidth slices, and then not fully using that slice within the period. > > At which point the slice and quota expires. This expiration of unused > > slice results in applications not being able to utilize the quota for > > which they are allocated. > > > > The non-expiration of per-cpu slices was recently fixed by > > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > > condition")'. Prior to that it appears that this had been broken since > > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > > added the following conditional which resulted in slices never being > > expired. > > > > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > > /* extend local deadline, drift is bounded above by 2 ticks */ > > cfs_rq->runtime_expires += TICK_NSEC; > > > > Because this was broken for nearly 5 years, and has recently been fixed > > and is now being noticed by many users running kubernetes > > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > > that the mechanisms around expiring runtime should be removed > > altogether. > > > > This allows quota already allocated to per-cpu run-queues to live longer > > than the period boundary. This allows threads on runqueues that do not > > use much CPU to continue to use their remaining slice over a longer > > period of time than cpu.cfs_period_us. However, this helps prevent the > > above condition of hitting throttling while also not fully utilizing > > your cpu quota. > > > > This theoretically allows a machine to use slightly more than its > > allotted quota in some periods. This overflow would be bounded by the > > remaining quota left on each per-cpu runqueueu. This is typically no > > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will > > change nothing, as they should theoretically fully utilize all of their > > quota in each period. For user-interactive tasks as described above this > > provides a much better user/application experience as their cpu > > utilization will more closely match the amount they requested when they > > hit throttling. This means that cpu limits no longer strictly apply per > > period for non-cpu bound applications, but that they are still accurate > > over longer timeframes. > > > > This greatly improves performance of high-thread-count, non-cpu bound > > applications with low cfs_quota_us allocation on high-core-count > > machines. In the case of an artificial testcase (10ms/100ms of quota on > > 80 CPU machine), this commit resulted in almost 30x performance > > improvement, while still maintaining correct cpu quota restrictions. > > That testcase is available at https://github.com/indeedeng/fibtest. > > > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > > Signed-off-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx> > > Reviewed-by: Ben Segall <bsegall@xxxxxxxxxx> > > This still works for me. The documentation reads pretty well, too. Good job. > > Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@xxxxxxxxxx>. Thanks guys!