Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller

Vincent Guittot <vincent.guittot@xxxxxxxxxx> · Tue, 30 Apr 2019 18:20:46 +0200

Hi Song,

On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@xxxxxx> wrote:
>
>
>
> > On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > Hi Song,
> >
> > On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@xxxxxx> wrote:
> >>
> >> Hi Morten and Vincent,
> >>
> >>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@xxxxxx> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@xxxxxx> wrote:
> >>>>>
> >>>>> Hi Morten,
> >>>>>
> >>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> >>>>>>
> >>>>
> >>>>>>
> >>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
> >>>>>> workload. I'm not convinced that adding headroom gives any latency
> >>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
> >>>>>
> >>>>> We think the latency improvements actually come from watering down the
> >>>>> impact of side jobs. Headroom does not just statistically improve average
> >>>>> latency numbers; it also reduces resource contention caused by the side
> >>>>> workload. I don't know whether it is from reducing contention for ALUs,
> >>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
> >>>>> latencies when headroom is used.
> >>>>>
> >>>>>> the throttling mechanism effectively removes the throttled tasks from
> >>>>>> the schedule according to a specific duty cycle. When the side job is
> >>>>>> not throttled the main workload is experiencing the same latency issues
> >>>>>> as before, but by dynamically tuning the side job throttling you can
> >>>>>> achieve a better average latency. Am I missing something?
> >>>>>>
> >>>>>> Have you looked at your distribution of main job latency and tried to
> >>>>>> compare with when throttling is active/not active?
> >>>>>
> >>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
> >>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
> >>>>> throttling, so that the side workload gets some runtime in every period.
> >>>>> Therefore, if we look at a time window equal to or bigger than 100ms, we
> >>>>> don't really see "throttling active time" vs. "throttling inactive time".
> >>>>>
> >>>>>>
> >>>>>> I'm wondering if the headroom solution is really the right solution for
> >>>>>> your use-case or if what you are really after is something which is
> >>>>>> lower priority than just setting the weight to 1. Something that
> >>>>>
> >>>>> The experiments show that cpu.weight works properly for priority: the
> >>>>> main workload gets priority to use the CPU, while the side workload only
> >>>>> fills the idle CPU. However, this is not sufficient, as the side workload
> >>>>> creates big enough contention to impact the main workload.
> >>>>>
> >>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
> >>>>>> SCHED_IDLE might not be enough). If your main job consists
> >>>>>> of lots of relatively short wake-ups things like the min_granularity
> >>>>>> could have significant latency impact.
> >>>>>
> >>>>> cpu.headroom gives benefits in addition to optimizations on the
> >>>>> pre-emption side. By maintaining some idle time, fewer pre-emptions are
> >>>>> necessary, thus the main workload will get better latency.
> >>>>
> >>>> I agree with Morten's proposal. SCHED_IDLE should help your latency
> >>>> problem because the side job will be preempted immediately, unlike a
> >>>> normal cfs task, even one with the lowest priority.
> >>>> In addition to min_granularity, sched_period also has an impact on the
> >>>> time that a task has to wait before preempting the running task. Some
> >>>> sched_features like GENTLE_FAIR_SLEEPERS can also impact the
> >>>> latency of a task.
> >>>>
> >>>> It would be nice to know if the latency problem comes from contention
> >>>> on cache resources or if it's mainly because your main load waits
> >>>> before running on a CPU.
> >>>>
> >>>> Regards,
> >>>> Vincent
> >>>
> >>> Thanks for these suggestions. Here are some more tests to show the impact
> >>> of scheduler knobs and cpu.headroom.
> >>>
> >>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
> >>> -------------------------------------------------------------------------------
> >>> none      |      0       |       n/a       |    1 ms  |  45.20%  |     1.00
> >>> ffmpeg    |      0       |        1        |   10 ms  |   3.38%  |     1.46
> >>> ffmpeg    |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |     1.42
> >>> ffmpeg    |     20%      |   SCHED_IDLE    |    1 ms  |  19.00%  |     1.13
> >>> ffmpeg    |     30%      |   SCHED_IDLE    |    1 ms  |  27.60%  |     1.08
> >>>
> >>> In all these cases, the main workload is loaded with the same level of
> >>> traffic (requests per second). Main workload latency numbers are normalized
> >>> based on the baseline (first row).
> >>>
> >>> For the baseline, the main workload runs without any side workload, and the
> >>> system has about 45.20% idle CPU.
> >>>
> >>> The next two rows compare the impact of scheduling knobs cpu.weight and
> >>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
> >>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
> >>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protect
> >>> the main workload. However, they are not sufficient, as the latency
> >>> overhead is still high (>40%).
> >>>
> >>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
> >>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
> >>>
> >>> We can also see a clear correlation between latency and global idle CPU:
> >>> more idle CPU yields lower latency.
> >>>
> >>> Overall, these results show that cpu.headroom provides an effective
> >>> mechanism to control the latency impact of side workloads. Other knobs
> >>> could also help the latency, but they are not as effective and flexible
> >>> as cpu.headroom.
> >>>
> >>> Does this analysis address your concern?
> >
> > So, your results show that the sched_idle class doesn't provide the
> > intended behavior because it still delays the scheduling of sched_other
> > tasks. In fact, the wakeup path of the scheduler doesn't make any
> > distinction between a cpu running a sched_other task and a cpu running a
> > sched_idle task when looking for the idlest cpu, and it can create some
> > contention between sched_other tasks while a cpu runs a sched_idle
> > task.
>
> I don't think scheduling delay is the only (or dominating) factor of
> extra latency. Here are some data to show it.
>
> I measured IPC (instructions per cycle) of the main workload under
> different scenarios:
>
> side-load | cpu.headroom | side/cpu.weight  | IPC
> ----------------------------------------------------
>  none     |     0%       |       N/A        | 0.66
>  ffmpeg   |     0%       |    SCHED_IDLE    | 0.53
>  ffmpeg   |    20%       |    SCHED_IDLE    | 0.58
>  ffmpeg   |    30%       |    SCHED_IDLE    | 0.62
>
> These data show that the side workload has a negative impact on the
> main workload's IPC. And cpu.headroom could help reduce this impact.
>
> Therefore, while optimizations in the wakeup path should help the
> latency, cpu.headroom would add _significant_ benefit on top of that.

It seems normal that the side workload has a negative impact on IPC
because of resource sharing, but your previous results showed a 42%
regression of latency with sched_idle, which can't be linked only to
resource access contention.
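
As a rough sanity check of that point, here is a small sketch. It assumes
(my simplification, not something measured) that request latency is purely
CPU bound and scales as 1/IPC, and it compares that prediction with the
normalized latencies from your earlier results table. The variable names
are only for illustration.

```python
# IPC of the main workload (from the IPC table above); the baseline runs
# with no side workload.
BASELINE_IPC = 0.66
ipc = {"0%": 0.53, "20%": 0.58, "30%": 0.62}

# Normalized main-workload latency measured with a SCHED_IDLE side load
# (from the earlier results table), keyed by cpu.headroom.
measured = {"0%": 1.42, "20%": 1.13, "30%": 1.08}

# Assumption: latency is inversely proportional to IPC.
predicted = {h: BASELINE_IPC / v for h, v in ipc.items()}

# Ratio of measured latency to what the IPC drop alone would predict.
unexplained = {h: measured[h] / predicted[h] for h in ipc}

for h in ipc:
    print(f"headroom {h:>3}: predicted {predicted[h]:.2f}, "
          f"measured {measured[h]:.2f}, ratio {unexplained[h]:.2f}")
```

Under that assumption, the IPC drop accounts for roughly a 25% latency
increase without headroom, well short of the measured 42%, whereas the
20% and 30% headroom rows are close to the 1/IPC prediction.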

>
> Does this assessment make sense?
>
>
> > Viresh (cced to this email) is working on improving such behavior at
> > wake up and has sent a patch related to the subject:
> > https://lkml.org/lkml/2019/4/25/251
> > I'm curious if this would improve the results.
>
> I could try it with our workload next week (I am at LSF/MM this
> week). Also, please keep in mind that this test sometimes takes
> multiple days to set up and run.

Yes, I understand. It would be good to have a simpler setup that
reproduces the behavior of your workload, in order to do preliminary
tests and analyse the behavior.
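
As a possible starting point for such a simpler setup, something like the
sketch below could be used. It is only a hypothetical probe (the function
name and parameters are mine, and it is not a model of your workload): it
measures how late a periodically sleeping task wakes up, so it captures
scheduling delay but not IPC effects.

```python
import statistics
import time


def wakeup_overshoot(samples=200, interval_s=0.001):
    """Return how late (in seconds) each timed sleep woke up."""
    late = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval_s)
        # Time spent beyond the requested sleep, including timer slack
        # and any delay before the task got back on a CPU.
        late.append(time.monotonic() - start - interval_s)
    return late


if __name__ == "__main__":
    late = wakeup_overshoot()
    ordered = sorted(late)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    print(f"median {statistics.median(late) * 1e6:.0f} us, "
          f"p99 {p99 * 1e6:.0f} us")
```

The side load would be started separately, for example under
"chrt --idle 0", and the numbers compared with and without it.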

>
> Thanks,
> Song
>
> >
> > Regards,
> > Vincent
> >
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >>
> >> Could you please share your comments and suggestions on this work? Did
> >> the results address your questions/concerns?
> >>
> >> Thanks again,
> >> Song
> >>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Song
> >>>>>
> >>>>>>
> >>>>>> Morten
>