Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller

Song Liu <songliubraving@xxxxxx> · Tue, 30 Apr 2019 16:54:18 +0000

> On Apr 30, 2019, at 12:20 PM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> 
> Hi Song,
> 
> On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@xxxxxx> wrote:
>> 
>> 
>> 
>>> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>>> 
>>> Hi Song,
>>> 
>>> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@xxxxxx> wrote:
>>>> 
>>>> Hi Morten and Vincent,
>>>> 
>>>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>>>> 
>>>>> Hi Vincent,
>>>>> 
>>>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>> 
>>>>>>> Hi Morten,
>>>>>>> 
>>>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
>>>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>>> 
>>>>>>> We think the latency improvements actually come from watering down the
>>>>>>> impact of side jobs. It is not just statistically improving average
>>>>>>> latency numbers, but also reduces resource contention caused by the side
>>>>>>> workload. I don't know whether it is from reducing contention of ALUs,
>>>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>>>> latencies when headroom is used.
>>>>>>> 
>>>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>>>> achieve a better average latency. Am I missing something?
>>>>>>>> 
>>>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>>>> compare with when throttling is active/not active?
>>>>>>> 
>>>>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>>>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>>>>>> don't really see "throttling active time" vs. "throttling inactive time".
>>>>>>> 
>>>>>>>> 
>>>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>>>> your use-case or if what you are really after is something which is
>>>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>>> 
>>>>>>> The experiments show that cpu.weight handles priority properly: the
>>>>>>> main workload gets priority to use the CPU, while the side workload only
>>>>>>> fills the idle CPU. However, this is not sufficient, as the side workload
>>>>>>> creates big enough contention to impact the main workload.
>>>>>>> 
>>>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>>>> SCHED_IDLE might not be enough). If your main job consists
>>>>>>>> of lots of relatively short wake-ups things like the min_granularity
>>>>>>>> could have significant latency impact.
>>>>>>> 
>>>>>>> cpu.headroom gives benefits in addition to optimizations on the
>>>>>>> preemption side. By maintaining some idle time, fewer preemptions are
>>>>>>> necessary, thus the main workload will get better latency.
>>>>>> 
>>>>>> I agree with Morten's proposal: SCHED_IDLE should help your latency
>>>>>> problem because the side job will be preempted immediately, unlike a
>>>>>> normal cfs task, even one at the lowest priority.
>>>>>> In addition to min_granularity, sched_period also has an impact on the
>>>>>> time that a task has to wait before preempting the running task. Some
>>>>>> sched_features like GENTLE_FAIR_SLEEPERS can also impact the
>>>>>> latency of a task.
>>>>>> 
>>>>>> It would be nice to know if the latency problem comes from contention
>>>>>> on cache resources or if it's mainly because your main load waits
>>>>>> before running on a CPU.
>>>>>> 
>>>>>> Regards,
>>>>>> Vincent
>>>>> 
>>>>> Thanks for these suggestions. Here are some more tests to show the impact
>>>>> of scheduler knobs and cpu.headroom.
>>>>> 
>>>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>>>> --------------------------------------------------------------------------------
>>>>> none      |     0%       |      n/a        |    1 ms  |  45.20%  |   1.00
>>>>> ffmpeg    |     0%       |       1         |   10 ms  |   3.38%  |   1.46
>>>>> ffmpeg    |     0%       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
>>>>> ffmpeg    |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
>>>>> ffmpeg    |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
>>>>> 
>>>>> In all these cases, the main workload is loaded with the same level of
>>>>> traffic (requests per second). Main workload latency numbers are
>>>>> normalized to the baseline (first row).
>>>>> 
>>>>> For the baseline, the main workload runs without any side workload, and
>>>>> the system has about 45.20% idle CPU.
>>>>> 
>>>>> The next two rows compare the impact of scheduling knobs cpu.weight and
>>>>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
>>>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protect
>>>>> the main workload. However, they are not sufficient, as the latency
>>>>> overhead is still high (>40%).
>>>>> 
>>>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>>>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
>>>>> 
>>>>> We can also see a clear correlation between latency and global idle CPU:
>>>>> more idle CPU yields lower latency.
>>>>> 
>>>>> Overall, these results show that cpu.headroom provides an effective
>>>>> mechanism to control the latency impact of side workloads. Other knobs
>>>>> could also help the latency, but they are not as effective and flexible
>>>>> as cpu.headroom.
>>>>> 
>>>>> Does this analysis address your concern?
>>> 
>>> So, your results show that the sched_idle class doesn't provide the
>>> intended behavior because it still delays the scheduling of sched_other
>>> tasks. In fact, the wakeup path of the scheduler doesn't make any
>>> distinction between a cpu running a sched_other task and a cpu running a
>>> sched_idle task when looking for the idlest cpu, and it can create some
>>> contention between sched_other tasks while another cpu runs a
>>> sched_idle task.
>> 
>> I don't think scheduling delay is the only (or dominating) factor of
>> extra latency. Here are some data to show it.
>> 
>> I measured IPC (instructions per cycle) of the main workload under
>> different scenarios:
>> 
>> side-load | cpu.headroom | side/cpu.weight  | IPC
>> ----------------------------------------------------
>> none      |     0%       |       N/A        | 0.66
>> ffmpeg    |     0%       |    SCHED_IDLE    | 0.53
>> ffmpeg    |    20%       |    SCHED_IDLE    | 0.58
>> ffmpeg    |    30%       |    SCHED_IDLE    | 0.62
>> 
>> These data show that the side workload has a negative impact on the
>> main workload's IPC. And cpu.headroom could help reduce this impact.
>> 
>> Therefore, while optimizations in the wakeup path should help the
>> latency, cpu.headroom would add _significant_ benefit on top of that.
> 
> It seems normal that the side workload has a negative impact on IPC
> because of resource sharing, but your previous results showed a 42%
> regression of latency with sched_idle, which can't be only linked to
> resource access contention.

Agreed. I think both scheduling latency and resource contention
contribute noticeable latency overhead to the main workload. The
scheduler optimization by Viresh should help reduce the scheduling
latency, but it won't help with resource contention. Hopefully, with
optimizations in the scheduler, we can meet the latency target with a
smaller cpu.headroom. However, I don't think scheduler optimizations
will eliminate the need for cpu.headroom, as resource contention
always exists, and its impact could be significant.
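
As a rough illustration only, the two tables quoted below can be put side
by side with a few lines of Python. This is a sketch under assumptions:
it pairs the IPC rows and the latency rows that share the same
cpu.headroom setting as if they came from comparable runs, which the
thread does not state, and the helper names are made up for this example.
IPC loss does not map linearly onto request latency, so the output is
only a sanity check that IPC degradation (contention) and latency
overhead shrink together as headroom grows, and that the IPC loss alone
is smaller than the latency overhead at 0% headroom.

```python
# Numbers copied from the two tables quoted in this thread.
# Pairing rows by cpu.headroom is an assumption made for illustration.
BASELINE_IPC = 0.66  # main workload, no side load

ROWS = [
    # (cpu.headroom %, main-workload IPC, normalized main latency)
    (0, 0.53, 1.42),
    (20, 0.58, 1.13),
    (30, 0.62, 1.08),
]


def ipc_loss_pct(ipc, baseline=BASELINE_IPC):
    """Percent of IPC lost relative to the no-side-load baseline."""
    return (baseline - ipc) / baseline * 100.0


def latency_overhead_pct(norm_latency):
    """Percent latency overhead relative to the baseline latency of 1.00."""
    return (norm_latency - 1.0) * 100.0


def summarize(rows=ROWS):
    """Return (headroom, IPC loss %, latency overhead %) per row, rounded."""
    return [
        (headroom,
         round(ipc_loss_pct(ipc), 1),
         round(latency_overhead_pct(latency), 1))
        for headroom, ipc, latency in rows
    ]


if __name__ == "__main__":
    print("headroom | IPC loss | latency overhead")
    for headroom, loss, overhead in summarize():
        print(f"{headroom:7d}% | {loss:7.1f}% | {overhead:7.1f}%")
```

Both columns decrease monotonically with more headroom, which is
consistent with the view that contention and scheduling delay both
contribute.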

Do you have further concerns with this patchset?

Thanks,
Song 

>> 
>> Does this assessment make sense?
>> 
>> 
>>> Viresh (cc'ed on this email) is working on improving such behavior at
>>> wake up and has sent a patch related to the subject:
>>> https://lkml.org/lkml/2019/4/25/251
>>> I'm curious if this would improve the results.
>> 
>> I could try it with our workload next week (I am at LSF/MM this
>> week). Also, please keep in mind that this test sometimes takes
>> multiple days to set up and run.
> 
> Yes, I understand. It would be good to have a simpler setup that
> reproduces the behavior of your setup in order to do preliminary tests
> and analyse the behavior.
> 
>> 
>> Thanks,
>> Song
>> 
>>> 
>>> Regards,
>>> Vincent
>>> 
>>>>> 
>>>>> Thanks,
>>>>> Song
>>>>> 
>>>> 
>>>> Could you please share your comments and suggestions on this work? Did
>>>> the results address your questions/concerns?
>>>> 
>>>> Thanks again,
>>>> Song
>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Song
>>>>>>> 
>>>>>>>> 
>>>>>>>> Morten