> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>
> Hi Song,
>
> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@xxxxxx> wrote:
>>
>> Hi Morten and Vincent,
>>
>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>>
>>> Hi Vincent,
>>>
>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>>>>
>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>
>>>>> Hi Morten,
>>>>>
>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
>>>>>>
>>>>
>>>>>>
>>>>>> The bit that isn't clear to me is _why_ adding idle cycles helps your
>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>
>>>>> We think the latency improvements actually come from watering down the
>>>>> impact of the side jobs. It does not just statistically improve the
>>>>> average latency numbers; it also reduces the resource contention caused
>>>>> by the side workload. I don't know whether it comes from reduced
>>>>> contention on ALUs, memory bandwidth, CPU caches, or something else,
>>>>> but we saw reduced latencies when headroom is used.
>>>>>
>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>> achieve a better average latency. Am I missing something?
>>>>>>
>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>> compare with when throttling is active/not active?
>>>>>
>>>>> cfs_bandwidth adjusts the allowed runtime for each task_group every
>>>>> period (configurable, 100 ms by default). The cpu.headroom logic applies
>>>>> gentle throttling, so that the side workload gets some runtime in every
>>>>> period. Therefore, if we look at a time window equal to or bigger than
>>>>> 100 ms, we don't really see "throttling active time" vs. "throttling
>>>>> inactive time".
>>>>>
>>>>>>
>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>> your use-case or if what you are really after is something which is
>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>
>>>>> The experiments show that cpu.weight does its job for prioritization:
>>>>> the main workload gets priority to use the CPU, while the side workload
>>>>> only fills the idle CPU. However, this is not sufficient, as the side
>>>>> workload creates enough contention to impact the main workload.
>>>>>
>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>> SCHED_IDLE might not be enough). If your main job consists
>>>>>> of lots of relatively short wake-ups, things like the min_granularity
>>>>>> could have a significant latency impact.
>>>>>
>>>>> cpu.headroom gives benefits in addition to optimizations on the
>>>>> preemption side. By maintaining some idle time, fewer preemption
>>>>> actions are necessary, so the main workload gets better latency.
>>>>
>>>> I agree with Morten's proposal: SCHED_IDLE should help your latency
>>>> problem, because the side job will be directly preempted, unlike a
>>>> normal cfs task even at the lowest priority.
>>>> In addition to min_granularity, sched_period also has an impact on the
>>>> time that a task has to wait before preempting the running task.
>>>> Some sched_features, like GENTLE_FAIR_SLEEPERS, can also impact the
>>>> latency of a task.
>>>>
>>>> It would be nice to know if the latency problem comes from contention
>>>> on cache resources or if it's mainly because your main load waits
>>>> before running on a CPU.
>>>>
>>>> Regards,
>>>> Vincent
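[ Aside, for anyone reproducing the SCHED_IDLE configurations discussed in
this thread: a side job can be moved into SCHED_IDLE with
sched_setscheduler(2). A minimal sketch (illustration only, not necessarily
how our test setup does it; error handling trimmed):

#define _GNU_SOURCE             /* SCHED_IDLE in <sched.h> */
#include <sched.h>
#include <stdio.h>

int main(void)
{
        /* SCHED_IDLE requires a static priority of 0 */
        struct sched_param sp = { .sched_priority = 0 };

        /* pid 0 == the calling task */
        if (sched_setscheduler(0, SCHED_IDLE, &sp)) {
                perror("sched_setscheduler");
                return 1;
        }

        /* ... run or exec the side workload here ... */
        return 0;
}

The same can be done from the shell with "chrt --idle 0 <command>". ]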
>>>
>>> Thanks for these suggestions. Here are some more tests to show the impact
>>> of scheduler knobs and cpu.headroom.
>>>
>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>> --------------------------------------------------------------------------------
>>> none      |           0  | n/a             |  1 ms    |  45.20%  |  1.00
>>> ffmpeg    |           0  | 1               | 10 ms    |   3.38%  |  1.46
>>> ffmpeg    |           0  | SCHED_IDLE      |  1 ms    |   5.69%  |  1.42
>>> ffmpeg    |         20%  | SCHED_IDLE      |  1 ms    |  19.00%  |  1.13
>>> ffmpeg    |         30%  | SCHED_IDLE      |  1 ms    |  27.60%  |  1.08
>>>
>>> In all these cases, the main workload is driven with the same level of
>>> traffic (requests per second). The main workload latency numbers are
>>> normalized against the baseline (first row).
>>>
>>> For the baseline, the main workload runs without any side workload, and
>>> the system has about 45.20% idle CPU.
>>>
>>> The next two rows compare the impact of the scheduling knobs cpu.weight
>>> and sched_min_granularity. With a cpu.weight of 1 and a min_granularity
>>> of 10 ms, we see a latency of 1.46; with SCHED_IDLE and a min_granularity
>>> of 1 ms, we see a latency of 1.42. So SCHED_IDLE and min_granularity help
>>> protect the main workload, but they are not sufficient, as the latency
>>> overhead is still high (>40%).
>>>
>>> The last two rows show the benefit of cpu.headroom: with 20% headroom,
>>> the latency is 1.13, while with 30% headroom it is 1.08.
>>>
>>> We can also see a clear correlation between latency and global idle CPU:
>>> more idle CPU yields lower latency.
>>>
>>> Overall, these results show that cpu.headroom provides an effective
>>> mechanism to control the latency impact of side workloads. Other knobs
>>> can also help the latency, but they are not as effective and flexible
>>> as cpu.headroom.
>>>
>>> Does this analysis address your concern?
>
> So, your results show that the sched_idle class doesn't provide the
> intended behavior, because it still delays the scheduling of sched_other
> tasks. In fact, the wakeup path of the scheduler doesn't make any
> difference between a cpu running a sched_other task and a cpu running a
> sched_idle task when looking for the idlest cpu, so it can create
> contention between sched_other tasks even while a cpu is running only a
> sched_idle task.

I don't think scheduling delay is the only (or dominating) factor in the
extra latency. Here are some data to show it. I measured the IPC
(instructions per cycle) of the main workload under different scenarios:

side-load | cpu.headroom | side/cpu.weight | IPC
----------------------------------------------------
none      |       0%     | N/A             | 0.66
ffmpeg    |       0%     | SCHED_IDLE      | 0.53
ffmpeg    |      20%     | SCHED_IDLE      | 0.58
ffmpeg    |      30%     | SCHED_IDLE      | 0.62

These data show that the side workload has a negative impact on the main
workload's IPC, and that cpu.headroom helps reduce this impact. Therefore,
while optimizations in the wakeup path should help the latency, cpu.headroom
would add a _significant_ benefit on top of them.

Does this assessment make sense?
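In case it helps with reproducing the IPC numbers: IPC is instructions
retired divided by CPU cycles, which the hardware PMU counts directly. Below
is a minimal sketch of reading the two counters for one task with
perf_event_open(2). It is an illustration only, not necessarily the exact
tooling we use; it needs sufficient privileges or a permissive
kernel.perf_event_paranoid setting, and aggregating over all threads of the
main workload is left out.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Open one hardware counter for 'pid', counting on all CPUs. */
static int open_counter(pid_t pid, uint64_t config)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = config;

        return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(int argc, char **argv)
{
        pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();
        int cyc_fd = open_counter(pid, PERF_COUNT_HW_CPU_CYCLES);
        int ins_fd = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS);
        uint64_t cycles = 0, insns = 0;

        if (cyc_fd < 0 || ins_fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        sleep(10);      /* measurement window */

        if (read(cyc_fd, &cycles, sizeof(cycles)) != sizeof(cycles) ||
            read(ins_fd, &insns, sizeof(insns)) != sizeof(insns)) {
                perror("read");
                return 1;
        }

        printf("instructions=%llu cycles=%llu IPC=%.2f\n",
               (unsigned long long)insns, (unsigned long long)cycles,
               cycles ? (double)insns / cycles : 0.0);
        return 0;
}

From the shell, "perf stat -e instructions,cycles -p <pid>" reports the same
two counters without any extra code.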
> Viresh (cc'ed on this email) is working on improving this behavior at
> wakeup and has sent a patch related to the subject:
> https://lkml.org/lkml/2019/4/25/251
> I'm curious whether this would improve the results.

I could try it with our workload next week (I am at LSF/MM this week). Also,
please keep in mind that this test sometimes takes multiple days to set up
and run.

Thanks,
Song

>
> Regards,
> Vincent
>
>>>
>>> Thanks,
>>> Song
>>>
>>
>> Could you please share your comments and suggestions on this work? Did
>> the results address your questions/concerns?
>>
>> Thanks again,
>> Song
>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Song
>>>>>
>>>>>>
>>>>>> Morten