On 3/7/23 3:49 PM, Vincent Guittot wrote:
> On Tuesday, 07 March 2023 at 00:34:49 (+0530), Shrikanth Hegde wrote:
>>> On Monday, 06 March 2023 at 17:03:30 (+0530), Shrikanth Hegde wrote:
>>>> On 3/5/23 6:33 PM, Vincent Guittot wrote:
>>>>> On Sat, 4 Mar 2023 at 16:13, Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>> On 3/3/23 10:01 PM, Vincent Guittot wrote:
>>>>>>> On Thursday, 02 March 2023 at 23:37:52 (+0530), Shrikanth Hegde wrote:
>>>>>>>> On 3/2/23 8:30 PM, Shrikanth Hegde wrote:
>>>>>>>>> On 3/2/23 6:47 PM, Vincent Guittot wrote:
>>>>>>>>>> On Thu, 2 Mar 2023 at 12:00, Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>> On 3/2/23 1:20 PM, Vincent Guittot wrote:
>>>>>>>>>>>> On Wed, 1 Mar 2023 at 19:48, shrikanth hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> On 2/24/23 3:04 PM, Vincent Guittot wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>>>> Ran schbench and hackbench with this patch series. Here the comparison
>>>>>>>>>>>>> is between the 6.2 stable tree, 6.2 + patch, and 6.2 + patch + the
>>>>>>>>>>>>> above re-arrange of latency_nice. Ran two cgroups: one cgroup running
>>>>>>>>>>>>> stress-ng at 50% (group1) and the other running these benchmarks
>>>>>>>>>>>>> (group2). Set the latency nice of group2 to -20. These were run on a
>>>>>>>>>>>>> Power system with 12 cores with SMT=8, a total of 96 CPUs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> schbench gets lower latency compared to the stable tree, whereas
>>>>>>>>>>>>> hackbench seems to regress in this case. Maybe I am doing something
>>>>>>>>>>>>> wrong. I will re-run and attach the numbers to the series.
>>>>>>>>>>>>> Please suggest any variation of the test I should try.
>>>>>>>>>>>> hackbench takes advantage of latency nice 19 as it mainly wants a
>>>>>>>>>>>> longer slice to move forward rather than preempting others all the
>>>>>>>>>>>> time
>>>>>>>>>>> hackbench still seems to regress at different latency nice values
>>>>>>>>>>> compared to the 6.2 baseline in this case, up to 50% in some cases.
>>>>>>>>>>>
>>>>>>>>>>> 12-core powerpc system with SMT=8, i.e. 96 CPUs,
>>>>>>>>>>> running 2 CPU cgroups. No quota assigned.
>>>>>>>>>>> The 1st cgroup is running stress-ng with 48 threads, consuming 50% of
>>>>>>>>>>> the CPUs. Latency nice is not changed for this cgroup.
>>>>>>>>>>> The 2nd cgroup is running hackbench. This cgroup is assigned the
>>>>>>>>>>> different latency nice values of 0, -20 and 19.
>>>>>>>>>> According to your other emails, you are using the cgroup interface and
>>>>>>>>>> not the task's one. Do I get it right?
>>>>>>>>> Right. I create the cgroup, attach a bash shell with echo $$, assign the
>>>>>>>>> latency nice to the cgroup, and run hackbench from that bash prompt.
>>>>>>>>>
>>>>>>>>>> I haven't run such tests in a cgroup, but at least the test with
>>>>>>>>>> latency_nice == 0 should not make any noticeable difference. Does this
>>>>>>>>>> include the re-arrange patch that you proposed previously?
>>
>> Ran the test on a different system altogether. I don't see a similar regression
>> there. In fact, latency nice is helping to reduce latency as expected.
>> It is a much bigger system with 60 cores, i.e. a total of 480 CPUs.
>>
>> Tested in the same way: created two cgroups, one running the micro-benchmarks
>> and the other running stress-ng at different utilization points.
>> This data is at the 50% utilization point. Similar observations w.r.t. latency
>> are seen at 0%, 25%, 75% and 100% utilization as well.
>>
> Thanks for testing on a different system, which seems to get results aligned with
> what Prateek and I have seen on our system.
>
>
>> ==========
>> schbench
>> ==========
>>              6.2       6.2 + V12 + LN=0
>> Groups: 1
>> 50.0th:      14.0      12.5
>> 75.0th:      16.5      14.0
>> 90.0th:      18.5      15.5
>> 95.0th:      20.5      17.0
>> 99.0th:      27.5      21.0
>> 99.5th:      36.0      23.5
>> Groups: 2
>> 50.0th:      14.0      16.0
>> 75.0th:      17.0      18.0
>> 90.0th:      20.0      21.0
>> 95.0th:      23.0      23.0
>> 99.0th:      71.0      34.0
>> 99.5th:      1170.0    96.0
>> 99.9th:      5088.0    3212.0
>> Groups: 4
>> 50.0th:      20.5      19.5
>> 75.0th:      24.5      22.5
>> 90.0th:      31.0      26.0
>> 95.0th:      260.5     28.0
>> 99.0th:      3644.0    35.0
>> 99.5th:      5152.0    44.5
>> 99.9th:      8076.0    168.5
>> Groups: 8
>> 50.0th:      26.0      25.5
>> 75.0th:      32.5      31.5
>> 90.0th:      41.5      36.5
>> 95.0th:      794.0     39.5
>> 99.0th:      5992.0    66.0
>> 99.5th:      7208.0    159.0
>> 99.9th:      9392.0    1604.0
>> Groups: 16
>> 50.0th:      37.5      34.0
>> 75.0th:      49.5      44.5
>> 90.0th:      70.0      53.5
>> 95.0th:      1284.0    58.5
>> 99.0th:      5600.0    102.5
>> 99.5th:      7216.0    368.5
>> 99.9th:      9328.0    5192.0
>> Groups: 32
>> 50.0th:      59.0      54.5
>> 75.0th:      83.0      74.5
>> 90.0th:      118.5     91.0
>> 95.0th:      1921.0    100.5
>> 99.0th:      6672.0    317.0
>> 99.5th:      8252.0    2264.0
>> 99.9th:      10448.0   8388.0
>>
>>
>> ==========
>> hackbench
>> ==========
>>
>> Type            Groups    6.2    |  6.2 + V12 + LN=0
>> Process         10        0.19   |  0.19
>> Process         20        0.34   |  0.34
>> Process         30        0.45   |  0.44
>> Process         40        0.58   |  0.57
>> Process         50        0.70   |  0.69
>> Process         60        0.82   |  0.80
>> thread          10        0.20   |  0.20
>> thread          20        0.36   |  0.36
>> Process(Pipe)   10        0.24   |  0.21
>> Process(Pipe)   20        0.46   |  0.40
>> Process(Pipe)   30        0.65   |  0.58
>> Process(Pipe)   40        0.90   |  0.68
>> Process(Pipe)   50        1.04   |  0.83
>> Process(Pipe)   60        1.16   |  0.86
>> thread(Pipe)    10        0.19   |  0.18
>> thread(Pipe)    20        0.46   |  0.37
>>
>>
> [...]
>
>>>>>>
>>>>>> Do you want me to try any other experiment on this?
>>>>> Yes, it would be good to know which of the 3 changes in the patch creates
>>>>> the regression.
>>>>>
>>>>> I suspect the 1st change to be the root cause of your problem, but it
>>>>> would be good if you could confirm my assumption with some tests.
>>>>>
>>>>> Thanks
>>>> Applied each change individually. The 3rd change seems to cause the
>>>> regression. Keeping only the 3rd change, the numbers are the same as stable
>>>> 6.2 for a latency nice value of 0.
>>> Ok, it's patch 1 that aims to prevent some unfairness with a low-weight
>>> waking task. And your platform probably falls into the last part of the commit:
>>>
>>> "Strictly speaking, we should use cfs->min_vruntime instead of
>>> curr->vruntime but it doesn't worth the additional overhead and complexity
>>> as the vruntime of current should be close to min_vruntime if not equal."
>>>
>>> Could you try the patch below on top of v12?
>>>
>>> ---
>>>  kernel/sched/fair.c | 21 +++++++++++----------
>>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e2aeb4511686..77b03a280912 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5049,7 +5049,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>  }
>>>
>>>  static int
>>> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
>>> +wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se, struct cfs_rq *cfs_rq);
>>>
>>>  /*
>>>   * Pick the next process, keeping these things in mind, in this order:
>>> @@ -5088,16 +5088,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>  				second = curr;
>>>  		}
>>>
>>> -		if (second && wakeup_preempt_entity(second, left) < 1)
>>> +		if (second && wakeup_preempt_entity(second, left, cfs_rq) < 1)
>>>  			se = second;
>>>  	}
>>>
>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left, cfs_rq) < 1) {
>>>  		/*
>>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>>  		 */
>>>  		se = cfs_rq->next;
>>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left, cfs_rq) < 1) {
>>>  		/*
>>>  		 * Prefer last buddy, try to return the CPU to a preempted task.
>>>  		 */
>>> @@ -5107,7 +5107,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>  	/* Check for latency sensitive entity waiting for running */
>>>  	latency = __pick_first_latency(cfs_rq);
>>>  	if (latency && (latency != se) &&
>>> -	    wakeup_preempt_entity(latency, se) < 1)
>>> +	    wakeup_preempt_entity(latency, se, cfs_rq) < 1)
>>>  		se = latency;
>>>
>>>  	return se;
>>> @@ -7808,7 +7808,7 @@ static unsigned long wakeup_gran(struct sched_entity *se)
>>>   *
>>>   */
>>>  static int
>>> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>>> +wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se, struct cfs_rq *cfs_rq)
>>>  {
>>>  	s64 gran, vdiff = curr->vruntime - se->vruntime;
>>>  	s64 offset = wakeup_latency_gran(curr, se);
>>> @@ -7818,6 +7818,8 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>>>
>>>  	gran = offset + wakeup_gran(se);
>>>
>>> +	if (vdiff > gran)
>>> +		return 1;
>>>  	/*
>>>  	 * At wake up, the vruntime of a task is capped to not be older than
>>>  	 * a sched_latency period compared to min_vruntime. This prevents long
>>> @@ -7827,9 +7829,8 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>>>  	 * for low priority task. Make sure that long sleeping task will get a
>>>  	 * chance to preempt current.
>>>  	 */
>>> -	gran = min_t(s64, gran, get_latency_max());
>>> -
>>> -	if (vdiff > gran)
>>> +	vdiff = cfs_rq->min_vruntime - se->vruntime;
>>> +	if (vdiff > get_latency_max())
>>>  		return 1;
>>>
>>>  	return 0;
>>> @@ -7933,7 +7934,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>>>  		return;
>>>
>>>  	update_curr(cfs_rq_of(se));
>>> -	if (wakeup_preempt_entity(se, pse) == 1) {
>>> +	if (wakeup_preempt_entity(se, pse, cfs_rq_of(se)) == 1) {
>>>  		/*
>>>  		 * Bias pick_next to pick the sched entity that is
>>>  		 * triggering this preemption.
>>> --
>>> 2.34.1
>>
>> Tried the above patch on top of V12. Numbers are worse than V12. We may be
>> running into a corner case on this system.
> Yes, it can be a corner case.
>
> Nevertheless, the patch above has a problem: it does an unsigned comparison instead
> of a signed one. I have forced the signed comparison in the patch below, to be
> applied on top of the previous one:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 77b03a280912..22a497f92dbb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7830,7 +7830,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se, struct
>  	 * chance to preempt current.
>  	 */
>  	vdiff = cfs_rq->min_vruntime - se->vruntime;
> -	if (vdiff > get_latency_max())
> +	if (vdiff > (s64)get_latency_max())
>  		return 1;
>
>  	return 0;

Tested the above patch on top of the previous patch + V12. Numbers are still worse
than V12, the same as V12 + the previous patch.

>
>
>> Type            Groups    6.2    |  6.2 + V12
>>
>> Process         10        0.33   |  0.44
>> Process         20        0.61   |  0.90
>> Process         30        0.87   |  1.29
>> Process         40        1.10   |  1.69
>> Process         50        1.34   |  2.08
>> Process         60        1.58   |  2.39
>> thread          10        0.36   |  0.53
>> thread          20        0.64   |  0.94
>> Process(Pipe)   10        0.18   |  0.46
>> Process(Pipe)   20        0.32   |  0.75
>> Process(Pipe)   30        0.42   |  1.01
>> Process(Pipe)   40        0.56   |  1.15
>> Process(Pipe)   50        0.68   |  1.38
>> Process(Pipe)   60        0.80   |  1.56
>>
>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index cdcd991bbcf1..c89c201dd164 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -7827,7 +7827,6 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>>>>  	 * for low priority task. Make sure that long sleeping task will get a
>>>>  	 * chance to preempt current.
>>>>  	 */
>>>> -	gran = min_t(s64, gran, get_latency_max());
>>>>
>>>>  	if (vdiff > gran)
>>>>  		return 1;
>>>>
>>>>
>>>>
> [...]
>
>>
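As a side note on the (s64) cast above: the pitfall it fixes can be reproduced in a
few lines of userspace C. The sketch below is illustrative only; its
get_latency_max() stub is hypothetical and merely mirrors the one property implied
by the fix, namely that the real helper returns an unsigned type, so a bare
comparison against an s64 promotes the signed side to unsigned.

#include <stdio.h>
#include <stdint.h>

typedef int64_t s64;

/*
 * Hypothetical stand-in for the kernel helper. Only its return type
 * matters here: because it is unsigned, comparing it against an s64
 * promotes the signed side to unsigned (on LP64 targets, where
 * unsigned long is 64 bits).
 */
static unsigned long get_latency_max(void)
{
	return 24000000UL;	/* arbitrary example threshold, in ns */
}

int main(void)
{
	/*
	 * A task whose vruntime is already ahead of min_vruntime yields a
	 * negative vdiff, i.e. it should NOT be allowed to preempt.
	 */
	s64 vdiff = -1000;

	/*
	 * Unsigned comparison: -1000 wraps to a value near 2^64, far above
	 * the threshold, so the test wrongly signals preemption.
	 */
	if (vdiff > get_latency_max())
		printf("unsigned compare: preempt (wrong)\n");

	/*
	 * Signed comparison, as forced by the (s64) cast in the fix:
	 * -1000 is below the threshold, so no preemption is signalled.
	 */
	if (vdiff > (s64)get_latency_max())
		printf("signed compare: preempt\n");
	else
		printf("signed compare: no preempt (correct)\n");

	return 0;
}

On an LP64 system this prints the "wrong" line for the bare comparison and
"no preempt (correct)" for the cast one, which is the behavioral difference the
follow-up patch targets.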