Christian Borntraeger <borntraeger@xxxxxxxxxx> writes:

> On 23.07.21 18:21, Mel Gorman wrote:
>> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
>>>> sched: Do not select highest priority task to run if it should be skipped
>>>>
>>>> <SNIP>
>>>>
>>>> index 44c452072a1b..ddc0212d520f 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>  			se = second;
>>>>  	}
>>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>>> +	if (cfs_rq->next &&
>>>> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>>>>  		/*
>>>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>>>  		 */
>>>>
>>>
>>> I do see a reduction in ignored yields, but from a performance aspect for my
>>> testcases this patch does not provide a benefit, while the simple
>>> 	curr->vruntime += sysctl_sched_min_granularity;
>>> does.
>> I'm still not a fan because vruntime gets distorted. From the docs
>>    Small detail: on "ideal" hardware, at any time all tasks would have the same
>>    p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
>>    would ever get "out of balance" from the "ideal" share of CPU time
>> If yield_to impacts this "ideal share" then it could have other
>> consequences.
>> I think your patch may be performing better in your test case because every
>> "wrong" task selected that is not the yield_to target gets penalised and
>> so the yield_to target gets pushed up the list.
>>
>>> I still think that your approach is probably the cleaner one, any chance to improve this
>>> somehow?
>>>
>> Potentially. The patch was a bit off because while it noticed that skip
>> was not being obeyed, the fix was clumsy and isolated. The current flow is
>> 1. pick se == left as the candidate
>> 2. try pick a different se if the "ideal" candidate is a skip candidate
>> 3. Ignore the se update if next or last are set
>> Step 3 looks off because it ignores skip if next or last buddies are set
>> and I don't think that was intended. Can you try this?
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 44c452072a1b..d56f7772a607 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>  			se = second;
>>  	}
>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>>  		/*
>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>  		 */
>>  		se = cfs_rq->next;
>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>>  		/*
>>  		 * Prefer last buddy, try to return the CPU to a preempted task.
>>  		 */
>>
>
> This one alone does not seem to make a difference. Neither in ignored yield, nor
> in performance.
>
> Your first patch does really help in terms of ignored yields when
> all threads are pinned to one host CPU. After that we do have no ignored yield
> it seems. But it does not affect the performance of my testcase.
> I did some more experiments and I removed the wakeup_preempt_entity checks in
> pick_next_entity - assuming that this will result in source always being stopped
> and target always being picked. But still, no performance difference.
> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
> controller). I will try to better understand the scheduler logic and do some more
> testing. If you have anything that I should test, let me know.
>
> Christian

If both yielder and target are in the same cpu cgroup or the cpu cgroup
is disabled (i.e., if cfs_rq_of(p->se) matches), you could try

	if (p->se.vruntime > rq->curr->se.vruntime)
		swap(p->se.vruntime, rq->curr->se.vruntime);

as well as the existing buddy flags, as an entirely fair vruntime boost
to the target.

For when they aren't direct siblings, you /could/ use find_matching_se,
but it's much less clear that's desirable, since it would yield vruntime
for the entire hierarchy to the target's hierarchy.
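
To make the swap idea concrete, here is a rough user-space sketch, not
kernel code. The names toy_cfs_rq, toy_se and toy_yield_boost are made up
for illustration, and "same cfs_rq" is assumed to mean plain pointer
equality. It only models the two-entity case where yielder and target are
direct siblings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct toy_cfs_rq {
	int id;
};

struct toy_se {
	uint64_t vruntime;
	struct toy_cfs_rq *cfs_rq;	/* queue this entity sits on */
};

/*
 * Give the yield_to target the smaller of the two vruntime values.
 * Returns true if a swap happened, false if nothing was changed.
 */
static bool toy_yield_boost(struct toy_se *curr, struct toy_se *target)
{
	uint64_t tmp;

	assert(curr && target);

	/* Only handle direct siblings: both entities on the same queue. */
	if (curr->cfs_rq != target->cfs_rq)
		return false;

	/* Target is already ahead of (or level with) the yielder. */
	if (target->vruntime <= curr->vruntime)
		return false;

	/* Exchange the values; the sum over both entities is unchanged. */
	tmp = target->vruntime;
	target->vruntime = curr->vruntime;
	curr->vruntime = tmp;
	return true;
}
```

Because the two values are exchanged rather than one being bumped, the
total vruntime across the pair stays the same, which is the sense in which
the boost is fair. Calling it a second time with the same pair does
nothing, since the target then already holds the smaller value.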