On 27.07.21 20:57, Benjamin Segall wrote:
Christian Borntraeger <borntraeger@xxxxxxxxxx> writes:
On 23.07.21 18:21, Mel Gorman wrote:
On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
sched: Do not select highest priority task to run if it should be skipped
<SNIP>
index 44c452072a1b..ddc0212d520f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next &&
+	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
I do see a reduction in ignored yields, but from a performance aspect for my
testcases this patch does not provide a benefit, while the simple
curr->vruntime += sysctl_sched_min_granularity;
does.
I'm still not a fan because vruntime gets distorted. From the docs:
Small detail: on "ideal" hardware, at any time all tasks would have the same
p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
would ever get "out of balance" from the "ideal" share of CPU time.
If yield_to impacts this "ideal share" then it could have other
consequences.
I think your patch may be performing better in your test case because every
"wrong" task selected that is not the yield_to target gets penalised and
so the yield_to target gets pushed up the list.
I still think that your approach is probably the cleaner one, any chance to improve this
somehow?
Potentially. The patch was a bit off because while it noticed that skip
was not being obeyed, the fix was clumsy and isolated. The current flow is
1. Pick se == left as the candidate
2. Try to pick a different se if the "ideal" candidate is a skip candidate
3. Ignore the se update if next or last are set
Step 3 looks off because it ignores skip if next or last buddies are set
and I don't think that was intended. Can you try this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..d56f7772a607 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
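
The three-step flow described above, and the effect of comparing the buddies
against se instead of left, can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: the toy_* names and the struct
layouts are hypothetical, the granularity is a fixed number (the kernel derives
it per entity), and vruntime wraparound handling is reduced to a signed
difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sched_entity; only vruntime matters here. */
struct toy_entity {
	uint64_t vruntime;
};

/* Hypothetical stand-in for struct cfs_rq with its buddy pointers. */
struct toy_rq {
	struct toy_entity *left;   /* leftmost entity, smallest vruntime */
	struct toy_entity *second; /* entity after left, may be NULL */
	struct toy_entity *next;   /* next buddy, may be NULL */
	struct toy_entity *last;   /* last buddy, may be NULL */
	struct toy_entity *skip;   /* skip buddy, may be NULL */
	uint64_t gran;             /* assumption: one fixed granularity */
};

/*
 * Modelled on wakeup_preempt_entity(): -1 if se is not ahead of curr,
 * 1 if se is ahead by more than gran, 0 if it is ahead but within gran.
 */
static int toy_wakeup_preempt(const struct toy_entity *curr,
			      const struct toy_entity *se, uint64_t gran)
{
	int64_t vdiff = (int64_t)(curr->vruntime - se->vruntime);

	if (vdiff <= 0)
		return -1;
	if ((uint64_t)vdiff > gran)
		return 1;
	return 0;
}

/*
 * compare_against_se == 0 models the original code (buddies vs. left),
 * compare_against_se != 0 models the proposed change (buddies vs. se).
 */
static struct toy_entity *toy_pick(const struct toy_rq *rq,
				   int compare_against_se)
{
	struct toy_entity *left = rq->left;
	struct toy_entity *se = left;          /* step 1 */
	struct toy_entity *ref;

	assert(left != NULL);

	/* Step 2: move off the skip candidate if that is not too unfair. */
	if (rq->skip && rq->skip == se && rq->second &&
	    toy_wakeup_preempt(rq->second, left, rq->gran) < 1)
		se = rq->second;

	/* Step 3: buddies may override whatever step 2 selected. */
	ref = compare_against_se ? se : left;
	if (rq->next && toy_wakeup_preempt(rq->next, ref, rq->gran) < 1)
		se = rq->next;
	else if (rq->last && toy_wakeup_preempt(rq->last, ref, rq->gran) < 1)
		se = rq->last;

	return se;
}
```

With left (also the skip buddy) at vruntime 100, second at 150, a next buddy at
170 and a granularity of 50, the model picks second when the buddies are
compared against left and picks the next buddy when they are compared against
se. In this model the change only matters when the next or last buddy falls
within the granularity of se but not of left.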
This one alone does not seem to make a difference, neither in ignored yields nor
in performance.
Your first patch really does help in terms of ignored yields when
all threads are pinned to one host CPU. With it, we seem to have no ignored
yields at all. But it does not affect the performance of my testcase.
I did some more experiments and I removed the wakeup_preempt_entity checks in
pick_next_entity - assuming that this will result in source always being stopped
and target always being picked. But still, no performance difference.
As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
controller). I will try to better understand the scheduler logic and do some more
testing. If you have anything that I should test, let me know.
Christian
If both yielder and target are in the same cpu cgroup or the cpu cgroup
is disabled (i.e., if cfs_rq_of(p->se) matches), you could try
	if (p->se.vruntime > rq->curr->se.vruntime)
		swap(p->se.vruntime, rq->curr->se.vruntime);
I tried that and it does not show the performance benefit. I then played with my
patch (using different values to add) and the benefit seems to depend on the
size of the value being added; maybe the swap was just not large enough.
I have to say that it is all a bit unclear what improves performance and why.
It just seems that the cpu cgroup controller has a fair share of the performance
problems.
I also asked the performance people to run some measurements, and the numbers for
a transactional workload under KVM were:
baseline: 11813
with much smaller sched_latency_ns and sched_migration_cost_ns: 16419
with cpu controller disabled: 15962
with cpu controller disabled + my patch: 16782
I will be travelling for the next 2 weeks, so I can continue with more debugging
after that.
Thanks for all the ideas and help so far.
Christian
as well as the existing buddy flags, as an entirely fair vruntime boost
to the target.
For when they aren't direct siblings, you /could/ use find_matching_se,
but it's much less clear that's desirable, since it would yield vruntime
for the entire hierarchy to the target's hierarchy.
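
To make the difference between the two vruntime approaches in this thread
concrete, here is a minimal userspace sketch of the swap suggested for the
same-cfs_rq case next to the fixed penalty tried earlier
(curr->vruntime += sysctl_sched_min_granularity). The toy_* names are
hypothetical, the comparison ignores vruntime wraparound, and nothing here
touches buddy flags or the runqueue tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sched_entity; only vruntime is modelled. */
struct toy_se {
	uint64_t vruntime;
};

/*
 * Swap approach: if the target is behind the yielder (larger vruntime),
 * exchange the two values. The sum of both vruntimes is unchanged, so no
 * entity outside the pair is affected.
 */
static void toy_yield_swap(struct toy_se *curr, struct toy_se *target)
{
	if (target->vruntime > curr->vruntime) {
		uint64_t tmp = target->vruntime;

		target->vruntime = curr->vruntime;
		curr->vruntime = tmp;
	}
}

/*
 * Penalty approach: charge the yielder a fixed amount. The target's
 * vruntime is untouched, and the sum of vruntimes grows by the penalty.
 */
static void toy_yield_penalty(struct toy_se *curr, uint64_t penalty)
{
	curr->vruntime += penalty;
}
```

In this model the swap moves the yielder back by exactly the distance between
the two entities, while the penalty moves it back by a chosen amount regardless
of that distance. That is at least consistent with the observation above that
the benefit depends on the size of the value being added.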