On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> > Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> > window but do you have any thoughts on allowing select_idle_sibling to
> > stack a wakee task on the same CPU as a waker in this specific case?
>
> I sort of see, but *groan*...
>

This is the reaction I kinda expected. Sync wakeups and
select_idle_sibling probably caused someone PTSD at some point before my
time in kernel/sched/.

> so if the kworker unlocks a contended mutex/rwsem/completion...
>
> I suppose the fact that it limits it to tasks that were running on the
> same CPU limits the impact if we do get it wrong.
>

And it's limited to the case where no other task is currently running on
the CPU. Now, potentially multiple sleepers are on that CPU waiting for
a mutex/rwsem/completion, but that is very unlikely and most likely due
to the machine being saturated, in which case searching for an idle CPU
would probably fail anyway. It would also be bounded by the small window
after the first wakeup, before the woken task becomes runnable, at which
point the nr_running check mitigates the problem. Besides, if the
sleeping task is waiting on the lock, it *is* related to the kworker,
which is probably finished. In other words, even this patch's worst-case
behaviour does not seem that bad.

> Elsewhere you write:
>
> > I would prefer the wakeup code did not have to signal that it's a
> > synchronous wakeup. Sync wakeups do exist but callers got it wrong many
> > times where stacking was allowed and then the waker did not go to sleep.
> > While the chain of events are related, they are not related in a very
> > obvious way. I think it's much safer to keep this as a scheduler
> > heuristic instead of depending on callers to have sufficient knowledge
> > of the scheduler implementation.
>
> That is true; the existing WF_SYNC has caused many issues for maybe
> being too strong.
>

Exactly. It ended up being almost ignored.
WF_SYNC basically just means that the waker CPU may be used as the
target for wake_affine, because the users were not obeying the contract.

> But what if we create a new hint that combines both these ideas? Say
> WF_COMPLETE and subject that to these same criteria. This way we can
> eliminate wakeups from locks and such (they won't have this set).
>

I think that'll end up with three consequences. First, it falls foul
of Rusty's Rules of API Design[1] because even if people read the
implementation and the documentation, they might still get it wrong,
like what happened with WF_SYNC. Second, some other subsystem will
think it's special and use the flag because it happens to work for one
benchmark, or worse, because they happened to copy/paste the code for
some reason. Finally, the workqueue implementation may change in some
way that renders the use of the flag incorrect. With this patch, if
workqueues change design, it's more likely the patch simply becomes a
no-op.

> Or am I just making things complicated again?

I think so, but I also wrote the patch so I'm biased. Callers would
only be forced into an API change if it were a common pattern for
multiple unbound tasks to sleep on the same CPU waiting on a single
kworker, and I struggle to think of such an example. The length of time
it took this issue to be detected and patched is indicative that not
everyone is familiar with kernel/sched/fair.c and its consequences. If
they were, chances are they would have implemented some mental hack
like binding a task to a single CPU until the IO completes.

[1] http://sweng.the-davies.net/Home/rustys-api-design-manifesto

-- 
Mel Gorman
SUSE Labs