On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> > Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> > window but do you have any thoughts on allowing select_idle_sibling to
> > stack a wakee task on the same CPU as a waker in this specific case?
>
> I sort of see, but *groan*...
>

This is the reaction I kinda expected. Sync wakeups and
select_idle_sibling probably caused someone PTSD at some point before my
time in kernel/sched/.

> so if the kworker unlocks a contended mutex/rwsem/completion...
>
> I suppose the fact that it limits it to tasks that were running on the
> same CPU limits the impact if we do get it wrong.
>

And it's limited to the case where no other task is currently running on
the CPU. Now, potentially multiple sleepers are on that CPU waiting for
a mutex/rwsem/completion, but that is very unlikely and most likely due
to the machine being saturated, in which case searching for an idle CPU
would probably fail anyway. It would also be bounded by the small window
after the first wakeup, before the woken task becomes runnable, at which
point the nr_running check mitigates the problem. Besides, if the
sleeping task is waiting on the lock, it *is* related to the kworker,
which is probably finished. In other words, even this patch's worst-case
behaviour does not seem that bad.

> Elsewhere you write:
>
> > I would prefer the wakeup code did not have to signal that it's a
> > synchronous wakeup. Sync wakeups do exist but callers got it wrong many
> > times where stacking was allowed and then the waker did not go to sleep.
> > While the chain of events are related, they are not related in a very
> > obvious way. I think it's much safer to keep this as a scheduler
> > heuristic instead of depending on callers to have sufficient knowledge
> > of the scheduler implementation.
>
> That is true; the existing WF_SYNC has caused many issues for maybe
> being too strong.
>

Exactly. It ended up being almost ignored.
WF_SYNC basically just means that the waker CPU may be used as the
target for wake_affine, because the users were not obeying the contract.

> But what if we create a new hint that combines both these ideas? Say
> WF_COMPLETE and subject that to these same criteria. This way we can
> eliminate wakeups from locks and such (they won't have this set).
>

I think that'll end up with three consequences. First, it falls foul
of Rusty's Rules of API Design[1] because even if people read the
implementation and the documentation, they might still get it wrong,
like what happened with WF_SYNC. Second, some other subsystem will
think it's special and use the flag because it happens to work for one
benchmark, or worse, because they happened to copy/paste the code for
some reason. Finally, the workqueue implementation may change in some
way that renders the use of the flag incorrect. With this patch, if
workqueues change design, it's more likely the patch simply becomes a
no-op.

> Or am I just making things complicated again?

I think so, but I also wrote the patch so I'm biased. Callers would
only be forced into an API change if it were a common pattern for
multiple unbound tasks to sleep on the same CPU waiting on a single
kworker, and I struggle to think of such an example. The length of time
it took this issue to be detected and patched is indicative that not
everyone is familiar with kernel/sched/fair.c and its consequences. If
they were, chances are they would have implemented some mental hack
like binding a task to a single CPU until the IO completes.

[1] http://sweng.the-davies.net/Home/rustys-api-design-manifesto

-- 
Mel Gorman
SUSE Labs