On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote: > On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote: > > How about the following plan - if folks are happy with this sched patch, > > we can queue it up for 5.3. Once that is in, I'll kill the block change > > that special cases the polled task wakeup. For 5.2, we go with Oleg's > > patch for the swap case. > > OK, works for me. I'll go write a proper patch. I now have the below; I'll queue that after the long weekend and let 0-day chew on it for a while and then push it out to tip or something. --- Subject: sched: Optimize try_to_wake_up() for local wakeups From: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Date: Fri Jun 7 15:39:49 CEST 2019 Jens reported that significant performance can be had on some block workloads (XXX numbers?) by special casing local wakeups. That is, wakeups on the current task before it schedules out. Given something like the normal wait pattern: for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (cond) break; schedule(); } __set_current_state(TASK_RUNNING); Any wakeup (on this CPU) after set_current_state() and before schedule() would benefit from this. Normal wakeups take p->pi_lock, which serializes wakeups to the same task. By eliding that we gain concurrency on: - ttwu_stat(); we already had concurrency on rq stats, this now also brings it to task stats. -ENOCARE - tracepoints; it is now possible to get multiple instances of trace_sched_waking() (and possibly trace_sched_wakeup()) for the same task. Tracers will have to learn to cope. Furthermore, p->pi_lock is used by set_special_state(), to order against TASK_RUNNING stores from other CPUs. But since this is strictly CPU local, we don't need the lock, and set_special_state()'s disabling of IRQs is sufficient. After the normal wakeup takes p->pi_lock it issues smp_mb__after_spinlock(), in order to ensure the woken task must observe prior stores before we observe the p->state. If this is CPU local, this will be satisfied with a compiler barrier, and we rely on try_to_wake_up() being a funcation call, which implies such. Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup would continue into the ttwu_remote() branch, which normally is concerned with exactly this wakeup scenario, except from a remote CPU. IOW we're waking a task that is still running. In this case, we can trivially avoid taking rq->lock, all that's left from this is to set p->state. This then yields an extremely simple and fast path for 'p == current'. Cc: Qian Cai <cai@xxxxxx> Cc: mingo@xxxxxxxxxx Cc: akpm@xxxxxxxxxxxxxxxxxxxx Cc: hch@xxxxxx Cc: gkohli@xxxxxxxxxxxxxx Cc: oleg@xxxxxxxxxx Reported-by: Jens Axboe <axboe@xxxxxxxxx> Tested-by: Jens Axboe <axboe@xxxxxxxxx> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx> --- kernel/sched/core.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un unsigned long flags; int cpu, success = 0; + if (p == current) { + /* + * We're waking current, this means 'p->on_rq' and 'task_cpu(p) + * == smp_processor_id()'. Together this means we can special + * case the whole 'p->on_rq && ttwu_remote()' case below + * without taking any locks. + * + * In particular: + * - we rely on Program-Order guarantees for all the ordering, + * - we're serialized against set_special_state() by virtue of + * it disabling IRQs (this allows not taking ->pi_lock). + */ + if (!(p->state & state)) + return false; + + success = 1; + trace_sched_waking(p); + p->state = TASK_RUNNING; + trace_sched_wakeup(p); + goto out; + } + /* * If we are going to wake up a thread waiting for CONDITION we * need to ensure that CONDITION=1 done by the caller can not be @@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un raw_spin_lock_irqsave(&p->pi_lock, flags); smp_mb__after_spinlock(); if (!(p->state & state)) - goto out; + goto unlock; trace_sched_waking(p); @@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un */ smp_rmb(); if (p->on_rq && ttwu_remote(p, wake_flags)) - goto stat; + goto unlock; #ifdef CONFIG_SMP /* @@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un #endif /* CONFIG_SMP */ ttwu_queue(p, cpu, wake_flags); -stat: - ttwu_stat(p, cpu, wake_flags); -out: +unlock: raw_spin_unlock_irqrestore(&p->pi_lock, flags); +out: + if (success) + ttwu_stat(p, cpu, wake_flags); return success; }