On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote:
> On 09/10, Boqun Feng wrote:
> >
> > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote:
> > > My feeling is
> > > that we should avoid saying too much about the internals of wait_event()
> > > and wake_up().
>
> I feel the same. I simply can't understand what we are trying to
> document ;)

So I've been sitting on this for a while and figured I'd finish it now. These
are some notes on the scheduler locking and how it provides program order
guarantees on SMP systems.

Included in it are some of the details on this subject, because a wakeup has
two prior states that are of importance: the task's own prior state and the
wakeup state. Both should be considered in the 'program order' flow.

So maybe we can reduce the description in memory-barriers.txt to this 'split'
program order guarantee, where a woken task must observe both its own prior
state and the wakeup state.

---
 kernel/sched/core.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..4fffde6f7c08 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1897,6 +1897,143 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on program order guarantees on SMP systems.
+ *
+ *
+ * PREEMPTION/MIGRATION
+ *
+ * Regular preemption/migration is safe because, as long as the task is
+ * runnable, migrations involve both rq locks, albeit not (necessarily) at the
+ * same time.
+ *
+ * So we get (we allow 3 CPU migrations):
+ *
+ *   CPU0                  CPU1                    CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                                 LOCK rq(0)->lock // MB against CPU0
+ *                                                 dequeue X
+ *                                                 UNLOCK rq(0)->lock
+ *
+ *                                                 LOCK rq(1)->lock
+ *                                                 enqueue X
+ *                                                 UNLOCK rq(1)->lock
+ *
+ *                         LOCK rq(1)->lock // MB against CPU2
+ *                         sched-out Z
+ *                         sched-in X
+ *                         UNLOCK rq(1)->lock
+ *
+ * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
+ * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
+ * UNLOCK rq(1) on CPU2; therefore by the time task X runs on CPU1 it must
+ * observe the state it left behind on CPU0.
+ *
+ *
+ * BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking, things are a little more interesting, because when we dequeue
+ * the task we don't need to acquire the old rq lock in order to migrate it.
+ *
+ * Say CPU0 does a wait_event() and CPU1 does the wake_up() and migrates the
+ * task to CPU2 (the most complex example):
+ *
+ *   CPU0 (schedule)       CPU1 (try_to_wake_up)   CPU2 (sched_ttwu_pending)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                         cond = true
+ *
+ *   WMB                   WMB (aka smp_mb__before_spinlock)
+ *   LOCK rq(0)->lock      LOCK X->pi_lock
+ *
+ *   dequeue X
+ *                         while (X->on_cpu)
+ *                           cpu_relax()
+ *   sched-out X
+ *   WMB
+ *   X->on_cpu = 0
+ *                         RMB
+ *                         X->state = WAKING
+ *                         set_task_cpu(X,2)
+ *                           WMB
+ *                           ti(X)->cpu = 2
+ *
+ *                         llist_add(X, rq(2)) // MB
+ *                                                 llist_del_all() // MB
+ *
+ *                                                 LOCK rq(2)->lock
+ *                                                 enqueue X
+ *                                                 X->state = RUNNING
+ *                                                 UNLOCK rq(2)->lock
+ *
+ *                                                 LOCK rq(2)->lock
+ *                                                 sched-out Z
+ *                                                 sched-in X
+ *                                                 UNLOCK rq(2)->lock
+ *
+ *                                                 if (cond) // _TRUE_
+ *                         UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * stores on CPU0 cannot get delayed past the MB, the llist primitives provide
+ * order between CPU1 and CPU2, and X's loads on CPU2 cannot happen before the
+ * last LOCK.
+ *
+ * Which again leads to the guarantee that by the time X gets to run on CPU2
+ * it must observe the state it left behind on CPU0.
+ *
+ * However, for blocking there is a second guarantee we must provide, namely we
+ * must observe the state that led to our wakeup. That is, not only must X
+ * observe its own prior state, it must also observe the @cond store.
+ *
+ * This too is achieved in the above, purely by the llist primitives ordering
+ * CPU1 against CPU2.
+ *
+ * There is however a much more interesting case for this guarantee, where X
+ * never makes it off CPU0:
+ *
+ *   CPU0 (schedule)       CPU1 (try_to_wake_up)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                         cond = true
+ *
+ *   WMB                   WMB (aka smp_mb__before_spinlock)
+ *                         LOCK X->pi_lock
+ *
+ *                         if (X->on_rq)
+ *                           LOCK rq(0)->lock
+ *                           X->state = RUNNING
+ *                           UNLOCK rq(0)->lock
+ *
+ *   LOCK rq(0)->lock // MB against CPU1
+ *   UNLOCK rq(0)->lock
+ *
+ *   if (cond) // _TRUE_
+ *
+ *                         UNLOCK X->pi_lock
+ *
+ * Here our task X never quite leaves the CPU; the wakeup happens before we can
+ * dequeue and schedule someone else. In this case we must still observe @cond
+ * after our call to schedule() completes.
+ *
+ * This is achieved by the smp_mb__before_spinlock() WMB, which ensures the
+ * store cannot leak inside the LOCK, and by the LOCK rq(0)->lock on CPU0,
+ * which provides full order against the UNLOCK rq(0)->lock from CPU1.
+ * Furthermore, our load of @cond cannot happen before this same LOCK.
+ *
+ * Therefore, again, we're good.
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
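
For reference, the wait_event()/wake_up() pattern all of the above is meant to
back up looks roughly like the sketch below. This is an illustration only: the
names my_wq, my_cond, my_sleeper and my_waker are made up, and the wait_event()
expansion in the comment is paraphrased, not the exact macro.

#include <linux/types.h>
#include <linux/wait.h>
#include <linux/sched.h>

/* Made-up example names; not part of the patch above. */
static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static bool my_cond;

/* The CPU0 side of the diagrams: block until the condition becomes true. */
static void my_sleeper(void)
{
	/*
	 * wait_event() boils down to the classic loop:
	 *
	 *	for (;;) {
	 *		set_current_state(TASK_UNINTERRUPTIBLE);
	 *		if (my_cond)
	 *			break;
	 *		schedule();
	 *	}
	 *	__set_current_state(TASK_RUNNING);
	 *
	 * where set_current_state() is the "X->state = UNINTERRUPTIBLE; MB"
	 * step and schedule() is the dequeue/sched-out in the diagrams.
	 */
	wait_event(my_wq, my_cond);

	/*
	 * Guarantee 1: everything this task stored before blocking is
	 * visible here, no matter which CPU it ended up running on.
	 * Guarantee 2: the my_cond = true store done by the waker (the
	 * @cond store in the notes) is visible here as well.
	 */
}

/* The CPU1 side of the diagrams: publish the condition, then wake. */
static void my_waker(void)
{
	my_cond = true;		/* the @cond store */
	wake_up(&my_wq);	/* try_to_wake_up() under X->pi_lock */
}

The 'split' program order guarantee maps directly onto this: once wait_event()
returns, the task must see both whatever it stored before blocking and the
my_cond = true store from the waker, regardless of which CPU it resumes on.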