+ cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Tue, 06 May 2014 15:09:42 -0700

Subject: + cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch added to -mm tree
To: srivatsa.bhat@xxxxxxxxxxxxxxxxxx,bp@xxxxxxx,ego@xxxxxxxxxxxxxxxxxx,fweisbec@xxxxxxxxx,hch@xxxxxxxxxxxxx,mgalbraith@xxxxxxx,mgorman@xxxxxxx,mingo@xxxxxxxxxx,oleg@xxxxxxxxxx,paulmck@xxxxxxxxxxxxxxxxxx,peterz@xxxxxxxxxxxxx,riel@xxxxxxxxxx,rjw@xxxxxxxxxxxxx,rostedt@xxxxxxxxxxx,rusty@xxxxxxxxxxxxxxx,tglx@xxxxxxxxxxxxx,tj@xxxxxxxxxx
From: akpm@xxxxxxxxxxxxxxxxxxxx
Date: Tue, 06 May 2014 15:09:42 -0700


The patch titled
     Subject: CPU hotplug, stop-machine: plug race-window that leads to "IPI-to-offline-CPU"
has been added to the -mm tree.  Its filename is
     cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Srivatsa S. Bhat" <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Subject: CPU hotplug, stop-machine: plug race-window that leads to "IPI-to-offline-CPU"

During CPU offline, stop-machine is used to take control over all the
online CPUs (via the per-cpu stopper thread) and then run take_cpu_down()
on the CPU that is to be taken offline.

But stop-machine itself has several stages: _PREPARE, _DISABLE_IRQ, _RUN
etc.  The important thing to note here is that the _DISABLE_IRQ stage
comes much later after starting stop-machine, and hence there is a large
window where other CPUs can send IPIs to the CPU going offline.  As a
result, we can encounter a scenario as depicted below, which causes IPIs
to be sent to the CPU going offline, and that CPU notices them *after* it
has gone offline, triggering the "IPI-to-offline-CPU" warning from the
smp-call-function code.

              CPU 1                                         CPU 2
          (Online CPU)                               (CPU going offline)

       Enter _PREPARE stage                          Enter _PREPARE stage

                                                     Enter _DISABLE_IRQ stage

                                                   =
       Got a device interrupt,                     | Didn't notice the IPI
       and the interrupt handler                   | since interrupts were
       called smp_call_function()                  | disabled on this CPU.
       and sent an IPI to CPU 2.                   |
                                                   =

       Enter _DISABLE_IRQ stage

       Enter _RUN stage                              Enter _RUN stage

                                  =
       Busy loop with interrupts  |                  Invoke take_cpu_down()
       disabled.                  |                  and take CPU 2 offline
                                  =

       Enter _EXIT stage                             Enter _EXIT stage

       Re-enable interrupts                          Re-enable interrupts

                                                     The pending IPI is noted
                                                     immediately, but alas,
                                                     the CPU is offline at
                                                     this point.

So, as we can observe from this scenario, the IPI was sent when CPU 2 was
still online, and hence it was perfectly legal.  But unfortunately it was
noted only after CPU 2 went offline, resulting in the warning from the IPI
handling code.  In other words, the fault was not at the sender, but at
the receiver side - and if we look closely, the real bug is in the
stop-machine sequence itself.

The problem here is that the CPU going offline disabled its local
interrupts (by entering _DISABLE_IRQ phase) *before* the other CPUs.  And
that's the reason why it was not able to respond to the IPI before going
offline.

A simple solution to this problem is to ensure that the CPU going offline
*follows* all other CPUs while entering each subsequent phase within
stop-machine.  In particular, all other CPUs will enter the _DISABLE_IRQ
phase and disable their local interrupts, and only *then*, the CPU going
offline will follow suit.  Since the other CPUs are executing the
stop-machine code with interrupts disabled, they won't send any IPIs at
all, at that point.  And by the time stop-machine ends, the CPU would have
gone offline and disappeared from the cpu_online_mask, and hence future
invocations of smp_call_function() and friends will automatically prune
that CPU out.  Thus, we can guarantee that no CPU will end up
*inadvertently* sending IPIs to an offline CPU.

We can implement this by introducing a "holding area" for the CPUs marked
as 'active_cpus', and use this infrastructure to let the other CPUs
progress from one stage to the next, before allowing the active_cpus to do
the same thing.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Mike Galbraith <mgalbraith@xxxxxxx>
Cc: Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 kernel/stop_machine.c |   30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff -puN kernel/stop_machine.c~cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu kernel/stop_machine.c

--- a/kernel/stop_machine.c~cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu
+++ a/kernel/stop_machine.c
@@ -165,12 +165,13 @@ static void ack_state(struct multi_stop_
 		set_state(msdata, msdata->state + 1);
 }
 
+
 /* This is the cpu_stop function which stops the CPU. */
 static int multi_cpu_stop(void *data)
 {
 	struct multi_stop_data *msdata = data;
 	enum multi_stop_state curstate = MULTI_STOP_NONE;
-	int cpu = smp_processor_id(), err = 0;
+	int cpu = smp_processor_id(), num_active_cpus, err = 0;
 	unsigned long flags;
 	bool is_active;
 
@@ -180,15 +181,38 @@ static int multi_cpu_stop(void *data)
 	 */
 	local_save_flags(flags);
 
-	if (!msdata->active_cpus)
+	if (!msdata->active_cpus) {
 		is_active = cpu == cpumask_first(cpu_online_mask);
-	else
+		num_active_cpus = 1;
+	} else {
 		is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+		num_active_cpus = cpumask_weight(msdata->active_cpus);
+	}
 
 	/* Simple state machine */
 	do {
 		/* Chill out and ensure we re-read multi_stop_state. */
 		cpu_relax();
+
+		/*
+		 * In the case of CPU offline, we don't want the other CPUs to
+		 * send IPIs to the active_cpu (the one going offline) after it
+		 * has entered the _DISABLE_IRQ state (because, then it will
+		 * notice the IPIs only after it goes offline). So ensure that
+		 * the active_cpu always follows the others while entering
+		 * each subsequent state in this state-machine.
+		 *
+		 * msdata->thread_ack tracks the number of CPUs that are yet to
+		 * move to the next state, during each transition. So make the
+		 * active_cpu(s) wait until ->thread_ack indicates that the
+		 * active_cpus are the only ones left to complete the transition.
+		 */
+		if (is_active) {
+			/* Wait until all the non-active threads ack the state */
+			while (atomic_read(&msdata->thread_ack) > num_active_cpus)
+				cpu_relax();
+		}
+
 		if (msdata->state != curstate) {
 			curstate = msdata->state;
 			switch (curstate) {
_

Patches currently in -mm which might be from srivatsa.bhat@xxxxxxxxxxxxxxxxxx are

smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch
smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu-fix.patch
cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html