[merged] smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch removed from -mm tree

Subject: [merged] smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch removed from -mm tree
To: srivatsa.bhat@xxxxxxxxxxxxxxxxxx,bp@xxxxxxx,ego@xxxxxxxxxxxxxxxxxx,fweisbec@xxxxxxxxx,hch@xxxxxxxxxxxxx,mgalbraith@xxxxxxx,mgorman@xxxxxxx,mingo@xxxxxxxxxx,oleg@xxxxxxxxxx,paulmck@xxxxxxxxxxxxxxxxxx,peterz@xxxxxxxxxxxxx,riel@xxxxxxxxxx,rjw@xxxxxxxxxxxxx,rostedt@xxxxxxxxxxx,rusty@xxxxxxxxxxxxxxx,tglx@xxxxxxxxxxxxx,tj@xxxxxxxxxx,mm-commits@xxxxxxxxxxxxxxx
From: akpm@xxxxxxxxxxxxxxxxxxxx
Date: Mon, 09 Jun 2014 12:34:42 -0700


The patch titled
     Subject: smp: print more useful debug info upon receiving IPI on an offline CPU
has been removed from the -mm tree.  Its filename was
     smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch

This patch was dropped because it was merged into mainline or a subsystem tree.

------------------------------------------------------
From: "Srivatsa S. Bhat" <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Subject: smp: print more useful debug info upon receiving IPI on an offline CPU

There is a longstanding problem related to CPU hotplug which causes IPIs
to be delivered to offline CPUs, and the smp-call-function IPI handler
code prints out a warning whenever this is detected.  Every once in a
while this (usually harmless) warning gets reported on LKML, but so far it
has not been completely fixed.  Usually the solution involves finding out
the IPI sender and fixing it by adding appropriate synchronization with
CPU hotplug.

However, while going through one such internal bug report, I found that
there is a significant bug in the receiver side itself (more specifically,
in stop-machine) that can lead to this problem even when the sender code
is perfectly fine.  This patchset fixes that synchronization problem in
the CPU hotplug stop-machine code.

Patch 1 adds some additional debug code to the smp-call-function
framework, to help debug such issues easily.

Patch 2 modifies the stop-machine code to ensure that any IPIs sent while
the target CPU was still online are noticed and handled by that CPU
without fail before it goes offline.  This avoids scenarios where IPIs are
received on offline CPUs (as long as the sender uses proper hotplug
synchronization).


In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq()
function.  But I was not able to find anything wrong with the block layer
code.  That's when I started looking at the stop-machine code and realized
that there is a race-window which makes the IPI _receiver_ the culprit,
not the sender.  Patch 2 fixes that race and hence this should put an end
to most of the hard-to-debug IPI-to-offline-CPU issues.




This patch (of 2):

Today the smp-call-function code just prints a warning if we get an IPI on
an offline CPU.  This info is sufficient to let us know that something
went wrong, but often it is very hard to debug exactly who sent the IPI
and why, from this info alone.

In most cases, the warning about an IPI to an offline CPU appears
immediately after the CPU that is going offline comes out of the
stop-machine phase and re-enables interrupts.  Since all online CPUs participate in
stop-machine, the information regarding the sender of the IPI is already
lost by the time we exit the stop-machine loop.  So even if we dump the
stack on each CPU at this point, we won't find anything useful since all
of them will show the stack-trace of the stopper thread.  So we need a
better way to figure out who sent the IPI and why.

To achieve this, when we detect an IPI targeted to an offline CPU, loop
through the call-single-data linked list and print out the payload (i.e.,
the name of the function which was supposed to be executed by the target
CPU).  This gives us some insight into who might have sent the IPI and
helps us debug this further.

[akpm@xxxxxxxxxxxxxxxxxxxx: correctly suppress warning output on second and later occurrences]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Mike Galbraith <mgalbraith@xxxxxxx>
Cc: Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 kernel/smp.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff -puN kernel/smp.c~smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu kernel/smp.c
--- a/kernel/smp.c~smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu
+++ a/kernel/smp.c
@@ -185,14 +185,26 @@ void generic_smp_call_function_single_in
 {
 	struct llist_node *entry;
 	struct call_single_data *csd, *csd_next;
+	static bool warned;
+
+	entry = llist_del_all(&__get_cpu_var(call_single_queue));
+	entry = llist_reverse_order(entry);
 
 	/*
 	 * Shouldn't receive this interrupt on a cpu that is not yet online.
 	 */
-	WARN_ON_ONCE(!cpu_online(smp_processor_id()));
+	if (unlikely(!cpu_online(smp_processor_id()) && !warned)) {
+		warned = true;
+		WARN(1, "IPI on offline CPU %d\n", smp_processor_id());
 
-	entry = llist_del_all(&__get_cpu_var(call_single_queue));
-	entry = llist_reverse_order(entry);
+		/*
+		 * We don't have to use the _safe() variant here
+		 * because we are not invoking the IPI handlers yet.
+		 */
+		llist_for_each_entry(csd, entry, llist)
+			pr_warn("IPI callback %pS sent to offline CPU\n",
+				csd->func);
+	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
 		csd->func(csd->info);
_
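
For readers who want to try the control flow outside the kernel, here is a
minimal userspace sketch of the reworked handler, not the kernel code itself.
All names in it (csd_sketch, reverse_order, handle_pending, count_call, the
name field) are hypothetical stand-ins: the real code uses struct
call_single_data, llist_del_all(), llist_reverse_order(), WARN() and the %pS
printk format, none of which are available in userspace.  The sketch only
illustrates the ordering: detach and reverse the queue first, warn once and
list the pending callbacks if the CPU is offline, then run the callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct call_single_data. */
struct csd_sketch {
	struct csd_sketch *next;
	void (*func)(void *info);
	void *info;
	const char *name;	/* replaces the symbol lookup that %pS does */
};

/* Example payload: bumps the int counter that info points to. */
void count_call(void *info)
{
	++*(int *)info;
}

/* Reverse a singly linked list, in the spirit of llist_reverse_order(). */
struct csd_sketch *reverse_order(struct csd_sketch *head)
{
	struct csd_sketch *new_head = NULL;

	while (head) {
		struct csd_sketch *tmp = head;

		head = head->next;
		tmp->next = new_head;
		new_head = tmp;
	}
	return new_head;
}

/*
 * Drain *queue.  If the CPU is offline and we have not warned before,
 * print every pending callback prior to running any of them.
 * Returns the number of callbacks reported by the warning path.
 */
int handle_pending(struct csd_sketch **queue, bool cpu_is_online, int cpu)
{
	static bool warned;
	struct csd_sketch *entry, *csd, *next;
	int reported = 0;

	assert(queue != NULL);

	/* Detach the whole list up front so the warning path can walk it. */
	entry = reverse_order(*queue);
	*queue = NULL;

	if (!cpu_is_online && !warned) {
		warned = true;
		fprintf(stderr, "IPI on offline CPU %d\n", cpu);

		/* Plain walk is fine: no callback has run yet. */
		for (csd = entry; csd; csd = csd->next) {
			fprintf(stderr, "IPI callback %s sent to offline CPU\n",
				csd->name);
			reported++;
		}
	}

	/* Callbacks may reuse their node, so fetch next before calling. */
	for (csd = entry; csd; csd = next) {
		next = csd->next;
		csd->func(csd->info);
	}
	return reported;
}
```

Calling handle_pending() twice with cpu_is_online set to false prints the
callback list only the first time, which mirrors the "static bool warned"
logic in the patch; the callbacks themselves still run on every call.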

Patches currently in -mm which might be from srivatsa.bhat@xxxxxxxxxxxxxxxxxx are

origin.patch
cpu-hotplug-smp-flush-any-pending-ipi-callbacks-before-cpu-offline.patch
cpu-hotplug-smp-flush-any-pending-ipi-callbacks-before-cpu-offline-checkpatch-fixes.patch
linux-next.patch




