+ smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch added to -mm tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Subject: + smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch added to -mm tree
To: srivatsa.bhat@xxxxxxxxxxxxxxxxxx,bp@xxxxxxx,ego@xxxxxxxxxxxxxxxxxx,fweisbec@xxxxxxxxx,hch@xxxxxxxxxxxxx,mgalbraith@xxxxxxx,mgorman@xxxxxxx,mingo@xxxxxxxxxx,oleg@xxxxxxxxxx,paulmck@xxxxxxxxxxxxxxxxxx,peterz@xxxxxxxxxxxxx,riel@xxxxxxxxxx,rjw@xxxxxxxxxxxxx,rostedt@xxxxxxxxxxx,rusty@xxxxxxxxxxxxxxx,tglx@xxxxxxxxxxxxx,tj@xxxxxxxxxx
From: akpm@xxxxxxxxxxxxxxxxxxxx
Date: Tue, 06 May 2014 13:41:13 -0700


The patch titled
     Subject: smp: print more useful debug info upon receiving IPI on an offline CPU
has been added to the -mm tree.  Its filename is
     smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Srivatsa S. Bhat" <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Subject: smp: print more useful debug info upon receiving IPI on an offline CPU

There is a longstanding problem related to CPU hotplug which causes IPIs
to be delivered to offline CPUs, and the smp-call-function IPI handler
code prints out a warning whenever this is detected.  Every once in a
while this (usually harmless) warning gets reported on LKML, but so far it
has not been completely fixed.  Usually the solution involves finding out
the IPI sender and fixing it by adding appropriate synchronization with
CPU hotplug.

However, while going through one such internal bug reports, I found that
there is a significant bug in the receiver side itself (more specifically,
in stop-machine) that can lead to this problem even when the sender code
is perfectly fine.  This patchset fixes that synchronization problem in
the CPU hotplug stop-machine code.

Patch 1 adds some additional debug code to the smp-call-function
framework, to help debug such issues easily.

Patch 2 modifies the stop-machine code to ensure that any IPIs that were
sent while the target CPU was online, would be noticed and handled by that
CPU without fail before it goes offline.  Thus, this avoids scenarios
where IPIs are received on offline CPUs (as long as the sender uses proper
hotplug synchronization).


In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq()
function.  But I was not able to find anything wrong with the block layer
code.  That's when I started looking at the stop-machine code and realized
that there is a race-window which makes the IPI _receiver_ the culprit,
not the sender.  Patch 2 fixes that race and hence this should put an end
to most of the hard-to-debug IPI-to-offline-CPU issues.




This patch (of 2):

Today the smp-call-function code just prints a warning if we get an IPI on
an offline CPU.  This info is sufficient to let us know that something
went wrong, but often it is very hard to debug exactly who sent the IPI
and why, from this info alone.

In most cases, we get the warning about the IPI to an offline CPU,
immediately after the CPU going offline comes out of the stop-machine
phase and reenables interrupts.  Since all online CPUs participate in
stop-machine, the information regarding the sender of the IPI is already
lost by the time we exit the stop-machine loop.  So even if we dump the
stack on each CPU at this point, we won't find anything useful since all
of them will show the stack-trace of the stopper thread.  So we need a
better way to figure out who sent the IPI and why.

To achieve this, when we detect an IPI targeted to an offline CPU, loop
through the call-single-data linked list and print out the payload (i.e.,
the name of the function which was supposed to be executed by the target
CPU).  This would give us an insight as to who might have sent the IPI and
help us debug this further.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Mike Galbraith <mgalbraith@xxxxxxx>
Cc: Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 kernel/smp.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff -puN kernel/smp.c~smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu kernel/smp.c
--- a/kernel/smp.c~smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu
+++ a/kernel/smp.c
@@ -185,15 +185,28 @@ void generic_smp_call_function_single_in
 {
 	struct llist_node *entry;
 	struct call_single_data *csd, *csd_next;
+	int warn = 0;
 
 	/*
 	 * Shouldn't receive this interrupt on a cpu that is not yet online.
 	 */
-	WARN_ON_ONCE(!cpu_online(smp_processor_id()));
+	if (unlikely(!cpu_online(smp_processor_id()))) {
+		warn = 1;
+		WARN_ON_ONCE(1);
+	}
 
 	entry = llist_del_all(&__get_cpu_var(call_single_queue));
 	entry = llist_reverse_order(entry);
 
+	if (unlikely(warn)) {
+		/*
+		 * We don't have to use the _safe() variant here
+		 * because we are not invoking the IPI handlers yet.
+		 */
+		llist_for_each_entry(csd, entry, llist)
+			pr_warn("SMP IPI Payload: %pS \n", csd->func);
+	}
+
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
 		csd->func(csd->info);
 		csd_unlock(csd);
_

Patches currently in -mm which might be from srivatsa.bhat@xxxxxxxxxxxxxxxxxx are

smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu.patch
smp-print-more-useful-debug-info-upon-receiving-ipi-on-an-offline-cpu-fix.patch
cpu-hotplug-stop-machine-plug-race-window-that-leads-to-ipi-to-offline-cpu.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Kernel Newbies FAQ]     [Kernel Archive]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [Bugtraq]     [Photo]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]

  Powered by Linux