On Mon, Mar 01, 2021 at 11:13:36AM +0100, Juergen Gross wrote: > In order to help identifying problems with IPI handling and remote > function execution add some more data to IPI debugging code. > > There have been multiple reports of cpus looping long times (many > seconds) in smp_call_function_many() waiting for another cpu executing > a function like tlb flushing. Most of these reports have been for > cases where the kernel was running as a guest on top of KVM or Xen > (there are rumours of that happening under VMWare, too, and even on > bare metal). > > Finding the root cause hasn't been successful yet, even after more than > 2 years of chasing this bug by different developers. > > Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout > diagnostics") tried to address this by adding some debug code and by > issuing another IPI when a hang was detected. This helped mitigating > the problem (the repeated IPI unlocks the hang), but the root cause is > still unknown. > > Current available data suggests that either an IPI wasn't sent when it > should have been, or that the IPI didn't result in the target cpu > executing the queued function (due to the IPI not reaching the cpu, > the IPI handler not being called, or the handler not seeing the queued > request). > > Try to add more diagnostic data by introducing a global atomic counter > which is being incremented when doing critical operations (before and > after queueing a new request, when sending an IPI, and when dequeueing > a request). The counter value is stored in percpu variables which can > be printed out when a hang is detected. > > The data of the last event (consisting of sequence counter, source > cpu, target cpu, and event type) is stored in a global variable. When > a new event is to be traced, the data of the last event is stored in > the event related percpu location and the global data is updated with > the new event's data. This allows to track two events in one data > location: one by the value of the event data (the event before the > current one), and one by the location itself (the current event). > > A typical printout with a detected hang will look like this: > > csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410). > csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request. > csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty) > csd: cnt(00008cd): ffff->0006 idle > csd: cnt(0003668): 0001->0006 queue > csd: cnt(0003669): 0001->0006 ipi > csd: cnt(0003e0f): 0007->000a queue > csd: cnt(0003e10): 0001->ffff ping > csd: cnt(0003e71): 0003->0000 ping > csd: cnt(0003e72): ffff->0006 gotipi > csd: cnt(0003e73): ffff->0006 handle > csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty) > csd: cnt(0003e7f): 0004->0006 ping > csd: cnt(0003e80): 0001->ffff pinged > csd: cnt(0003eb2): 0005->0001 noipi > csd: cnt(0003eb3): 0001->0006 queue > csd: cnt(0003eb4): 0001->0006 noipi > csd: cnt now: 0003f00 > > The idea is to print only relevant entries. Those are all events which > are associated with the hang (so sender side events for the source cpu > of the hanging request, and receiver side events for the target cpu), > and the related events just before those (for adding data needed to > identify a possible race). Printing all available data would be > possible, but this would add large amounts of data printed on larger > configurations. > > Signed-off-by: Juergen Gross <jgross@xxxxxxxx> > Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxx> > --- > V2: > - add automatic data deciphering and sorting of entries > - add new trace point for leaving flush_smp_call_function_queue() > - add information when finding an empty call_single_queue They do not apply on top of these: https://lkml.kernel.org/r/20210220231712.2475218-2-namit@xxxxxxxxxx :-/ > @@ -290,6 +476,19 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); > > void __smp_call_single_queue(int cpu, struct llist_node *node) > { > +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG > + if (static_branch_unlikely(&csdlock_debug_extended)) { > + unsigned int type; > + > + type = CSD_TYPE(container_of(node, call_single_data_t, > + node.llist)); > + if (type == CSD_TYPE_SYNC || type == CSD_TYPE_ASYNC) { > + __smp_call_single_queue_debug(cpu, node); > + return; > + } > + } > +#endif This really ought to be in generic_exec_single(), because there we know the type matches.