On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:
> On Fri, 28 Jul 2017 08:44:11 +0100
> Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:

[ . . . ]

> Ok. Some info. I disabled a few drivers (usb and SAS) in the interest of having
> fewer timer events. Issue became much easier to trigger (on some runs before
> I could get tracing up and running)
>
> So logs are large enough that pastebin doesn't like them - please shout if
> another timer period is of interest.
>
> https://pastebin.com/iUZDfQGM for the timer trace.
> https://pastebin.com/3w1F7amH for dmesg.
>
> The relevant timeout on the RCU stall detector was 8 seconds. Event is
> detected around 835.
>
> It's a lot of logs, so I haven't identified a smoking gun yet but there
> may well be one in there.

The dmesg says:

rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1

So I look for "rcu_preempt" timer events and find these:

rcu_preempt-9 [019] .... 827.579114: timer_init: timer=ffff8017d5fc7da0
rcu_preempt-9 [019] d..1 827.579115: timer_start: timer=ffff8017d5fc7da0 function=process_timeout

Next I look for "ffff8017d5fc7da0" and I don't find anything else.
The timeout was one jiffy, and more than a second later, no expiration.
Is it possible that this event was lost? I am not seeing any sign of
this in the trace.

I don't see any sign of CPU hotplug (and I test with lots of that in
any case).

The last time we saw something like this it was a timer HW/driver problem,
but it is a bit hard to imagine such a problem affecting both ARM64
and SPARC. ;-)

Thomas, any debugging suggestions?

Thanx, Paul
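
For anyone repeating the search above on the full timer trace, here is a minimal sketch in Python of the same check: record every timer_start, then drop the timer once a later timer_expire_entry or timer_cancel names the same address. It assumes the trace uses the text format of the two lines quoted above. The function name find_unfinished_timers and the regular expression are illustrative only and are not part of any kernel tooling.

```python
import re

# Matches trace lines such as:
#   rcu_preempt-9 [019] d..1 827.579115: timer_start: timer=ffff8017d5fc7da0 function=process_timeout
# Assumption: the trace is in this text layout (task-pid, [cpu], flags, timestamp, event).
EVENT_RE = re.compile(
    r"^\s*(?P<task>.+?)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>timer_\w+):\s+timer=(?P<timer>[0-9a-f]+)"
)


def find_unfinished_timers(lines, task=None):
    """Return {timer_address: (start_time, task, cpu)} for every timer that
    was started but has no later timer_expire_entry or timer_cancel.

    If 'task' is given, only timers started by that task are tracked.
    Expiry and cancel events are matched by address alone, because they
    usually run in a different context from the one that armed the timer.
    """
    pending = {}
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # not a timer event line
        timer, event = m.group("timer"), m.group("event")
        if event == "timer_start":
            if task is None or m.group("task") == task:
                pending[timer] = (
                    float(m.group("ts")),
                    m.group("task"),
                    int(m.group("cpu")),
                )
        elif event in ("timer_expire_entry", "timer_cancel"):
            pending.pop(timer, None)
    return pending


if __name__ == "__main__":
    # The two events quoted in this message.
    sample = [
        "rcu_preempt-9 [019] .... 827.579114: timer_init: timer=ffff8017d5fc7da0",
        "rcu_preempt-9 [019] d..1 827.579115: timer_start: timer=ffff8017d5fc7da0 function=process_timeout",
    ]
    for addr, (ts, who, cpu) in find_unfinished_timers(sample, task="rcu_preempt").items():
        print(f"timer {addr} started at {ts} by {who} on CPU {cpu}, never expired or cancelled")
```

Timers that were armed shortly before the trace ended will also be reported, so the start timestamp needs to be compared against the end of the capture before concluding that an expiration went missing.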