On Fri, 28 Jul 2017 08:44:11 +0100
Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:

> On Thu, 27 Jul 2017 09:52:45 -0700
> "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> > On Thu, Jul 27, 2017 at 05:39:23PM +0100, Jonathan Cameron wrote:
> > > On Thu, 27 Jul 2017 14:49:03 +0100
> > > Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > > > On Thu, 27 Jul 2017 05:49:13 -0700
> > > > "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > > On Thu, Jul 27, 2017 at 02:34:00PM +1000, Nicholas Piggin wrote:
> > > > > > On Wed, 26 Jul 2017 18:42:14 -0700
> > > > > > "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > > On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:
> > > > > >
> > > > > > > > Indeed, that really wouldn't explain how we end up with a RCU stall
> > > > > > > > dump listing almost all of the cpus as having missed a grace period.
> > > > > > >
> > > > > > > I have seen stranger things, but admittedly not often.
> > > > > >
> > > > > > So the backtraces show the RCU gp thread in schedule_timeout.
> > > > > >
> > > > > > Are you sure that its timeout has expired and it's not being scheduled,
> > > > > > or could it be a bad (large) timeout (looks unlikely) or that it's being
> > > > > > scheduled but not correctly noting gps on other CPUs?
> > > > > >
> > > > > > It's not in R state, so if it's not being scheduled at all, then it's
> > > > > > because the timer has not fired:
> > > > >
> > > > > Good point, Nick!
> > > > >
> > > > > Jonathan, could you please reproduce collecting timer event tracing?
> > > > I'm a little new to tracing (only started playing with it last week)
> > > > so fingers crossed I've set it up right. No splats yet. Was getting
> > > > splats on reading out the trace when running with the RCU stall timer
> > > > set to 4 so have increased that back to the default and am rerunning.
> > > >
> > > > This may take a while.
> > > > Correct me if I've gotten this wrong to save time
> > > >
> > > > echo "timer:*" > /sys/kernel/debug/tracing/set_event
> > > >
> > > > when it dumps, just send you the relevant part of what is in
> > > > /sys/kernel/debug/tracing/trace?
> > >
> > > Interestingly the only thing that can make it trip for me with tracing on
> > > is peeking in the tracing buffers. Not sure this is a valid case or
> > > not.
> > >
> > > Anyhow all timer activity seems to stop around the area of interest.
> > >
> >
>

Firstly, sorry to those who got the rather silly-length email a minute ago.
It bounced on the list (fair enough - I was just being lazy about getting
data past our firewalls).

OK, some info. I disabled a few drivers (USB and SAS) in the interest of
having fewer timer events. The issue became much easier to trigger (on some
runs it tripped before I could get tracing up and running).

The logs are large enough that pastebin doesn't like them - please shout
if another timer period is of interest.

https://pastebin.com/iUZDfQGM for the timer trace.
https://pastebin.com/3w1F7amH for dmesg.

The relevant timeout on the RCU stall detector was 8 seconds. The event is
detected at around 835.

It's a lot of logs, so I haven't identified a smoking gun yet, but there
may well be one in there.

Jonathan
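
For anyone repeating the capture, the steps quoted above could be scripted roughly as follows. This is only a sketch. The set_event and trace paths are the ones mentioned in the thread. The TRACING_DIR variable and the function names (enable_timer_events, save_trace, timer_gaps) are made up for illustration. The gap search assumes the default ftrace text output, where each event line carries a timestamp field such as "835.123456:".

```shell
#!/bin/sh
# Sketch only: helper names and TRACING_DIR are assumptions, not an existing tool.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

# Enable all events in the "timer" group (needs root and a mounted debugfs).
enable_timer_events() {
    echo "timer:*" > "$TRACING_DIR/set_event"
}

# Copy the current trace buffer to the file named in $1.
save_trace() {
    cat "$TRACING_DIR/trace" > "$1"
}

# Print every pair of consecutive events in trace file $1 whose timestamps
# are at least $2 seconds apart, i.e. windows with no timer activity.
timer_gaps() {
    awk -v min="$2" '
        /^#/ { next }                      # skip the ftrace header comments
        {
            for (i = 1; i <= NF; i++) {
                # first field that looks like "seconds.micros:" is the timestamp
                if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                    ts = $i; sub(/:$/, "", ts); ts += 0
                    if (seen && ts - prev >= min)
                        printf "%.6f -> %.6f (gap %.6f s)\n", prev, ts, ts - prev
                    prev = ts; seen = 1
                    break
                }
            }
        }' "$1"
}
```

With a saved trace, something like `timer_gaps timer-trace.txt 8` would list any stretch of 8 seconds or more without a timer event, which may be a quick way to see whether timer activity really stops before the stall is reported at around 835.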