On Tue, Nov 01, 2016 at 10:47:37PM +0200, Ville Syrjälä wrote:
> On Fri, Oct 28, 2016 at 08:58:41PM +0200, Thomas Gleixner wrote:
> > On Fri, 28 Oct 2016, Ville Syrjälä wrote:
> > > On Thu, Oct 27, 2016 at 10:41:18PM +0200, Thomas Gleixner wrote:
> > > > On Thu, 27 Oct 2016, Ville Syrjälä wrote:
> > > > > On Thu, Oct 27, 2016 at 09:25:05PM +0200, Thomas Gleixner wrote:
> > > > > > So it would be interesting whether that hunk in resume_broadcast() is
> > > > > > sufficient.
> > > > >
> > > > > So far it looks like the answer is yes.
> > > > >
> > > > > Looks to be about 5 seconds slower than acpi-idle in resuming, but
> > > > > I suppose that's not all that surprising ;)
> > > >
> > > > Well, set it to 1msec then. If that works reliably then we really can do
> > > > that unconditionally. There is no harm in firing a useless timer during
> > > > resume once.
> > >
> > > I narrowed down the required timeout, and looks like 25ms is the
> > > minimum that works. With 24ms I already started to have failures. So
> > > maybe just bump it up by an order of magnitude to 250ms for some
> > > safety margin?
>
> I left the thing running for the weekend and it failed 26 out of 16057
> times with the 25ms timeout. Looks like it takes ~5 minutes to resume
> when it fails, but eventually it does come back.
>
> >
> > Sure, but what puzzles me is that we need a timeout that big. What happens
> > between broadcast_resume() and broadcast_resume() + 25ms?
> >
> > IOW, what is the event/resume function which we need to bridge. We should
> > really try to track than down.
>
> My hunch would be that SMM trap in the DSDT/SSDT since that's where
> things ended up last time I was tracing these resume problems. Though I
> can't recall if that was just with acpi-idle or if intel_idle landed in
> the same spot as well.
>
> I guess I can try to repeat that test tomorrow, or I'll try your function
> tracer method if the other thing fails.

I didn't manage to find a lot of time to play around with this, but it
definitely looks like the SMM trap is the problem here. I repeated my
pm_trace experiments, and when it gets stuck it is trying to execute the
_WAK ACPI method, which is where the SMM trap happens. Maybe the SMM
code was written with the expectation of a periodic tick or something
like that?

> >
> > You might try to enable function tracing and do a tracing_off() when that
> > 25ms timeout fires.
> >
> > Something like
> >
> >     stop_trace = true;
> >
> > in broadcast_resume() and then in the broadcast timer function:
> >
> >     if (stop_trace) {
> >         stop_trace = false;
> >         tracing_off();
> >     }
> >
> > Then when the machine is up read the trace, compress and upload it
> > somewhere or send it in private mail if it's not that big.
> >
> > Thanks,
> >
> >     tglx
>
> --
> Ville Syrjälä
> Intel OTC

--
Ville Syrjälä
Intel OTC