On Thu, 30 Mar 2017 22:01:31 +0200, Lyude Paul wrote:
>
> On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
> <snip>
> >
> > Sure, if we get a proper stack dump, we can analyze it somehow. You
> > can use addr2line, or even check objdump output manually.
> > But in this case, as already mentioned, it was impossible to get any
> > sensible stack trace on my machine with 4.11-rc, so far,
> > unfortunately. So no material to read.
>
> huh? I thought that was what the file called "screenshot showing kernel
> panic trace" on the bugzilla was (although that backtrace definitely
> didn't look too relevant)...

It's not from my machine, and it's not from 4.11-rc. It's a screenshot
taken on 4.4.x openSUSE kernel with the backport of your fix. So it
might be some help, but the stack trace there is merely a red herring.

The reason I couldn't get such a screenshot is that VT switching is
broken on 4.11 in multiple ways. One VT bug got fixed in 4.11-rc4, but
another still remains....

> anyway if you are having trouble getting
> just a stack trace though, one of my coworkers here has taught me a
> trick called divide and conquer.
>
> The idea is pretty simple. Let's say we have a block of code like this
> in the kernel
>
> void some_resume_func() {
>     cool_function_call();
>     this_is_neat_too();
>
>     foo();
>     bar();
>     death();
>     baz();
>     zab();
> }
>
> And you know it's crashing inside this function on resume (e.g. it
> could be in foo(), bar(), or that suspicious death() function) but you
> have no way of getting a back trace.
>
> This is where the trick comes in: while you might not be able to get a
> stack trace, you can probably at least tell the difference between when
> the machine reboots immediately as a result of calling
> emergency_restart(), and whether it's just hanging due to the bug.
>
> So what you do is kind of like bisecting, except instead of testing
> different commits you see what happens when you insert a call to
> emergency_restart() and move it around:
>
> - Try #1:
>
> void some_resume_func() {
>     cool_function_call();
>     this_is_neat_too();
>
>     foo();
>     emergency_restart();
>     bar();
>     death();
>     baz();
>     zab();
> }
>
> The machine immediately reboots, so the problem is below where we
> inserted the emergency_restart() call.
>
> - Try #2:
>
> void some_resume_func() {
>     cool_function_call();
>     this_is_neat_too();
>
>     foo();
>     bar();
>     death();
>     emergency_restart();
>     baz();
>     zab();
> }
>
> The machine hangs, so we know the problem's either in the call to bar()
> or death().
>
> - Try #3:
>
> void some_resume_func() {
>     cool_function_call();
>     this_is_neat_too();
>
>     foo();
>     bar();
>     emergency_restart();
>     death();
>     baz();
>     zab();
> }
>
> The machine reboots immediately this time, which means that the problem
> has to be occurring inside the suspicious death() function. Of course,
> if we want to keep debugging further we can go into the death()
> function itself and try the same thing to figure out which line inside
> it is causing the issue.

Heh, divide-and-conquer is also how I arrived at my patch :)
I took the possible cause (the call of intel_dp_mst_resume()), split it
up, and luckily it worked on the first try.

> So if you do this except around wherever it looks like this crash might
> be happening. From:
>
> https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5
>
> It sounds like this happens on hotplugging, so the place to start this
> would probably be i915_hotplug_work_func(). Keep going down the call
> stack there and you should eventually find the culprit.
>
> The only complication I foresee here is that you'll have to write a
> little bit of additional debugging code so that
> i915_hotplug_work_func() doesn't actually call emergency_restart()
> until right before the moment where the crash happens.
> This shouldn't
> be too difficult, you could do something like add a module parameter to
> i915 that you change right before the final step of reproducing the bug
> that enables the calls to emergency_restart(). If you have any trouble
> with this part, feel free to let me know and I'll hack together a quick
> patch you can use.

Right, that's the most difficult part; to reproduce the crash, we need
multiple suspend/resume and dock/undock cycles, so the code path may be
executed multiple times. And tracking i915_hotplug_work_func() in the
way you suggested isn't so trivial, as it's full of indirect calls...

A trick I often use instead is to put additional (very long) delays
between the suspected code lines, marking each one via trace_ or normal
printk, and to track which point we manage to reach. Then you don't
need frequent reboots, just a few long runs. Of course, it can't be
used in irq context, but for the work, it's OK.

Maybe I'll give it a try, but likely later next week; I'll be very busy
with other tasks tomorrow, sorry.


thanks,

Takashi

>
> Lemme know if this helps at all :).
>
> >
> > That is, the problem isn't how to translate it, but how to get it.
> > Normal ways didn't work. Maybe I can try AMT, but I doubt that it'll
> > give any output since kdump already failed...
> >
> >
> > thanks,
> >
> > Takashi
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
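
For reference, the gating idea discussed in this thread (only allow the
emergency_restart() calls once a switch has been flipped, so earlier
passes through the same code path are harmless) could look roughly like
the sketch below. This is not a patch from either poster. It is a
userspace model so it can be built and run without rebooting anything:
debug_restart_armed, debug_checkpoint(), emergency_restart_stub() and
restart_calls are hypothetical names made up for illustration. In an
actual i915 debugging build the flag would presumably be a bool module
parameter and the stub would be the kernel's emergency_restart().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical gate. In a kernel build this would be a module parameter
 * that is flipped right before the final step of reproducing the bug. */
bool debug_restart_armed;

/* Counts stub invocations so the sketch can be checked in userspace. */
int restart_calls;

/* Userspace stand-in for the kernel's emergency_restart(). */
void emergency_restart_stub(void)
{
	restart_calls++;
}

/* Drop this between suspect calls. It does nothing until the gate is
 * armed, so repeated suspend/resume or dock/undock passes before the
 * final reproduction step do not trigger a restart. */
void debug_checkpoint(const char *where)
{
	if (!debug_restart_armed)
		return;
	fprintf(stderr, "checkpoint reached: %s\n", where);
	emergency_restart_stub();
}
```

A call such as debug_checkpoint("after bar()") would then be moved
between the suspect lines exactly as in the Try #1 to #3 examples, with
the flag left off until the last reproduction step.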