On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
<snip>
> 
> Sure, if we get a proper stack dump, we can analyze it somehow.  You
> can use addr2line, or even check objdump output manually.
> But in this case, as already mentioned, it was impossible to get any
> sensible stack trace on my machine with 4.11-rc, so far,
> unfortunately.  So no material to read.

Huh? I thought that was what the file called "screenshot showing kernel
panic trace" on the bugzilla was (although that backtrace definitely
didn't look too relevant)...

Anyway, if you are having trouble getting even a stack trace, one of my
coworkers here has taught me a trick called divide and conquer. The idea
is pretty simple. Let's say we have a block of code like this in the
kernel:

	void some_resume_func() {
		cool_function_call();
		this_is_neat_too();

		foo();
		bar();
		death();
		baz();
		zab();
	}

And you know it's crashing inside this function on resume (e.g. it could
be in foo(), bar(), or that suspicious death() function) but you have no
way of getting a backtrace. This is where the trick comes in: while you
might not be able to get a stack trace, you can probably at least tell
the difference between the machine rebooting immediately as a result of
calling emergency_restart() and the machine just hanging due to the bug.
So what you do is kind of like bisecting, except instead of testing
different commits you see what happens when you insert a call to
emergency_restart() and move it around:

 - Try #1:
	void some_resume_func() {
		cool_function_call();
		this_is_neat_too();

		foo();
		emergency_restart();
		bar();
		death();
		baz();
		zab();
	}
   The machine immediately reboots, so everything up to and including
   foo() ran fine and the problem is below where we inserted the
   emergency_restart() call.
 - Try #2:
	void some_resume_func() {
		cool_function_call();
		this_is_neat_too();

		foo();
		bar();
		death();
		emergency_restart();
		baz();
		zab();
	}
   The machine hangs, so we never reached emergency_restart() and we
   know the problem's either in the call to bar() or death().
 - Try #3:
	void some_resume_func() {
		cool_function_call();
		this_is_neat_too();

		foo();
		bar();
		emergency_restart();
		death();
		baz();
		zab();
	}
   The machine reboots immediately this time, so bar() completed fine,
   which means that the problem has to be occurring inside the
   suspicious death() function.

Of course, if we want to keep debugging further we can go into the
death() function itself and try the same thing to figure out which line
inside it is causing the issue.

So you'd do the same thing here, except around wherever it looks like
this crash might be happening. From:

https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5

It sounds like this happens on hotplugging, so the place to start this
would probably be i915_hotplug_work_func(). Keep going down the call
stack there and you should eventually find the culprit.

The only complication I foresee here is that you'll have to write a
little bit of additional debugging code so that i915_hotplug_work_func()
doesn't actually call emergency_restart() until right before the moment
where the crash happens. This shouldn't be too difficult; you could do
something like add a module parameter to i915 that enables the calls to
emergency_restart(), and flip it right before the final step of
reproducing the bug. If you have any trouble with this part, feel free
to let me know and I'll hack together a quick patch you can use.

Lemme know if this helps at all :).

> 
> That is, the problem isn't how to translate it, but how to get it.
> Normal ways didn't work.  Maybe I can try AMT, but I doubt that it'll
> give any output since kdump already failed...
> 
> 
> thanks,
> 
> Takashi
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
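
For reference, the "only reboot once armed" idea from the message above could be sketched roughly like this. It is a userspace stand-in, not an i915 patch: restart_armed, restart_probe(), emergency_restart_stub() and some_work_func() are all hypothetical names made up for illustration, and the stub only records which marker fired instead of rebooting. In a real kernel build the flag would presumably be a module parameter and the stub would be the actual emergency_restart() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * ASSUMPTION: hypothetical debug knob. In a real i915 patch this would be
 * a module parameter flipped right before the final reproduction step.
 */
static bool restart_armed;

/* ID of the first probe that fired, or -1 if none has fired yet. */
static int fired_probe = -1;

/*
 * Userspace stand-in for the kernel's emergency_restart(). The real call
 * never returns; this stub only records which probe would have rebooted.
 */
static void emergency_restart_stub(int probe_id)
{
	if (fired_probe < 0) {
		fired_probe = probe_id;
		printf("probe %d: would reboot here\n", probe_id);
	}
}

/* Reboot marker that stays inert until the knob is armed. */
static void restart_probe(int probe_id)
{
	assert(probe_id >= 0);	/* -1 is reserved for "nothing fired" */
	if (restart_armed)
		emergency_restart_stub(probe_id);
}

/* Placeholder bodies standing in for the calls being bisected. */
static void foo(void) { }
static void bar(void) { }
static void death(void) { }

/* Hypothetical worker with one marker placed just before death(). */
static void some_work_func(void)
{
	foo();
	bar();
	restart_probe(1);
	death();
}
```

With the knob left off, some_work_func() can run any number of times during boot and normal use without tripping the marker; setting restart_armed just before the reproducing hotplug makes the next pass hit it. Moving restart_probe(1) up or down between runs is then the same bisection as in the Try #1 to #3 steps.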