Hi Dave,

> Do you agree with these changes?

Yes. Thank you.

Itsuro Oda

On Tue, 14 Oct 2008 16:30:18 -0400 (EDT)
Dave Anderson <anderson@xxxxxxxxxx> wrote:

>
> Hello Oda-san,
>
> I have a xen-syms vmcore that finds a path that the hypervisor-related
> changes in lkcd_x86_trace.c cannot handle. When the back trace runs
> into the "process_softirqs" text return address reference from
> "xen/arch/x86/x86_32/entry.S", it cannot go any further. Therefore
> the backtrace fails, and in the recovery code it incorrectly searches
> for a (vmlinux) eframe:
>
>   crash> bt -a
>   PCPU: 0 VCPU: ffbc7080
>   bt: cannot resolve stack trace:
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   bt: text symbols on stack:
>       [ff1d3ebc] disable_local_APIC at ff11db75
>       [ff1d3ec0] crash_nmi_callback at ff13cc96
>       [ff1d3ec4] common_interrupt at ff1222f2
>       [ff1d3ed0] do_nmi at ff1335c1
>       [ff1d3ef0] handle_nmi_mce at ff174435
>       [ff1d3f18] csched_tick at ff110aa7
>       [ff1d3f80] timer_softirq_action at ff1155d4
>       [ff1d3fa0] do_softirq at ff114405
>       [ff1d3fb0] process_softirqs at ff173f66
>
>   bt: invalid structure size: task_struct
>       FILE: x86.c LINE: 1576 FUNCTION: x86_eframe_search()
>
>   [/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94
>
>     813ed94: SIZE_verify+126
>     810c40c: x86_eframe_search+1075
>     8164497: handle_trace_error+692
>     816373b: lkcd_x86_back_trace+2370
>
>   bt: invalid structure size: task_struct
>       FILE: x86.c LINE: 1576 FUNCTION: x86_eframe_search()
>
>   crash>
>
> Now, the bogus vmlinux eframe search can be avoided by doing this in
> handle_trace_error():
>
> --- lkcd_x86_trace.c.orig 2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c 2008-10-14 16:09:26.000000000 -0400
> @@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
>                  bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
>                  back_trace(bt);
>
> -                bt->flags = BT_EFRAME_COUNT;
> -                if ((cnt = machdep->eframe_search(bt))) {
> -                        error(INFO, "possible exception frame%s:\n",
> -                                cnt > 1 ? "s" : "");
> -                        bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> -                        machdep->eframe_search(bt);
> +                if (!XEN_HYPER_MODE()) {
> +                        bt->flags = BT_EFRAME_COUNT;
> +                        if ((cnt = machdep->eframe_search(bt))) {
> +                                error(INFO, "possible exception frame%s:\n",
> +                                        cnt > 1 ? "s" : "");
> +                                bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> +                                machdep->eframe_search(bt);
> +                        }
>                  }
>          }
>
> After doing the above, the bt -a shows this, and therefore does
> not fail prematurely:
>
>   crash> bt -a
>   PCPU: 0 VCPU: ffbc7080
>   bt: cannot resolve stack trace:
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   bt: text symbols on stack:
>       [ff1d3ebc] disable_local_APIC at ff11db75
>       [ff1d3ec0] crash_nmi_callback at ff13cc96
>       [ff1d3ec4] common_interrupt at ff1222f2
>       [ff1d3ed0] do_nmi at ff1335c1
>       [ff1d3ef0] handle_nmi_mce at ff174435
>       [ff1d3f18] csched_tick at ff110aa7
>       [ff1d3f80] timer_softirq_action at ff1155d4
>       [ff1d3fa0] do_softirq at ff114405
>       [ff1d3fb0] process_softirqs at ff173f66
>
>   PCPU: 1 VCPU: ff1b6080
>   ...
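For readers who want to see the guard from the handle_trace_error() hunk in isolation, here is a minimal standalone sketch. Every name prefixed with sketch_ is a hypothetical stand-in invented for illustration (the flag value, the struct, and the mock search function are not the crash definitions); only the control flow follows the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for BT_EFRAME_COUNT; the value is an assumption. */
#define SKETCH_BT_EFRAME_COUNT 0x1ULL

/* Hypothetical stand-in for struct bt_info; only the flags field is modeled. */
struct sketch_bt_info {
    unsigned long long flags;
};

/* Number of times the mock eframe search has been invoked. */
static int sketch_search_calls;

/* Mock of machdep->eframe_search(): always reports one candidate frame. */
static int sketch_eframe_search(struct sketch_bt_info *bt)
{
    (void)bt;
    sketch_search_calls++;
    return 1;
}

/*
 * Follows the shape of the patched recovery path: the exception-frame
 * search runs only when xen_hyper_mode is false.
 * Returns the count reported by the first search call, or 0 if skipped.
 */
static int sketch_recover(struct sketch_bt_info *bt, bool xen_hyper_mode)
{
    int cnt = 0;

    assert(bt != NULL);
    if (!xen_hyper_mode) {
        /* Set the count flag before the first call, as the quoted code does. */
        bt->flags = SKETCH_BT_EFRAME_COUNT;
        if ((cnt = sketch_eframe_search(bt))) {
            printf("possible exception frame%s:\n", cnt > 1 ? "s" : "");
            /* Clear the count flag and call the search a second time. */
            bt->flags &= ~SKETCH_BT_EFRAME_COUNT;
            sketch_eframe_search(bt);
        }
    }
    return cnt;
}
```

Calling sketch_recover(&bt, true) leaves the mock search untouched, which corresponds to the hypervisor case in which the task_struct size error no longer appears.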
>
> Carrying it one step further, and given that the relevant part
> of the stack from above looks like this:
>
>   crash> rd -s ff1d3ebc 84
>   ff1d3ebc:  disable_local_APIC+5  crash_nmi_callback+38  common_interrupt+82  cpu0_stack+16076
>   ff1d3ecc:  0003d027  do_nmi+49  cpu0_stack+16120  00000000
>   ff1d3edc:  ffbca000  ffbcbeb0  00000030  cpu0_stack+16308
>   ff1d3eec:  0000e010  handle_nmi_mce+91  cpu0_stack+16120  00000100
>   ff1d3efc:  00000005  000000ff  000005dc  ffbdee88
>   ff1d3f0c:  00000000  00000960  00020000  csched_tick+1239
>   ff1d3f1c:  0000e008  00000083  ffbc7080  00000030
>   ff1d3f2c:  0003d027  80000003  000583a8  per_cpu__schedule_data
>   ff1d3f3c:  c840ceb2  00000000  ffbfda80  00000000
>   ff1d3f4c:  00000000  00000000  00000100  00000960
>   ff1d3f5c:  ffbdee80  00000246  000000ff  csched_priv+4
>   ff1d3f6c:  00000000  ffbfda8c  __per_cpu_data_end+54972  e4c5d8d9
>   ff1d3f7c:  0000008b  timer_softirq_action+132  00000000  ffbc7080
>   ff1d3f8c:  per_cpu__timers  00000000  cpu0_stack+16308  0000007b
>   ff1d3f9c:  eaed7700  do_softirq+53  00000000  ffbc7080
>   ff1d3fac:  0000007b  process_softirqs+6  eb396d84  00000002
>   ff1d3fbc:  c0678470  c0678470  00000002  eaed7700
>   ff1d3fcc:  00000000  000d0000  c04011a7  00000061
>   ff1d3fdc:  00000202  eb396d48  00000069  0000007b
>   ff1d3fec:  0000007b  00000000  00000000  00000000
>   ff1d3ffc:  ffbc7080  ffffffff  ffffffff  ffffffff
>   crash>
>
> Clearly "process_softirqs" is the last text return address
> reference that the backtrace code can work with.
> So to try to clean up the backtrace, I added this:
>
> --- lkcd_x86_trace.c.orig 2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c 2008-10-14 16:09:26.000000000 -0400
> @@ -1423,6 +1423,7 @@ find_trace(
>                  if (XEN_HYPER_MODE()) {
>                          func_name = kl_funcname(pc);
>                          if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
> +                                || STREQ(func_name, "process_softirqs")
>                                  || STREQ(func_name, "tracing_off")
>                                  || STREQ(func_name, "handle_exception")) {
>                                  UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);
>
> which shows:
>
>   crash> bt -a
>   PCPU: 0 VCPU: ffbc7080
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>
>   PCPU: 1 VCPU: ff1b6080
>   ...
>
> The patch to avoid eframe search can be avoided entirely by applying
> the second patch, but it seems that it should be left in place for
> other unforeseen possibilities in the future.
>
> Do you agree with these changes?
>
> Thanks,
>   Dave
>

--
Itsuro ODA <oda@xxxxxxxxxxxxx>

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility
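The name test that the find_trace() hunk extends can also be tried outside crash. The sketch below is an assumption-laden illustration: the table layout and the sketch_ names are invented here, while the five function names are copied from the quoted hunk, including the newly added "process_softirqs". It uses plain strcmp() in place of crash's STREQ macro.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Names listed in the quoted find_trace() condition. Based on the "bt -a"
 * output in the message, reaching one of them appears to end the hypervisor
 * back trace cleanly; the table form itself is an assumption.
 */
static const char *const sketch_terminal_funcs[] = {
    "idle_loop",
    "hypercall",
    "process_softirqs",
    "tracing_off",
    "handle_exception",
};

/* Returns true if func_name exactly matches one of the listed names. */
static bool sketch_is_terminal_frame(const char *func_name)
{
    size_t i;
    size_t n = sizeof(sketch_terminal_funcs) / sizeof(sketch_terminal_funcs[0]);

    if (func_name == NULL)
        return false;
    for (i = 0; i < n; i++) {
        if (strcmp(func_name, sketch_terminal_funcs[i]) == 0)
            return true;
    }
    return false;
}
```

A table keeps the list in one place, so a further entry point found in some later vmcore would be a one-line addition; whether crash itself should be restructured that way is a separate question from the patch under discussion.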