Hello Oda-san,

I have a xen-syms vmcore that finds a path that the hypervisor-related
changes in lkcd_x86_trace.c cannot handle.  When the back trace runs
into the "process_softirqs" text return address reference from
"xen/arch/x86/x86_32/entry.S", it cannot go any further.  Therefore
the backtrace fails, and in the recovery code it incorrectly searches
for a (vmlinux) eframe:

crash> bt -a
PCPU: 0 VCPU: ffbc7080
bt: cannot resolve stack trace:
 #0 [ff1d3ebc] elf_core_save_regs at ff10a810
 #1 [ff1d3ec4] common_interrupt at ff1222ed
 #2 [ff1d3ed0] do_nmi at ff1335bb
 #3 [ff1d3ef0] handle_nmi_mce at ff17442e
 #4 [ff1d3f24] csched_tick at ff110aa7
 #5 [ff1d3f80] timer_softirq_action at ff1155d2
 #6 [ff1d3fa0] do_softirq at ff1143fe
 #7 [ff1d3fb0] process_softirqs at ff173f61
bt: text symbols on stack:
    [ff1d3ebc] disable_local_APIC at ff11db75
    [ff1d3ec0] crash_nmi_callback at ff13cc96
    [ff1d3ec4] common_interrupt at ff1222f2
    [ff1d3ed0] do_nmi at ff1335c1
    [ff1d3ef0] handle_nmi_mce at ff174435
    [ff1d3f18] csched_tick at ff110aa7
    [ff1d3f80] timer_softirq_action at ff1155d4
    [ff1d3fa0] do_softirq at ff114405
    [ff1d3fb0] process_softirqs at ff173f66
bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()

[/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94

  813ed94: SIZE_verify+126
  810c40c: x86_eframe_search+1075
  8164497: handle_trace_error+692
  816373b: lkcd_x86_back_trace+2370

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
crash>

Now, the bogus vmlinux eframe search can be avoided by doing this
in handle_trace_error():

--- lkcd_x86_trace.c.orig	2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c	2008-10-14 16:09:26.000000000 -0400
@@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
 	bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
 	back_trace(bt);
-	bt->flags = BT_EFRAME_COUNT;
-	if ((cnt = machdep->eframe_search(bt))) {
-		error(INFO, "possible exception frame%s:\n",
-			cnt > 1 ? "s" : "");
-		bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
-		machdep->eframe_search(bt);
+	if (!XEN_HYPER_MODE()) {
+		bt->flags = BT_EFRAME_COUNT;
+		if ((cnt = machdep->eframe_search(bt))) {
+			error(INFO, "possible exception frame%s:\n",
+				cnt > 1 ? "s" : "");
+			bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
+			machdep->eframe_search(bt);
+		}
 	}
 }

After doing the above, the bt -a shows this, and therefore does not
fail prematurely:

crash> bt -a
PCPU: 0 VCPU: ffbc7080
bt: cannot resolve stack trace:
 #0 [ff1d3ebc] elf_core_save_regs at ff10a810
 #1 [ff1d3ec4] common_interrupt at ff1222ed
 #2 [ff1d3ed0] do_nmi at ff1335bb
 #3 [ff1d3ef0] handle_nmi_mce at ff17442e
 #4 [ff1d3f24] csched_tick at ff110aa7
 #5 [ff1d3f80] timer_softirq_action at ff1155d2
 #6 [ff1d3fa0] do_softirq at ff1143fe
 #7 [ff1d3fb0] process_softirqs at ff173f61
bt: text symbols on stack:
    [ff1d3ebc] disable_local_APIC at ff11db75
    [ff1d3ec0] crash_nmi_callback at ff13cc96
    [ff1d3ec4] common_interrupt at ff1222f2
    [ff1d3ed0] do_nmi at ff1335c1
    [ff1d3ef0] handle_nmi_mce at ff174435
    [ff1d3f18] csched_tick at ff110aa7
    [ff1d3f80] timer_softirq_action at ff1155d4
    [ff1d3fa0] do_softirq at ff114405
    [ff1d3fb0] process_softirqs at ff173f66

PCPU: 1 VCPU: ff1b6080
...

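To make the intent of that guard explicit, here is a minimal standalone
sketch, not crash code.  The names recover_eframes(), eframe_search_fn
and count_calls() are hypothetical stand-ins for handle_trace_error(),
machdep->eframe_search and a real search routine.  The only point is
that the kernel-specific search, which needs the task_struct size, is
never entered when analysing a hypervisor dump:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the machdep->eframe_search hook. */
typedef int (*eframe_search_fn)(void *ctx);

/* Hypothetical search routine: counts how often it was called. */
int count_calls(void *ctx)
{
    int *calls = ctx;
    (*calls)++;
    return 1;               /* pretend one exception frame was found */
}

/*
 * Run the exception-frame search only for a kernel (vmlinux) dump.
 * Returns the number of frames found, or 0 if the search was skipped.
 */
int recover_eframes(bool hypervisor_mode, eframe_search_fn search, void *ctx)
{
    assert(search != NULL);
    if (hypervisor_mode)
        return 0;           /* no task_struct in xen-syms, so skip */
    return search(ctx);
}
```

With hypervisor_mode set, the hook is not called at all, which is the
behaviour the patch above is after.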
Carrying it one step further, and given that the relevant part of
the stack from above looks like this:

crash> rd -s ff1d3ebc 84
ff1d3ebc:  disable_local_APIC+5  crash_nmi_callback+38  common_interrupt+82  cpu0_stack+16076
ff1d3ecc:  0003d027  do_nmi+49  cpu0_stack+16120  00000000
ff1d3edc:  ffbca000  ffbcbeb0  00000030  cpu0_stack+16308
ff1d3eec:  0000e010  handle_nmi_mce+91  cpu0_stack+16120  00000100
ff1d3efc:  00000005  000000ff  000005dc  ffbdee88
ff1d3f0c:  00000000  00000960  00020000  csched_tick+1239
ff1d3f1c:  0000e008  00000083  ffbc7080  00000030
ff1d3f2c:  0003d027  80000003  000583a8  per_cpu__schedule_data
ff1d3f3c:  c840ceb2  00000000  ffbfda80  00000000
ff1d3f4c:  00000000  00000000  00000100  00000960
ff1d3f5c:  ffbdee80  00000246  000000ff  csched_priv+4
ff1d3f6c:  00000000  ffbfda8c  __per_cpu_data_end+54972  e4c5d8d9
ff1d3f7c:  0000008b  timer_softirq_action+132  00000000  ffbc7080
ff1d3f8c:  per_cpu__timers  00000000  cpu0_stack+16308  0000007b
ff1d3f9c:  eaed7700  do_softirq+53  00000000  ffbc7080
ff1d3fac:  0000007b  process_softirqs+6  eb396d84  00000002
ff1d3fbc:  c0678470  c0678470  00000002  eaed7700
ff1d3fcc:  00000000  000d0000  c04011a7  00000061
ff1d3fdc:  00000202  eb396d48  00000069  0000007b
ff1d3fec:  0000007b  00000000  00000000  00000000
ff1d3ffc:  ffbc7080  ffffffff  ffffffff  ffffffff
crash>

Clearly "process_softirqs" is the last text return address reference
that the backtrace code can work with.

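As a rough illustration of that observation, the sketch below scans the
tail of the stack (the 16 words starting at ff1d3f9c) for the last word
that falls inside hypervisor text.  This is not how crash does it.  The
text bounds HYPER_TEXT_START/HYPER_TEXT_END are an assumption chosen
only to cover the function addresses seen in the trace, and the helper
names are hypothetical.  The two symbolic entries are written as the
addresses reported by "text symbols on stack":

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed hypervisor text range; real bounds come from xen-syms. */
#define HYPER_TEXT_START 0xff100000u
#define HYPER_TEXT_END   0xff180000u

/* Stack words from ff1d3f9c through ff1d3fd8 in the dump above. */
const uint32_t stack_tail[16] = {
    0xeaed7700, 0xff114405 /* do_softirq+53 */, 0x00000000, 0xffbc7080,
    0x0000007b, 0xff173f66 /* process_softirqs+6 */, 0xeb396d84, 0x00000002,
    0xc0678470, 0xc0678470, 0x00000002, 0xeaed7700,
    0x00000000, 0x000d0000, 0xc04011a7, 0x00000061,
};

/* Non-zero if addr lies inside the assumed hypervisor text range. */
int is_hyper_text(uint32_t addr)
{
    return addr >= HYPER_TEXT_START && addr < HYPER_TEXT_END;
}

/* Index of the last word that points into hypervisor text, or -1. */
int last_text_reference(const uint32_t *words, size_t count)
{
    int last = -1;

    assert(words != NULL);
    for (size_t i = 0; i < count; i++)
        if (is_hyper_text(words[i]))
            last = (int)i;
    return last;
}
```

Under those assumptions the last hit is the process_softirqs+6 word at
ff1d3fb0.  Everything after it is either small constants or addresses
outside the assumed hypervisor text range (for example c0678470 and
c04011a7).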
So to try to clean up the backtrace, I added this:

--- lkcd_x86_trace.c.orig	2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c	2008-10-14 16:09:26.000000000 -0400
@@ -1423,6 +1423,7 @@ find_trace(
 	if (XEN_HYPER_MODE()) {
 		func_name = kl_funcname(pc);
 		if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
+			|| STREQ(func_name, "process_softirqs")
 			|| STREQ(func_name, "tracing_off")
 			|| STREQ(func_name, "handle_exception")) {
 			UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);

which shows:

crash> bt -a
PCPU: 0 VCPU: ffbc7080
 #0 [ff1d3ebc] elf_core_save_regs at ff10a810
 #1 [ff1d3ec4] common_interrupt at ff1222ed
 #2 [ff1d3ed0] do_nmi at ff1335bb
 #3 [ff1d3ef0] handle_nmi_mce at ff17442e
 #4 [ff1d3f24] csched_tick at ff110aa7
 #5 [ff1d3f80] timer_softirq_action at ff1155d2
 #6 [ff1d3fa0] do_softirq at ff1143fe
 #7 [ff1d3fb0] process_softirqs at ff173f61

PCPU: 1 VCPU: ff1b6080
...

With the second patch applied, this vmcore no longer reaches the
eframe search, so the first patch is not strictly needed for it.
Still, it seems that the first patch should be left in place to
cover other unforeseen possibilities in the future.

Do you agree with these changes?

Thanks,
  Dave