----- Original Message -----
>
> RE: Crash faults when determining panic task
>
> > It would be interesting to find out what happened in the
> > x86_process_elf_notes() function.

Thanks for your help debugging this -- the dumpfile contains pretty much
what I expected:

 (1) a single NT_PRSTATUS note (n_type 1, n_descsz 336)
 (2) followed by the VMCOREINFO note (n_type 0, n_descsz 1392), and
 (?) zero-filled dumpfile data (n_type 0, n_descsz 0)

By comparison, if I add this debug printf() to x86_process_elf_notes()
and run it against an 8-way compressed kdump:

--- diskdump.c  30 Sep 2011 15:09:56 -0000      1.39
+++ diskdump.c  30 Sep 2011 18:58:28 -0000
@@ -243,7 +243,7 @@
        for (tot = 0; tot < size_note; tot += len) {
                if (machine_type("X86_64")) {
                        note64 = note_ptr + tot;
-
+fprintf(fp, "n_type: %d n_descsz: %d\n", note64->n_type, note64->n_descsz);
                        if (note64->n_type == NT_PRSTATUS) {
                                dd->nt_prstatus_percpu[num] = note64;
                                num++;

I see this:

  ...
  This program has absolutely no warranty.  Enter "help warranty" for details.

  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 0 n_descsz: 1373
  GNU gdb (GDB) 7.0
  Copyright (C) 2009 Free Software Foundation, Inc.
  ...

So there's no extra zero-filled dumpfile location that gets checked,
i.e., it cleanly works its way through the dumpfile's notes region.
I don't know why that's not true with your dumpfile.

And, as it turns out, the per-cpu readmem() complaint is perfectly
legitimate -- I see the same thing on the compressed kdump example
above.  It's just that the loop has gone beyond the end of the per-cpu
data -- in your case, it's trying to read non-existent per-cpu data
for the non-existent cpu 16.  So that's not a problem...
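
As an illustration of the note walk discussed above, here is a minimal
sketch and not the code in diskdump.c.  It assumes each note is a 12-byte
header followed by a name and a descriptor that are each padded to a
4-byte boundary.  Under that assumption, the two notes in the gdb session
quoted further down (n_namesz 5 / n_descsz 336 and n_namesz 11 /
n_descsz 1392) occupy 356 + 1416 = 1772 of the 1780 bytes in size_note,
which leaves 8 trailing bytes: fewer than one header.  The names
struct nhdr, note_len(), walk_notes() and make_sample_notes() are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Same three 32-bit fields as Elf64_Nhdr; redefined so the sketch
 * builds without <elf.h>. */
struct nhdr {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
};

enum { SKETCH_NT_PRSTATUS = 1 };

/* Assumption: name and descriptor are each padded to 4 bytes. */
static size_t roundup4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

static size_t note_len(const struct nhdr *n)
{
    return roundup4(roundup4(sizeof(*n) + n->n_namesz) + n->n_descsz);
}

/*
 * Hypothetical helper: walk a notes buffer and count NT_PRSTATUS notes.
 * Unlike a plain "tot < size" bound, it stops when fewer than
 * sizeof(struct nhdr) bytes remain or when the header is all zeroes,
 * so trailing padding is never interpreted as a note.
 * Returns the offset at which the walk stopped.
 */
static size_t walk_notes(const unsigned char *buf, size_t size, int *prstatus)
{
    size_t tot = 0;

    *prstatus = 0;
    while (tot + sizeof(struct nhdr) <= size) {
        struct nhdr n;

        memcpy(&n, buf + tot, sizeof(n));
        if (n.n_namesz == 0 && n.n_descsz == 0 && n.n_type == 0)
            break;                      /* zero-filled padding */
        if (n.n_type == SKETCH_NT_PRSTATUS)
            (*prstatus)++;
        tot += note_len(&n);
    }
    return tot;
}

/* Lay out only the two note headers seen in the gdb session; the name
 * and descriptor bytes are left as zeroes because the walk ignores them. */
static void make_sample_notes(unsigned char *buf, size_t size)
{
    struct nhdr first  = { 5, 336, SKETCH_NT_PRSTATUS };
    struct nhdr second = { 11, 1392, 0 };

    assert(size >= 1780);
    memset(buf, 0, size);
    memcpy(buf, &first, sizeof(first));
    memcpy(buf + note_len(&first), &second, sizeof(second));
}
```

Called on a 1780-byte buffer prepared by make_sample_notes(), walk_notes()
stops at offset 1772 with one NT_PRSTATUS note counted, whereas a loop
bounded only by "tot < size_note" would make a third pass over the
zero-filled tail.
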

I still don't understand why the dumpfile doesn't have the other 15
NT_PRSTATUS notes, but until that patch was added into crash-5.1.5,
we never cared, and it would never have been noticed.  When I accepted
that patch, I was apprehensive that something like this might happen,
which is why I insisted that they also add the "--no_elf_notes" option
as a pre-emptive workaround:

> https://www.redhat.com/archives/crash-utility/2011-April/msg00030.html
>
> Finally, in the interest of paranoia, give the user the capability
> of *not* using this facility.  In main.c, create a "--no_elf_notes"
> option (similar to "--zero_excluded"), and have it set a NO_ELF_NOTES
> bit in the globally-accessible "*diskdump_flags".

So anyway, that all being the case, and with the two patches applied,
we've pretty much solved your problem from the crash utility's
perspective.  Perhaps there's a kernel kdump or makedumpfile issue,
but that's beyond the scope of this mailing list.

Thanks again,
  Dave

>
> *** Breakpoints in x86_process_elf_notes()...
>
> (gdb) break diskdump.c:245
> Breakpoint 1 at 0x52379b: file diskdump.c, line 245.
> (gdb) r
>
> Breakpoint 1, x86_process_elf_notes (note_ptr=0xd1e000,
> size_note=1780)
>     at diskdump.c:245
> 245                     note64 = note_ptr + tot;
> (gdb) p *(Elf64_Nhdr *)(note_ptr + tot)
> $1 = {n_namesz = 5, n_descsz = 336, n_type = 1}
> (gdb) c
> Continuing.
>
> Breakpoint 1, x86_process_elf_notes (note_ptr=0xd1e000,
> size_note=1780)
>     at diskdump.c:245
> 245                     note64 = note_ptr + tot;
> (gdb) p *(Elf64_Nhdr *)(note_ptr + tot)
> $2 = {n_namesz = 11, n_descsz = 1392, n_type = 0}
> (gdb) c
> Continuing.
>
> Breakpoint 1, x86_process_elf_notes (note_ptr=0xd1e000,
> size_note=1780)
>     at diskdump.c:245
> 245                     note64 = note_ptr + tot;
> (gdb) p *(Elf64_Nhdr *)(note_ptr + tot)
> $3 = {n_namesz = 0, n_descsz = 0, n_type = 0}
> (gdb) c
> Continuing.
>
>
> >> crash: page excluded: kernel virtual address: ffffffff81bb3b00  type: "cpu number (per_cpu)"
> >> crash: page excluded: kernel virtual address: ffffffff81bb3b00  type: "cpu number (per_cpu)"
> > [snip]
> > loop in both functions -- can you dump out which cpu's
> > per-cpu data was inaccessible?
>
> (gdb) break memory.c:1976
> Breakpoint 1 at 0x4722ff: file memory.c, line 1976.
> (gdb) set arg -d1 vmlinux vmcore
> (gdb) r
> Breakpoint 1, readmem (addr=18446744071591115520, memtype=1,
>     buffer=0x7fffffff5b5c, size=4, type=0x7c7744 "cpu number (per_cpu)",
>     error_handle=6) at memory.c:1976
> 1976            error(INFO, PAGE_EXCLUDED_ERRMSG, memtype_string(memtype, 0),
> addr, type);
> (gdb) up
> #1  0x00000000004e5871 in x86_64_get_smp_cpus () at x86_64.c:4674
> 4674            if (!readmem(sp->value + kt->__per_cpu_offset[i],
> (gdb) p cpunumber
> $1 = 15
> (gdb) p cpus
> $2 = 16
> (gdb) p i
> $3 = 16
> (gdb) p/x kt->__per_cpu_offset[0]@17
> $4 = {0xffff880028200000, 0xffff880028240000, 0xffff880028280000,
>   0xffff8800282c0000, 0xffff880287400000, 0xffff880287440000,
>   0xffff880287480000, 0xffff8802874c0000, 0xffff880028300000,
>   0xffff880028340000, 0xffff880028380000, 0xffff8800283c0000,
>   0xffff880287500000, 0xffff880287540000, 0xffff880287580000,
>   0xffff8802875c0000, 0xffffffff81ba6000}
>
>
> > Joe, do you know if the non-crashing cpus were in some kind of
> > bizarre state such that they would not respond to the shutdown NMI?
> > I suppose in that case, there would be only the one NT_PRSTATUS
> > note for the crashing cpu (plus the VMCOREINFO note).
>
> The other CPUs are almost all sitting idle, a few are running I/O.
>
> > In any case, so far I've got two patches queued to help address
> > the two segmentation violations generated by a scenario such as
> > this.
>
> Patches applied and verified no segmentation faults.
>
> I have uploaded this vmcore/vmlinux to our FTP site (details to come
> in private mail).
>
>
> Thanks,
>
> -- Joe Lawrence
> --
> Crash-utility mailing list
> Crash-utility@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/crash-utility
>