Hi Dave,

Before the patch mentioned below, makedumpfile hardcoded nr_cpus to 1 in
the header.

..........................
author Ken'ichi Ohmichi <oomichi@xxxxxxxxxxxxxxxxx>
       Thu, 11 Nov 2010 03:53:16 +0000 (12:53 +0900)

Before applying this patch, makedumpfile sets "1" to nr_cpus in kdump
main header always even if a machine has multiple cpus. As the result,
the subcommand "help -n" of the crash utility prints an invalid value
"1" as nr_cpus.
...........................

Could this be part of Joe's issue?

Thanks,
Jeff Hagen

-----Original Message-----
From: crash-utility-bounces@xxxxxxxxxx [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
Sent: Thursday, September 29, 2011 2:04 PM
To: Discussion list for crash utility usage, maintenance and development
Subject: Re: Crash faults when determining panic task

----- Original Message -----
>
> Hi Dave,
>
> I hope I have captured everything you asked for here, if remote
> debugging over e-mail is too tedious, I can arrange to post a
> vmlinux/vmcore on our FTP site (roughly 600MB together).

Sure, you can do that if you'd like.

But anyway, the crash -d1 output is illuminating.

You've got a 16-cpu system, with all cpus online.
But the compressed kdump header only saw 1 cpu when it was created:

> header: 2cc1fe0
>   signature: "KDUMP "
>   header_version: 4
>   utsname:
>     sysname: Linux
>     nodename: bahamut.mno.stratus.com
>     release: 2.6.32-131.0.15.el6.exp10.bz16586.x86_64
>     version: #1 SMP Thu Jun 16 13:13:45 EDT 2011
>     machine: x86_64
>     domainname: sraeng
>   timestamp:
>     tv_sec: 4e4fe6e3
>     tv_usec: 0
>   status: 0 ()
>   block_size: 4096
>   sub_hdr_size: 1
>   bitmap_blocks: 288
>   max_mapnr: 4718592
>   total_ram_blocks: 0
>   device_blocks: 0
>   written_blocks: 0
>   current_cpu: 0
>   nr_cpus: 1    <== should be 16
>   tasks[nr_cpus]: 0

And farther on, here is the array of note pointers that I was asking
about:

> num_prstatus_notes: 1
> notes_buf: 2cc4000
> notes[0]: 2cc4000

Since the dumpfile header's nr_cpus was 1, the array has only one entry.
I cannot explain that. But the crash utility can only deal with what it
finds in the dumpfile.

Furthermore, these two error messages, which indicate that memory
containing per-cpu data was "excluded", are of prime importance here:

> crash: page excluded: kernel virtual address: ffffffff81bb3b00 type: "cpu number (per_cpu)"
> crash: page excluded: kernel virtual address: ffffffff81bb3b00 type: "cpu number (per_cpu)"

The fact that the page was specifically reported as "page excluded" is
troubling, because it should *never* have been filtered by
makedumpfile -d<level>. But since the dumpfile indicates that the
crucial per-cpu page was filtered, there's nothing that the crash
utility can do about it.

I'm guessing that, even though you are able to get to a prompt with
--no_elf_notes, any command that depends upon per-cpu data would fail.

Although, it might be interesting to know *which* cpu was in play when
those two error messages were displayed in x86_64_per_cpu_init() and
x86_64_get_smp_cpus(). There is a loop in both functions -- can you
dump out which cpu's per-cpu data was inaccessible?
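
To make the mismatch above concrete, here is a minimal sketch, and not the actual crash code, of the two defensive checks it suggests: refuse to index the per-cpu note array past the number of notes the dumpfile really contains, and reject a note header whose sizes cannot fit in the note area. The names `notes_table`, `note_hdr`, `prstatus_for_cpu()` and `note_looks_sane()` are hypothetical; only the field names `nt_prstatus_percpu` and `num_prstatus_notes` and the example values come from the output quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two diskdump_data fields seen in the thread. */
struct notes_table {
    void **nt_prstatus_percpu;        /* one pointer per NT_PRSTATUS note */
    unsigned int num_prstatus_notes;  /* number of valid entries in the array */
};

/* Same layout as Elf64_Nhdr: three 32-bit words. */
struct note_hdr {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
};

#define NOTE_TYPE_PRSTATUS 1u  /* NT_PRSTATUS is 1 in <elf.h> */

/* Return the note for a cpu, or NULL when the dumpfile has no note for it. */
static void *prstatus_for_cpu(const struct notes_table *t, unsigned int cpu)
{
    assert(t != NULL);
    if (t->nt_prstatus_percpu == NULL || cpu >= t->num_prstatus_notes)
        return NULL;
    return t->nt_prstatus_percpu[cpu];
}

/* ELF note name and descriptor are each padded to a 4-byte boundary. */
static uint64_t align4(uint64_t n)
{
    return (n + 3u) & ~(uint64_t)3u;
}

/* A note is plausible only if it is NT_PRSTATUS and fits in the bytes left. */
static int note_looks_sane(const struct note_hdr *n, uint64_t bytes_left)
{
    uint64_t need;

    if (n == NULL || bytes_left < sizeof(*n))
        return 0;
    if (n->n_type != NOTE_TYPE_PRSTATUS)
        return 0;
    need = sizeof(*n) + align4(n->n_namesz) + align4(n->n_descsz);
    return need <= bytes_left;
}
```

With num_prstatus_notes at 1, a lookup for cpu 1 through 15 would return NULL under this sketch rather than whatever happens to sit past the end of the array, and a header such as {n_namesz = 0x8, n_descsz = 0xccf80000, n_type = 0x8} would be rejected against a 1780-byte note area.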

Thanks,
  Dave

>
>
> *** Setup some breakpoints to watch bt->machdep:
>
> get_netdump_regs_x86_64(struct bt_info *bt, ulong *ripp, ulong *rspp)
> {
>         ...
>
>         if (((NETDUMP_DUMPFILE() || KDUMP_DUMPFILE()) &&
>              VALID_STRUCT(user_regs_struct) && (bt->task == tt->panic_task)) ||
>             (KDUMP_DUMPFILE() && (kt->flags & DWARF_UNWIND) &&
>              (bt->flags & BT_DUMPFILE_SEARCH))) {
>                 ...
> 2287            bt->machdep = (void *)user_regs;
>                 ...
>
>         if (ELF_NOTES_VALID() &&
>             (bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
>             (note = (Elf64_Nhdr *)
>              diskdump_get_prstatus_percpu(bt->tc->processor))) {
>                 ...
> 2306            bt->machdep = (void *)user_regs;
>                 ...
>
> (gdb) break get_netdump_regs_x86_64
> Breakpoint 1 at 0x519740: file netdump.c, line 2238.
> (gdb) break netdump.c:2287
> Breakpoint 2 at 0x519970: file netdump.c, line 2287.
> (gdb) break netdump.c:2306
> Breakpoint 3 at 0x5199e7: file netdump.c, line 2306.
> (gdb) r
>
> please wait... (determining panic task)
> Breakpoint 1, get_netdump_regs_x86_64 (bt=0x7fffffffcd70,
>     ripp=0x7fffffffcce0,
>     rspp=0x7fffffffcce8) at netdump.c:2238
> 2238    {
> (gdb) c
> Continuing.
>
> Breakpoint 3, get_netdump_regs_x86_64 (bt=0x7fffffffcd70,
>     ripp=0x7fffffffcce0,
>     rspp=0x7fffffffcce8) at netdump.c:2306
> 2306            bt->machdep = (void *)user_regs;
> (gdb) p user_regs
> $1 = 0xd14084 ""
> (gdb) c
> Continuing.
>
> Breakpoint 1, get_netdump_regs_x86_64 (bt=0x7fffffffcd70,
>     ripp=0x7fffffffcce0,
>     rspp=0x7fffffffcce8) at netdump.c:2238
> 2238    {
> (gdb) c
> Continuing.
>
> Program received signal SIGSEGV, Segmentation fault.
> x86_64_get_dumpfile_stack_frame (rsp=0x7fffffffcce8,
>     rip=0x7fffffffcce0,
>     bt_in=0x7fffffffcd70) at x86_64.c:4183
> 4183            ur_rip = ULONG(user_regs +
>
>
> *** So in its second invocation, get_netdump_regs_x86_64() never sets
> bt->machdep (only breakpoint 1 fired)
>
> *** Let's see what diskdump_get_prstatus_percpu() is returning
>
> (gdb) break diskdump_get_prstatus_percpu
> Breakpoint 1 at 0x526070: file diskdump.c, line 1451.
> (gdb) r
> please wait... (determining panic task)
> Breakpoint 1, diskdump_get_prstatus_percpu (cpu=0) at diskdump.c:1451
> 1451            return dd->nt_prstatus_percpu[cpu];
> (gdb) display dd->nt_prstatus_percpu[0]@16
> 1: dd->nt_prstatus_percpu[0]@16 = {0xd1c000, 0x0, 0x0, 0xd26472,
>     0xbf35ab2,
>     0xd26472, 0x200000012, 0xd1c850, 0xd1c600, 0x1010000012b,
>     0xffffffff814e4fa0, 0x14e4fa0, 0x4270, 0x0, 0x0, 0x0}
> (gdb) c
> Continuing.
>
> Breakpoint 1, diskdump_get_prstatus_percpu (cpu=1) at diskdump.c:1451
> 1451            return dd->nt_prstatus_percpu[cpu];
> 1: dd->nt_prstatus_percpu[0]@16 = {0xd1c000, 0x0, 0x0, 0xd26472,
>     0xbf35ab2,
>     0xd26472, 0x200000012, 0xd1c850, 0xd1c600, 0x1010000012b,
>     0xffffffff814e4fa0, 0x14e4fa0, 0x4270, 0x0, 0x0, 0x0}
>
>
> *** See crash -d1 vmlinux vmcore output at the bottom of the mail,
> particularly the part that says...
>
> crash: page excluded: kernel virtual address: ffffffff81bb3b00 type:
> "cpu number (per_cpu)"
> crash: get_cpus_present: present: 16
>
>
>
> *** Bogus note->n_descsz value
> *** Apply first patch to get us further into ELF Note processing
>
> From inside netdump.c :: get_regs_from_note() at the point of the
> fault, I don't see dd->nt_prstatus[], for dd is now type
> *diskdump_data... The *note passed in can be found in
> dd->nt_prstatus_percpu[] however...
>
> please wait... (determining panic task)
> Program received signal SIGSEGV, Segmentation fault.
> get_regs_from_note (note=0xd26472 "\b", ip=0x7fffffffc590,
>     sp=0x7fffffffc598)
>     at netdump.c:2221
> 2221            *sp = ULONG(user_regs + offset_sp);
> (gdb) p/x *((Elf64_Nhdr *)note)
> $1 = {n_namesz = 0x8, n_descsz = 0xccf80000, n_type = 0x8}
> (gdb) p dd->nt_prstatus_percpu[0]@16
> $2 = {0xd1c000, 0x0, 0x0, 0xd26472, 0xbf35ab2, 0xd26472, 0x200000012,
>     0xd1c850, 0xd1c600, 0x1010000012b, 0xffffffff814e4fa0, 0x14e4fa0,
>     0x4270,
>     0x0, 0x0, 0x0}
> (gdb) ptype dd
> type = struct diskdump_data {
>     char *filename;
>     ulong flags;
>     int dfd;
>     FILE *ofp;
>     int machine_type;
>     struct disk_dump_header *header;
>     struct disk_dump_sub_header *sub_header;
>     struct kdump_sub_header *sub_header_kdump;
>     size_t data_offset;
>     int block_size;
>     int block_shift;
>     char *bitmap;
>     int bitmap_len;
>     char *dumpable_bitmap;
>     int byte;
>     int bit;
>     char *compressed_page;
>     char *curbufptr;
>     unsigned char *notes_buf;
>     void **nt_prstatus_percpu;
>     uint num_prstatus_notes;
>     struct page_cache_hdr page_cache_hdr[16];
>     char *page_cache_buf;
>     int evict_index;
>     ulong evictions;
>     ulong cached_reads;
>     ulong *valid_pages;
>     ulong accesses;
> } *
>
>
>
> *** Unpatched crash -d1 vmlinux vmcore output:
>
> crash 5.1.8
> Copyright (C) 2002-2011 Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006 IBM Corporation
> Copyright (C) 1999-2006 Hewlett-Packard Co
> Copyright (C) 2005, 2006 Fujitsu Limited
> Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> Copyright (C) 2005 NEC Corporation
> Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public
> License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions. Enter "help copying" to see the conditions.
> This program has absolutely no warranty. Enter "help warranty" for
> details.
>
> compressed kdump: header->utsname.machine: x86_64
> diskdump_data:
>   filename: vmcore
>   flags: 6 (KDUMP_CMPRS_LOCAL|ERROR_EXCLUDED)
>   dfd: 3
>   ofp: 0
>   machine_type: 62 (EM_X86_64)
>
>   header: 2cc1fe0
>     signature: "KDUMP "
>     header_version: 4
>     utsname:
>       sysname: Linux
>       nodename: bahamut.mno.stratus.com
>       release: 2.6.32-131.0.15.el6.exp10.bz16586.x86_64
>       version: #1 SMP Thu Jun 16 13:13:45 EDT 2011
>       machine: x86_64
>       domainname: sraeng
>     timestamp:
>       tv_sec: 4e4fe6e3
>       tv_usec: 0
>     status: 0 ()
>     block_size: 4096
>     sub_hdr_size: 1
>     bitmap_blocks: 288
>     max_mapnr: 4718592
>     total_ram_blocks: 0
>     device_blocks: 0
>     written_blocks: 0
>     current_cpu: 0
>     nr_cpus: 1
>     tasks[nr_cpus]: 0
>
>   sub_header: 0 (n/a)
>
>   sub_header_kdump: 2cc2ff0
>     phys_base: 0
>     dump_level: 31 (0x1f)
>       (DUMP_EXCLUDE_ZERO|DUMP_EXCLUDE_CACHE|DUMP_EXCLUDE_CACHE_PRI|DUMP_EXCLUDE_USER_DATA|DUMP_EXCLUDE_FREE)
>     offset_vmcoreinfo: 11bc
>     size_vmcoreinfo: 1392
>       OSRELEASE=2.6.32-131.0.15.el6.exp10.bz16586.x86_64
>       PAGESIZE=4096
>       SYMBOL(init_uts_ns)=ffffffff81a2e8c0
>       SYMBOL(node_online_map)=ffffffff81ba0860
>       SYMBOL(swapper_pg_dir)=ffffffff81a25000
>       SYMBOL(_stext)=ffffffff81000198
>       SYMBOL(vmlist)=ffffffff81ee60b8
>       SYMBOL(mem_section)=ffffffff81ef03c0
>       LENGTH(mem_section)=4096
>       SIZE(mem_section)=32
>       OFFSET(mem_section.section_mem_map)=0
>       SIZE(page)=56
>       SIZE(pglist_data)=212416
>       SIZE(zone)=34496
>       SIZE(free_area)=88
>       SIZE(list_head)=16
>       SIZE(nodemask_t)=64
>       OFFSET(page.flags)=0
>       OFFSET(page._count)=8
>       OFFSET(page.mapping)=24
>       OFFSET(page.lru)=40
>       OFFSET(pglist_data.node_zones)=0
>       OFFSET(pglist_data.nr_zones)=212288
>       OFFSET(pglist_data.node_start_pfn)=212312
>       OFFSET(pglist_data.node_spanned_pages)=212328
>       OFFSET(pglist_data.node_id)=212336
>       OFFSET(zone.free_area)=32864
>       OFFSET(zone.vm_stat)=34032
>       OFFSET(zone.spanned_pages)=34344
>       OFFSET(free_area.free_list)=0
>       OFFSET(list_head.next)=0
>       OFFSET(list_head.prev)=8
>       OFFSET(vm_struct.addr)=8
>       LENGTH(zone.free_area)=11
>
>       SYMBOL(log_buf)=ffffffff81a37210
>       SYMBOL(log_end)=ffffffff81d5b820
>       SYMBOL(log_buf_len)=ffffffff81a37208
>       SYMBOL(logged_chars)=ffffffff81ddb920
>       LENGTH(free_area.free_list)=5
>       NUMBER(NR_FREE_PAGES)=0
>       NUMBER(PG_lru)=5
>       NUMBER(PG_private)=11
>       NUMBER(PG_swapcache)=16
>       SYMBOL(phys_base)=ffffffff81a2d010
>       SYMBOL(init_level4_pgt)=ffffffff81a25000
>       SYMBOL(node_data)=ffffffff81b9cda0
>       LENGTH(node_data)=512
>       CRASHTIME=1313859299
>     offset_note: 1040
>     size_note: 1780
>     num_prstatus_notes: 1
>     notes_buf: 2cc4000
>     notes[0]: 2cc4000
>     NT_PRSTATUS_offset: 1040
>
>   data_offset: 122000
>   block_size: 4096
>   block_shift: 12
>   bitmap: 7fa5296fc010
>   bitmap_len: 1179648
>   dumpable_bitmap: 7fa528890010
>   byte: 0
>   bit: 0
>   compressed_page: 2cdeb30
>   curbufptr: 0
>
>   page_cache_hdr[0]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cceb20
>     pg_hit_count: 0
>   page_cache_hdr[1]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2ccfb20
>     pg_hit_count: 0
>   page_cache_hdr[2]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd0b20
>     pg_hit_count: 0
>   page_cache_hdr[3]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd1b20
>     pg_hit_count: 0
>   page_cache_hdr[4]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd2b20
>     pg_hit_count: 0
>   page_cache_hdr[5]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd3b20
>     pg_hit_count: 0
>   page_cache_hdr[6]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd4b20
>     pg_hit_count: 0
>   page_cache_hdr[7]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd5b20
>     pg_hit_count: 0
>   page_cache_hdr[8]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd6b20
>     pg_hit_count: 0
>   page_cache_hdr[9]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd7b20
>     pg_hit_count: 0
>   page_cache_hdr[10]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd8b20
>     pg_hit_count: 0
>   page_cache_hdr[11]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cd9b20
>     pg_hit_count: 0
>   page_cache_hdr[12]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cdab20
>     pg_hit_count: 0
>   page_cache_hdr[13]:
>     pg_flags: 0 ()
>     pg_addr: 0
>
>     pg_bufptr: 2cdbb20
>     pg_hit_count: 0
>   page_cache_hdr[14]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cdcb20
>     pg_hit_count: 0
>   page_cache_hdr[15]:
>     pg_flags: 0 ()
>     pg_addr: 0
>     pg_bufptr: 2cddb20
>     pg_hit_count: 0
>
>   page_cache_buf: 2cceb20
>   evict_index: 0
>   evictions: 0
>   accesses: 0
>   cached_reads: 0
>   valid_pages: 2ccc710
> crash: pv_init_ops exists: ARCH_PVOPS
> compressed kdump: phys_base: 0
> gdb vmlinux
> GNU gdb (GDB) 7.0
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show
> copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
>
> cpu_possible_map: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> cpu_present_map: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> cpu_online_map: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> base kernel version: 2.6.32
> verify_namelist:
> dumpfile /proc/version:
> Linux version 2.6.32-131.0.15.el6.exp10.bz16586.x86_64
> (root@xxxxxxxxxxxxxxxxxxxx) (gcc version 4.4.5 20110214 (Red Hat
> 4.4.5-6) (GCC) ) #1 SMP Thu Jun 16 13:13:45 EDT 2011
> vmlinux:
> Linux version 2.6.32-131.0.15.el6.exp10.bz16586.x86_64
> (root@xxxxxxxxxxxxxxxxxxxx) (gcc version 4.4.5 20110214 (Red Hat
> 4.4.5-6) (GCC) ) #1 SMP Thu Jun 16 13:13:45 EDT 2011
>
> crash: page excluded: kernel virtual address: ffffffff81bb3b00 type:
> "cpu number (per_cpu)"
> crash: get_cpus_present: present: 16
> crash: page excluded: kernel virtual address: ffffffff81bb3b00 type:
> "cpu number (per_cpu)"
> crash: get_cpus_present: present: 16
> IRQ stack link register: undetermined
> PAGESIZE=4096
> mem_section_size = 32768
> NR_SECTION_ROOTS = 4096
> NR_MEM_SECTIONS = 524288
> SECTIONS_PER_ROOT = 128
> SECTION_ROOT_MASK = 0x7f
> PAGES_PER_SECTION = 32768
> node_online_map: [3, 0, 0, 0, 0, 0, 0, 0] -> nodes online: 2
>
> node_table[0]:
>   id: 0
>   pgdat: ffff880000020040
>   size: 0
>   present: 0
>   mem_map: ffffea0000000000
>   start_paddr: 0
>   start_mapnr: 0
> WARNING: sparsemem: invalid section number: 137438888923
> WARNING: sparsemem: invalid section number: 137438888923
> crash: invalid kernel virtual address: 0 type: "readstring
> characters"
> crash: invalid kernel virtual address: 0 type: "readstring
> characters"
> node_table[1]:
>   id: 1
>   pgdat: ffff880280000040
>   size: 2097152
>   present: 2097152
>   mem_map: ffffea0008c00000
>   start_paddr: 280000000
>   start_mapnr: 2621440
> NOTE: page_hash_table does not exist in this kernel
> ^Mplease wait... (gathering kmem slab cache data)
> kmem_cache_downsize: SIZE(kmem_cache_s): 36968
> cache_cache.buffer_size: 32896
> kmem_cache_downsize: nr_node_ids: 2
> ^M ^MNOTE: unwind_table structure has changed, or does not exist in
> this kernel
> init_unwind_table: DWARF_UNWIND_EH_FRAME
> ^Mplease wait... (gathering module symbol data)^M ^M^Mplease wait...
> (gathering task table data)^M ^Mcrash: get_cpus_online: online: 16
> ^Mplease wait... (determining panic task)
> crash: get_active_set_panic_task: failed
>
>
> Thanks,
>
> -- Joe Lawrence
> --
> Crash-utility mailing list
> Crash-utility@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/crash-utility
>

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility