On Thu, 2005-10-27 at 14:16 -0400, Dave Anderson wrote:
> Badari Pulavarty wrote:
> > On Thu, 2005-10-27 at 13:17 -0400, Dave Anderson wrote:
> > > Badari Pulavarty wrote:
> > > > > That debug output certainly seems to pinpoint the issue at
> > > > > hand, doesn't it?  Very interesting...
> > > > >
> > > > > What's strange is that the usage of the cpu_pda[i].data_offset
> > > > > by the per_cpu() macro in "include/asm-x86_64/percpu.h" is
> > > > > unchanged.
> > > > >
> > > > > It's probably something very simple going on here, but I don't
> > > > > have any more ideas at this point.
> > > >
> > > > This is the reply I got from Andi Kleen:
> > > >
> > > > -------- Forwarded Message --------
> > > > From: Andi Kleen <ak@xxxxxxx>
> > > > To: Badari Pulavarty <pbadari@xxxxxxxxxx>
> > > > Subject: Re: cpu_pda->data_offset changed recently ?
> > > > Date: Thu, 27 Oct 2005 16:58:54 +0200
> > > >
> > > > On Thursday 27 October 2005 16:53, Badari Pulavarty wrote:
> > > > > Hi Andi,
> > > > >
> > > > > I am trying to fix the "crash" utility to make it work on
> > > > > 2.6.14-rc5.  (It's running fine on 2.6.10.)  The crash utility
> > > > > reads and uses cpu_pda->data_offset values, and some change
> > > > > between 2.6.10 and 2.6.14-rc5 is causing "data_offset" to take
> > > > > on huge values - which is causing "crash" to break.
> > > > >
> > > > > I added a printk() to find out why.  As you can see from the
> > > > > following, something changed - is this expected?  Please let
> > > > > me know.
> > > >
> > > > bootmem used to allocate from the end of the direct mapping on
> > > > NUMA systems.  Now it starts at the beginning, often before the
> > > > kernel .text.  This means it is negative.  Perfectly legitimate.
> > > > crash just has to handle it.
> > > >
> > > > -Andi
> > > >
> > > > --
> > >
> > > That's what I thought it looked like, although the
> > > x8664_pda.data_offset field is an "unsigned long".  Anyway, if
> > > you take any of the per_cpu__xxx symbols from the 2.6.14 kernel
> > > and apply a cpu's data_offset, does it come up with a legitimate
> > > virtual address?
> >
> > Unfortunately, I don't know the x86-64 kernel virtual address
> > space well enough to answer your question.
> >
> > My understanding is that x86-64 kernel addresses look something
> > like:
> >
> > addr: ffffffff80101000
> >
> > But now (2.6.14-rc5) I do see addresses like:
> >
> > pgdat: 0xffff81000000e000
> >
> > which are causing read problems:
> >
> > crash: read error: kernel virtual address: ffff81000000fa90  type:
> > "pglist_data node_next"
> >
> > I am not sure what these addresses are and whether they are valid.
> > Is there a way to verify them, through gdb or /dev/kmem or
> > something like that?
> >
> > Thanks,
> > Badari
> >
> > Here is the bottom line we need to understand to fix the problem:
> >
> > 2.6.10:      pgdat: 0x1000000e000
> > 2.6.14-rc5:  pgdat: 0xffff81000000e000
>
> Exactly.
>
> On a 2.6.9 kernel, if you do an nm -Bn on the vmlinux file, you'll
> first see a bunch of "A" type absolute symbols, followed by the text
> symbols, then readonly data, data, and so on.  Eventually you'll
> bump into the per-cpu symbols:
>
> $ nm -Bn vmlinux
> 0000000000088861 A __crc_dev_mc_delete
> 000000000014bfd1 A __crc_smp_call_function
> 00000000002de2e0 A __crc___skb_linearize
> 0000000000442f14 A __crc_tty_register_device
> 000000000060e766 A __crc_tty_termios_baud_rate
> 0000000000712c54 A __crc_remove_inode_hash
> 00000000007f8e0b A __crc_xfrm_policy_alloc
> 0000000000801678 A __crc_flush_scheduled_work
> 0000000000a64d75 A __crc_neigh_changeaddr
> ... <snip> ...
> 00000000ffdf0b3d A __crc_usb_driver_release_interface
> 00000000ffe031fc A __crc_udp_proc_unregister
> 00000000ffead192 A __crc_cdrom_number_of_slots
> 00000000fff9536b A __crc_sock_no_recvmsg
> 00000000fffb8df8 A __crc_device_unregister
> ffffffff80100000 t startup_32
> ffffffff80100000 A _text
> ffffffff80100081 t reach_compatibility_mode
> ffffffff8010008e t second
> ffffffff80100100 t reach_long64
> ffffffff8010013d T initial_code
> ffffffff80100145 T init_rsp
> ffffffff80100150 T no_long_mode
> ffffffff80100f00 T pGDT32
> ffffffff80100f10 t ljumpvector
> ffffffff80100f18 T stext
> ffffffff80100f18 T _stext
> ffffffff80101000 T init_level4_pgt
> ffffffff80102000 T level3_ident_pgt
> ... <snip> ...
> ffffffff80502100 D per_cpu__init_tss
> ffffffff80502200 d per_cpu__prof_old_multiplier
> ffffffff80502204 d per_cpu__prof_multiplier
> ffffffff80502208 d per_cpu__prof_counter
> ffffffff80502220 D per_cpu__mmu_gathers
> ffffffff80503280 D per_cpu__kstat
> ffffffff80503680 d per_cpu__runqueues
> ffffffff805048e0 d per_cpu__cpu_domains
> ffffffff80504940 d per_cpu__phys_domains
> ffffffff805049a0 d per_cpu__node_domains
> ffffffff805049f8 D per_cpu__process_counts
> ffffffff80504a00 d per_cpu__tasklet_hi_vec
> ffffffff80504a08 d per_cpu__tasklet_vec
> ffffffff80504a10 d per_cpu__ksoftirqd
> ffffffff80504a80 d per_cpu__tvec_bases
> ffffffff80506b00 D per_cpu__rcu_bh_data
> ffffffff80506b60 D per_cpu__rcu_data
> ffffffff80506bc0 d per_cpu__rcu_tasklet
> ...
>
> So for any data that was specifically created per-cpu, the symbol
> above is the starting point, but to get to a given cpu's copy of
> the structure, the offset value from cpu_pda.data_offset needs to
> be applied.
>
> What I don't understand is where the 0xffff810000000000 addresses
> come into play.  Are you seeing them as actual symbols?
>
> Dave
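To spell out the arithmetic we're both describing: per_cpu()
resolves a per-cpu variable as its per_cpu__xxx symbol value plus
the cpu's data_offset, so whatever crash computes should reduce to
something like the sketch below.  per_cpu_addr() and the example
offset are my illustration only, not crash's or the kernel's
actual code:

#include <stdint.h>
#include <stdio.h>

static uint64_t per_cpu_addr(uint64_t per_cpu_symbol, uint64_t data_offset)
{
        /*
         * x8664_pda.data_offset is declared "unsigned long", but per
         * Andi's note the 2.6.14 per-cpu area can sit below the
         * kernel text, so the stored value is effectively negative
         * and prints as a huge number.  Unsigned 64-bit wraparound
         * still makes the addition land on the right virtual address.
         */
        return per_cpu_symbol + data_offset;
}

int main(void)
{
        /* per_cpu__runqueues from the nm listing above; the offset is
         * a made-up example of a "negative" 2.6.14-style value that
         * relocates the symbol into the 0xffff8100... direct mapping. */
        uint64_t sym    = 0xffffffff80503680ULL;
        uint64_t offset = 0xffff810000010000ULL - sym;

        printf("per-cpu copy at %#llx\n",
               (unsigned long long)per_cpu_addr(sym, offset));
        return 0;
}

With data_offset effectively negative, the value looks huge when
printed as unsigned, but the wrapped addition still lands in the new
0xffff8100... region - which is exactly where my pgdat addresses are
showing up.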
It looks like the 4-level page table change altered the layout, and
0xffff810000000000 is now a valid kernel virtual address.  From
Documentation/x86_64/mm.txt:

Virtual memory map with 4 level page tables:

0000000000000000 - 00007fffffffffff (=47bits) user space, different per mm
hole caused by [48:63] sign extension
ffff800000000000 - ffff80ffffffffff (=40bits) guard hole
ffff810000000000 - ffffc0ffffffffff (=46bits) direct mapping of phys. memory
ffffc10000000000 - ffffc1ffffffffff (=40bits) hole
ffffc20000000000 - ffffe1ffffffffff (=45bits) vmalloc/ioremap space
... unused hole ...
ffffffff80000000 - ffffffff82800000 (=40MB)   kernel text mapping, from phys 0
... unused hole ...
ffffffff88000000 - fffffffffff00000 (=1919MB) module mapping space

Thanks,
Badari
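P.S.  To partly answer my own earlier question about verifying these
addresses: on a live 2.6.14-rc5 system, something like
"gdb vmlinux /proc/kcore" followed by "x/gx 0xffff81000000e000"
should show whether the address is readable.  And here is a
throwaway sketch for checking an address against the map above - the
ranges are transcribed straight from mm.txt, and the physical-address
math assumes the direct mapping starts at physical 0:

#include <stdio.h>

/* Ranges transcribed from the Documentation/x86_64/mm.txt map above. */
static const char *classify(unsigned long long v)
{
        if (v >= 0xffff810000000000ULL && v <= 0xffffc0ffffffffffULL)
                return "direct mapping of phys. memory";
        if (v >= 0xffffc20000000000ULL && v <= 0xffffe1ffffffffffULL)
                return "vmalloc/ioremap space";
        if (v >= 0xffffffff80000000ULL && v <= 0xffffffff82800000ULL)
                return "kernel text mapping";
        if (v >= 0xffffffff88000000ULL && v <= 0xfffffffffff00000ULL)
                return "module mapping space";
        return "hole / not a mapped kernel region";
}

int main(void)
{
        unsigned long long pgdat = 0xffff81000000e000ULL;

        /* A direct-mapped address translates to physical by
         * subtracting the region base. */
        printf("%#llx: %s (phys %#llx)\n", pgdat, classify(pgdat),
               pgdat - 0xffff810000000000ULL);
        return 0;
}

Run on the pgdat above, this reports physical 0xe000.  The 2.6.10
value 0x1000000e000 minus the old PAGE_OFFSET (0x0000010000000000,
which that value itself suggests) is also 0xe000, so both kernels
appear to put the pgdat at the same physical location.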