Re: Unable to switch stack frames while using crash

"Shashidhara Shamaiah" <Shashidhara.Shamaiah@xxxxxxxxxxx> · Thu, 23 Jun 2011 16:59:35 +0530

Hi Dave,

Thanks for the help. 

I have some questions regarding kdump and the crash utility.

I am using the crash utility to analyze a vmcore captured after an oops
at a customer site. The oops report is below:

[345132.723424] BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
[345132.724928] IP: [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
[345132.726100] PGD 2c8e03067 PUD 2cbd88067 PMD 0
[345132.727187] Oops: 0000 [#1] SMP
[345132.727879] last sysfs file: /sys/block/loop7/dev
[345132.728935] CPU 1
[345132.729396] Modules linked in: xt_tcpudp iptable_filter ip_tables x_tables strmfs_mod bond0 ipmi_devintf hpwdt sctp ipv6 crc32c libcrc32c loop ipmi_si tpm_tis ipmi_msghandler hpilo tpm tpm_bios psmouse serio_raw shpchp pci_hotplug container processor evdev ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom ide_pci_generic ide_core usbhid hid ata_piix ata_generic libata ehci_hcd bnx2 uhci_hcd e1000e cciss scsi_mod button thermal fan thermal_sys edd [last unloaded: scsi_wait_scan]
[345132.739511] Pid: 13366, comm: telnet Not tainted 2.6.32-cdma-18-amd64 #1 ProLiant DL380 G6
[345132.741423] RIP: 0010:[<ffffffff811f03b3>]  [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
[345132.743220] RSP: 0018:ffff88031ce75da8  EFLAGS: 00010246
[345132.744469] RAX: 0000000000000000 RBX: ffff8802cbd54a68 RCX: 000000000061c044
[345132.746061] RDX: 0000000000000005 RSI: ffff88031ce75e87 RDI: ffff8802cbd54d1c
[345132.747726] RBP: ffff88031ce75eb8 R08: 0000000000000000 R09: 0000000000000000
[345132.749391] R10: 0000000000616680 R11: 0000000000000246 R12: 000000000061c044
[345132.750981] R13: ffff8802cbd54800 R14: 0000000000000000 R15: 7fffffffffffffff
[345132.752650] FS:  00007ffff7fee6f0(0000) GS:ffff880033020000(0000) knlGS:0000000000000000
[345132.754569] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[345132.755915] CR2: 0000000000000005 CR3: 000000030c408000 CR4: 00000000000006e0
[345132.757579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[345132.759169] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[345132.760778] Process telnet (pid: 13366, threadinfo ffff88031ce74000, task ffff88031b60d580)
[345132.762707] Stack:
[345132.763162]  ffff88031b60d580 ffff88031b60d580 ffff88031b60d580 ffff88031b60d580
[345132.764791] <0> 000000000061c02b 0000000000000000 0000000000000000 000000000061c02a
[345132.766510] <0> ffff8802de651a40 ffff8802cbd549c0 ffff8802cbd54c90 ffff8802cbd54d1c
[345132.768270] Call Trace:
[345132.768877]  [<ffffffff81045f84>] ? default_wake_function+0x0/0xf
[345132.770309]  [<ffffffff811ebf7e>] tty_read+0x7d/0xba
[345132.771526]  [<ffffffff810ebcc8>] vfs_read+0xab/0x167
[345132.772541]  [<ffffffff810ebe48>] sys_read+0x47/0x6f
[345132.773526]  [<ffffffff8100bbc2>] system_call_fastpath+0x16/0x1b
[345132.774652] Code: 00 41 8b 85 5c 02 00 00 48 8b 9d 78 ff ff ff f0 0f b3 03 45 19 f6 49 63 95 5c 02 00 00 49 8b 85 50 02 00 00 48 8b bd 48 ff ff ff <0f> be 1c 10 e8 fc 6b 0e 00 48 89 c6 41 8b 85 5c 02 00 00 41 ff
[345132.778840] RIP  [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
[345132.780107]  RSP <ffff88031ce75da8>
[345132.780969] CR2: 0000000000000005
[345132.781786] hpwdt: New timer passed in is 120 seconds.
[345132.782942] hpwdt: timer reset to 120 for kdump

After analysis, we determined that the crash occurs in the function
n_tty_read() in kernel-source/drivers/char/n_tty.c. The kernel is Linux
2.6.32. Below is the code fragment where the page fault occurred. The
faulting statement is c = tty->read_buf[tty->read_tail].

                        /* N.B. avoid overrun if nr == 0 */
                        while (nr && tty->read_cnt) {
                                int eol;

                                eol = test_and_clear_bit(tty->read_tail,
                                                tty->read_flags);
                                /* page fault here, per analysis of the oops */
                                c = tty->read_buf[tty->read_tail];
                                spin_lock_irqsave(&tty->read_lock, flags);
                                tty->read_tail = ((tty->read_tail+1) &
                                                  (N_TTY_BUF_SIZE-1));
                                tty->read_cnt--;
                                if (eol) {
                                        /* this test should be redundant:
                                         * we shouldn't be reading data if
                                         * canon_data is 0
                                         */
                                        if (--tty->canon_data < 0)
                                                tty->canon_data = 0;
                                }
                                spin_unlock_irqrestore(&tty->read_lock, flags);
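
The tail update in the fragment relies on N_TTY_BUF_SIZE being a power
of two, so that the mask keeps read_tail inside the buffer. A minimal
sketch of that arithmetic follows. It assumes N_TTY_BUF_SIZE is 4096, as
defined in include/linux/tty.h for this kernel series; next_tail() and
show_tail_wrap() are hypothetical helpers for illustration, not kernel
functions.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed value; the mask trick below only works for a power of two. */
#define N_TTY_BUF_SIZE 4096

/* Hypothetical helper mirroring the read_tail update in the fragment. */
int next_tail(int tail)
{
    assert(tail >= 0 && tail < N_TTY_BUF_SIZE);
    return (tail + 1) & (N_TTY_BUF_SIZE - 1);
}

/* Print a few sample transitions, including the wrap at the last slot. */
void show_tail_wrap(void)
{
    int samples[] = { 0, 5, N_TTY_BUF_SIZE - 1 };
    unsigned int i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("tail %4d -> %4d\n", samples[i], next_tail(samples[i]));
}
```

Calling show_tail_wrap() from a main() shows the last slot wrapping back
to 0, so under this assumption the index itself always stays within
0..N_TTY_BUF_SIZE-1.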

Below are the contents of the tty_struct structure (at the time of the
oops). This structure was passed as an argument to n_tty_read().

	tty_struct ffff8802cbd54800
struct tty_struct { ...
  magic = 21505,
  driver = 0xffff88031b54ea00,
  ops = 0xffffffff8130f650,
  name = "pts9\000\...",
  driver_data = 0xffff88029c8a9668,
  icanon = 1 '\001',
  read_buf = 0xffff8802cbfe6000 "",
  read_head = 0,
  read_tail = 0,
  read_cnt = 0,
  read_flags = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  canon_data = 0,
  ......................................
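
As a cross-check on the faulting statement, the address it touched can
be compared with the structure contents above. The sketch below assumes
that the faulting instruction bytes in the Code: line (<0f> be 1c 10)
decode to a sign-extending byte load from RAX + RDX, that is, that RAX
held tty->read_buf and RDX held tty->read_tail at the moment of the
fault. That reading is an interpretation of the oops, not something it
states directly. The macro names and both functions are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Copied from the oops register dump. */
#define OOPS_RAX UINT64_C(0x0000000000000000)
#define OOPS_RDX UINT64_C(0x0000000000000005)
#define OOPS_CR2 UINT64_C(0x0000000000000005)

/* Copied from the crash tty_struct output. */
#define DUMP_READ_BUF  UINT64_C(0xffff8802cbfe6000)
#define DUMP_READ_TAIL UINT64_C(0)

/* Hypothetical helper: address touched by a one-byte load of base[index]. */
uint64_t byte_load_address(uint64_t base, uint64_t index)
{
    return base + index;
}

/* Print both candidate addresses next to CR2 for comparison. */
void print_fault_comparison(void)
{
    uint64_t from_regs = byte_load_address(OOPS_RAX, OOPS_RDX);
    uint64_t from_dump = byte_load_address(DUMP_READ_BUF, DUMP_READ_TAIL);

    /* Sanity check on the assumed decoding: registers should give CR2. */
    assert(from_regs == OOPS_CR2);

    printf("CR2                  = 0x%016" PRIx64 "\n", OOPS_CR2);
    printf("RAX + RDX            = 0x%016" PRIx64 "\n", from_regs);
    printf("read_buf + read_tail = 0x%016" PRIx64 "\n", from_dump);
}
```

Under those assumptions, RAX + RDX reproduces CR2, whereas the read_buf
and read_tail values shown by crash would give an ordinary kernel
address. If the decoding is right, the fields seen in the vmcore differ
from the values the faulting instruction used.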

According to the crash utility, read_cnt was 0 when the kernel oopsed.
In that case the condition while (nr && tty->read_cnt) in the fragment
above should have been false, and the loop body should not have run.
This leads me to think that some other thread or task in the kernel
must have updated read_cnt in parallel. However, crash reports that the
runqueues of all CPUs were idle at the time of the crash, except CPU 1,
which was executing the user program telnet in kernel context (a system
call). Below is the runqueue output.

CPU 0 RUNQUEUE: ffff880033012d80
  CURRENT: PID: 0      TASK: ffffffff814204b0  COMMAND: "swapper"
  RT PRIO_ARRAY: ffff880033012e98
     [no tasks queued]
  CFS RB_ROOT: ffff880033012e10
     [no tasks queued]

CPU 1 RUNQUEUE: ffff880033032d80
  CURRENT: PID: 13366  TASK: ffff88031b60d580  COMMAND: "telnet"
  RT PRIO_ARRAY: ffff880033032e98
     [no tasks queued]
  CFS RB_ROOT: ffff880033032e10
     [no tasks queued]

CPU 2 RUNQUEUE: ffff880033052d80
  CURRENT: PID: 0      TASK: ffff88031e0e3540  COMMAND: "swapper"
  RT PRIO_ARRAY: ffff880033052e98
     [no tasks queued]
  CFS RB_ROOT: ffff880033052e10
     [no tasks queued]

CPU 3 RUNQUEUE: ffff880033072d80
  CURRENT: PID: 0      TASK: ffff88031e113580  COMMAND: "swapper"
  RT PRIO_ARRAY: ffff880033072e98
     [no tasks queued]
  CFS RB_ROOT: ffff880033072e10
     [no tasks queued]

How is this logically possible? Crash reports that no other tasks were
running. Or could some process or thread have run between the oops
being triggered and kdump capturing the memory image, and updated the
data structure in that window? I would like to know whether this
scenario is possible. I would appreciate your suggestions and guidance.
Please let me know if you need any other details.

Regards
Shashidhara

-----Original Message-----
From: crash-utility-bounces@xxxxxxxxxx [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
Sent: Tuesday, June 21, 2011 7:24 PM
To: Discussion list for crash utility usage, maintenance and development
Subject: Re:  Unable to switch stack frames while using crash

----- Original Message -----
> Hi Dave,
> 
> I updated the makedumpfile utility from 1.3.5 to 1.3.7. When I run the
> below command
> 
> makedumpfile -c -d 31 -x vmlinux_temp vmcore vmcore-new
> The kernel version is not supported.
> The created dumpfile may be incomplete.
> check_release: Can't get the kernel version.
> makedumpfile Failed.

I see that makedumpfile-1.3.8 was recently released, but it still
has a LATEST_VERSION of 2.6.36:

  #define OLDEST_VERSION          KERNEL_VERSION(2, 6, 15)/* linux-2.6.15 */
  #define LATEST_VERSION          KERNEL_VERSION(2, 6, 36)/* linux-2.6.36 */

You haven't stated what your kernel version is, but it seems
makedumpfile cannot get past this point.  On the other hand, the
compressed kdump was created, so I'm not entirely clear.

> Is there any other way to extract the ELF style vmcore file from the
> kdump compressed format. Please guide me.

I don't believe so...

But I'm not the makedumpfile maintainer, so I'd prefer not to give any
definitive answers to your questions.  I've cc'd the upstream maintainer
of makedumpfile.

Thanks,
  Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility
