Re: how to use crash utility to parse the binary memory dump

Dave Anderson <anderson@xxxxxxxxxx> · Fri, 25 Jul 2014 15:55:14 -0400 (EDT)

----- Original Message -----
> 
> Modified patch attached. It is rebased to latest crash version.
> The arguments are in the form of ordered pair as you had mentioned. I
> have tested it with arm and armv8 ramdumps.
> 
> Do we really need  dump_ramdump_def ? As the dump is converted to
> kdump and we use the kdump flag in pc->flags, help -D and help -n
> works fine using kdump dump functions. Did I miss something ?
> 
> I will send you the link to arm64 ramdump in another email.
> 
> Thanks,
> Vinayak

I tested your latest patch on the sample ARM and ARM64 RAM dumps
you sent me.

As far as the patch itself is concerned, I ran into a problem
where if crash is invoked in a directory where it does not have
write permission, the session hangs trying to write to a bad file 
descriptor -- because of this:

        fd2 = open(out_elf, O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
        if (!fd2) {
                error(INFO, "%s open error\n", out_elf);
                goto end1;
        }

open(2) returns -1 on failure, never 0, so the "!fd2" test never
catches the error.  It should be "if (fd2 < 0)".
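
To make the point concrete, here is a minimal sketch of the corrected
check.  The helper name open_elf_output() is made up for illustration
and is not part of the patch; it reports the failure with perror(3)
rather than crash's error() routine so that it stands alone:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the patch: open the ELF output
 * file and detect failure correctly.  open(2) returns -1 on error,
 * and 0 is a perfectly valid descriptor, so "!fd" is the wrong test.
 */
static int open_elf_output(const char *out_elf)
{
        int fd = open(out_elf, O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);

        if (fd < 0) {
                perror(out_elf);        /* e.g. EACCES in a read-only directory */
                return -1;
        }
        return fd;
}
```

With this form, a caller in a directory without write permission gets
-1 back and can bail out instead of writing to a bad descriptor.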

But more to the point, in my earlier response, I had suggested this: 

> With respect to the [-o output_file], and given the potential 
> simplicity of the argument string, I think it should be
> optional.  You could do something like this in the getopt() 
> handler, and have the ELF output_file name pre-stored in the 
> ramdump_def structure:
> 
> +              case 'o':
> +                       ramdump_elf_output_file(optarg);
> +                       break;
> 
> If "-o output_file" is NOT used, then ramdump_to_elf() can
> pass back the name of a temporary file.

I should have been more clear w/respect to "a temporary file".
What I was suggesting was that you do something like using
mkstemp(3) to create a temporary file in /var/tmp, and then
unlink() it immediately so it would only exist until the crash
session ends.

As for the sample dumps themselves, the 32-bit ARM dumpfile can be
analyzed OK, but as you noted, the ARM64 dump requires "--cpus 4"
to come up OK, which really should not be required.

Investigating the reason for the "--cpus 4" requirement, it's
helpful to compare the two dumps.  With your sample 32-bit ARM
dumpfile, although it comes up OK with 4 cpus, note that only
cpu 1 is marked online in the kernel:

  crash> help -k
  ...
       cpu_possible_map: 0 1 2 3 
        cpu_present_map: 0 1 2 3 
         cpu_online_map: 1 
         cpu_active_map: 0 1 2 3
  ...

The 32-bit arm.c arm_get_smp_cpus() function calculates the 
number of cpus like this:

  return MAX(get_cpus_active(), get_cpus_online());

so it returns 4 since the "active" map shows all 4 cpus.

The "ps" command shows tasks associated with all 4 cpus, and the 
runqueues look like this, where cpus 0 and 2 have their idle
task running, and cpus 1 and 3 have user-mode tasks running:

  crash> runq
  CPU 0 RUNQUEUE: c0f286c0
    CURRENT: PID: 0      TASK: c0a5d8b0  COMMAND: "swapper/0"
    RT PRIO_ARRAY: c0f287a0
       [no tasks queued]
    CFS RB_ROOT: c0f28730
       [no tasks queued]

  CPU 1 RUNQUEUE: c0f316c0
    CURRENT: PID: 13429  TASK: db944580  COMMAND: "AudioIn_5F8"
    RT PRIO_ARRAY: c0f317a0
       [no tasks queued]
    CFS RB_ROOT: c0f31730
       [120] PID: 474    TASK: d9a36ac0  COMMAND: "kworker/1:1"
       [120] PID: 2890   TASK: c89b2580  COMMAND: "sh"

  CPU 2 RUNQUEUE: c0f3a6c0
    CURRENT: PID: 0      TASK: db63a040  COMMAND: "swapper/2"
    RT PRIO_ARRAY: c0f3a7a0
       [no tasks queued]
    CFS RB_ROOT: c0f3a730
       [112] PID: 1599   TASK: db87d040  COMMAND: "mm_device_threa"

  CPU 3 RUNQUEUE: c0f436c0
    CURRENT: PID: 1949   TASK: db951040  COMMAND: "WindowManager"
    RT PRIO_ARRAY: c0f437a0
       [no tasks queued]
    CFS RB_ROOT: c0f43730
       [no tasks queued]
  crash> 

So it does seem that whatever mechanism you use to take the
raw RAM dump on the 32-bit ARM offlines cpus first?

Now, on the ARM64 dumpfile, if I force it to come up with "--cpus 4"
it shows that only cpu 0 is online, present and active:

  crash> help -k
  ...
       cpu_possible_map: 0 1 2 3 
        cpu_present_map: 0 
         cpu_online_map: 0 
         cpu_active_map: 0
  ...

I can understand that perhaps cpus are offlined prior to taking
the RAM dump, but it's strange that the "present" and "active"
maps are also the same as the "online" map?

Currently the arm64.c arm64_get_smp_cpus() returns the number of
cpus like this:

  return MAX(get_cpus_online(), get_highest_cpu_online()+1);

so it returns 1.  Even if it did it the same as the 32-bit ARM,
it would still return 1 because of the active map.  
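
To spell out the arithmetic, here is a small stand-alone sketch that
applies both formulas to plain bitmasks, where bit N set means cpu N
is in the map.  These are not crash's actual get_cpus_*() functions,
which read the kernel's cpu maps from the dumpfile; all of the
function names below are made up for illustration:

```c
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Number of cpus in a map, i.e. the number of set bits. */
static int cpus_in_map(uint32_t map)
{
        int n = 0;

        for (; map; map &= map - 1)     /* clear the lowest set bit */
                n++;
        return n;
}

/* Highest cpu number present in a map, or -1 if the map is empty. */
static int highest_cpu_in_map(uint32_t map)
{
        int cpu = -1;

        for (; map; map >>= 1)
                cpu++;
        return cpu;
}

/* Same shape as the 32-bit ARM calculation: MAX(active, online). */
static int arm_style_cpus(uint32_t active, uint32_t online)
{
        return MAX(cpus_in_map(active), cpus_in_map(online));
}

/* Same shape as the ARM64 calculation: MAX(online, highest online + 1). */
static int arm64_style_cpus(uint32_t online)
{
        return MAX(cpus_in_map(online), highest_cpu_in_map(online) + 1);
}
```

With the 32-bit ARM maps (active 0xf, online 0x2) the ARM-style
calculation gives 4.  With the ARM64 maps (active 0x1, online 0x1)
both calculations give 1.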

So we have to force it to return 4 with "--cpus 4".  But having done
that, oddly enough, the "runq" command shows this, where the "CURRENT"
task on cpu 0 is "0":

  crash> runq
  CPU 0 RUNQUEUE: ffffffc03ffb6e40
    CURRENT: 0
    RT PRIO_ARRAY: ffffffc03ffb6fb0
       [no tasks queued]
    CFS RB_ROOT: ffffffc03ffb6f10
       [no tasks queued]

  CPU 1 RUNQUEUE: ffffffc03ffc1e40
    CURRENT: PID: 0      TASK: ffffffc03ecb4b00  COMMAND: "swapper/1"
    RT PRIO_ARRAY: ffffffc03ffc1fb0
       [no tasks queued]
    CFS RB_ROOT: ffffffc03ffc1f10
       [no tasks queued]

  CPU 2 RUNQUEUE: ffffffc03ffcce40
    CURRENT: PID: 0      TASK: ffffffc03ecb5dc0  COMMAND: "swapper/2"
    RT PRIO_ARRAY: ffffffc03ffccfb0
       [no tasks queued]
    CFS RB_ROOT: ffffffc03ffccf10
       [no tasks queued]

  CPU 3 RUNQUEUE: ffffffc03ffd7e40
    CURRENT: PID: 0      TASK: ffffffc03ecf0000  COMMAND: "swapper/3"
    RT PRIO_ARRAY: ffffffc03ffd7fb0
       [no tasks queued]
    CFS RB_ROOT: ffffffc03ffd7f10
       [no tasks queued]
  crash> 

I have never seen this before.  As I understand it, if no other
task is queued and run on a cpu, then it defaults to the idle/swapper
task for that cpu, whose address is hard-wired in the per-cpu runqueue
structure.  But if I look at the rq structure for cpu 0, not only is
the "curr" task pointer NULL, the "idle" task pointer is NULL as well:

  crash> rq.curr,idle,cpu ffffffc03ffb6e40
    curr = 0x0
    idle = 0x0
    cpu = 0
  crash>

whereas the other 3 cpus show that they are running their idle tasks:

  crash> rq.curr,idle,cpu ffffffc03ffc1e40
    curr = 0xffffffc03ecb4b00
    idle = 0xffffffc03ecb4b00
    cpu = 1
  crash> rq.curr,idle,cpu ffffffc03ffcce40
    curr = 0xffffffc03ecb5dc0
    idle = 0xffffffc03ecb5dc0
    cpu = 2
  crash> rq.curr,idle,cpu ffffffc03ffd7e40
    curr = 0xffffffc03ecf0000
    idle = 0xffffffc03ecf0000
    cpu = 3
  crash>

Perhaps it has something to do with *when* you took the dump.
The "sys" command shows an UPTIME of 00:00:00:

  crash> sys
        KERNEL: /home/anderson/Downloads/tmp_ARM64/vmlinux
      DUMPFILE: ramdump_elf
          CPUS: 4
          DATE: Wed Dec 31 19:00:00 1969
        UPTIME: 00:00:00
  LOAD AVERAGE: 0.00, 0.00, 0.00
         TASKS: 34
      NODENAME: (none)
       RELEASE: 3.10.33+
       VERSION: #22 SMP PREEMPT Tue May 6 16:23:34 IST 2014
       MACHINE: aarch64  (unknown Mhz)
        MEMORY: 1 GB
         PANIC: ""
  crash> 

And the "ps" command doesn't show any user-space tasks running,
not even "init" PID 1, and the funky idle/swapper task on cpu 0
shows a PID of 1:

  crash> ps
     PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  >     0     -1   1  ffffffc03ecb4b00  RU   0.0       0      0  [swapper/1]
  >     0     -1   2  ffffffc03ecb5dc0  RU   0.0       0      0  [swapper/2]
  >     0     -1   3  ffffffc03ecf0000  RU   0.0       0      0  [swapper/3]
        1     -1   0  ffffffc03ec78000  UN   0.0       0      0  [swapper/0]
        2     -1   0  ffffffc03ec792c0  IN   0.0       0      0  [kthreadd]
        3      2   0  ffffffc03ec7a580  IN   0.0       0      0  [ksoftirqd/0]
        4      2   0  ffffffc03ec7b840  IN   0.0       0      0  [kworker/0:0]
        5      2   0  ffffffc03ec7cb00  IN   0.0       0      0  [kworker/0:0H]
        6      2   0  ffffffc03ec7ddc0  IN   0.0       0      0  [kworker/u8:0]
        7      2   0  ffffffc03ecb0000  IN   0.0       0      0  [migration/0]
        8      2   0  ffffffc03ecb12c0  IN   0.0       0      0  [rcu_preempt]
        9      2   0  ffffffc03ecb2580  IN   0.0       0      0  [rcu_bh]
       10      2   0  ffffffc03ecb3840  IN   0.0       0      0  [rcu_sched]
       11      2   1  ffffffc03ecf12c0  ??   0.0       0      0  [migration/1]
       12      2   1  ffffffc03ecf2580  ??   0.0       0      0  [ksoftirqd/1]
       13      2   1  ffffffc03ecf3840  IN   0.0       0      0  [kworker/1:0]
       14      2   1  ffffffc03ecf4b00  IN   0.0       0      0  [kworker/1:0H]
       15      2   2  ffffffc03ecf5dc0  ??   0.0       0      0  [migration/2]
       16      2   2  ffffffc03ed20000  IN   0.0       0      0  [ksoftirqd/2]
       17      2   0  ffffffc03ed212c0  UN   0.0       0      0  [kworker/2:0]
       18      2   2  ffffffc03ed22580  IN   0.0       0      0  [kworker/2:0H]
       19      2   3  ffffffc03ed23840  IN   0.0       0      0  [migration/3]
       20      2   3  ffffffc03ed24b00  IN   0.0       0      0  [ksoftirqd/3]
       21      2   3  ffffffc03ed25dc0  IN   0.0       0      0  [kworker/3:0]
       22      2   3  ffffffc03ed40000  IN   0.0       0      0  [kworker/3:0H]
       23      2   0  ffffffc03ed412c0  IN   0.0       0      0  [khelper]
       24      2   0  ffffffc03ed42580  IN   0.0       0      0  [kdevtmpfs]
       25      2   0  ffffffc03ed43840  IN   0.0       0      0  [kworker/u8:1]
       56      2   0  ffffffc03ededdc0  IN   0.0       0      0  [bcm_ipc_ch0]
       57      2   0  ffffffc03edecb00  IN   0.0       0      0  [bcm_ipc_ch11]
      180      2   0  ffffffc03ee5ddc0  IN   0.0       0      0  [writeback]
      182      2   0  ffffffc03ee912c0  IN   0.0       0      0  [bioset]
      184      2   0  ffffffc03ede92c0  IN   0.0       0      0  [kworker/u9:0]
      185      2   0  ffffffc03ede8000  IN   0.0       0      0  [kblockd]
  crash> 

So I'm guessing that this dumpfile was taken before the "init" task was even
created, and the kernel data structures were not fully initialized?

Maybe you can try taking a RAM dump on an ARM64 machine after
it is up and running?

Thanks,
  Dave
