Re: Question for LKCD maintainers - How about adding a debug flag to crash and only calling abort() if crash is started with '-d' flag provided?

Dave Anderson <anderson@xxxxxxxxxx> · Wed, 02 Jan 2008 10:29:24 -0500

Piet Delaney wrote:
Dave Anderson wrote:

Long after I stopped tinkering with the LKCD code in crash,
changes were contributed to support physical memory zones
in the LKCD dumpfile format.


Hi Dave:

That could easily have been me.    I added zone support to the
LKCD kernel and lcrash code and then updated your crash code
to support zones. I kinda recall LKCD not dumping in monotonically
increasing order and my modifying your crash code to live with this
new feature in the LKCD dumps. I was trying to get the LKCD folks into
supporting crash in addition to lcrash but failed to get any support from
Tom Morano or Matt Robinson. I didn't realize that I had broken crash
with the zone changes and felt responsible to fix crash to deal with this
change that I had made. I also like the crash interface over the lcrash
interface. I proposed to Tom using the elf format like KEXEC uses but
he didn't go for it. I don't know why we can't hide additional crash info
in ELF files and maintain as much compatibility as possible.



 Specifically there is this
piece of save_offset() in lkcd_common.c:

       /* find the zone */
       for (ii=0; ii < lkcd->num_zones; ii++) {
               if (lkcd->zones[ii].start == zone) {
                       if (lkcd->zones[ii].pages[page].offset != 0) {
                          if (lkcd->zones[ii].pages[page].offset != off) {
                               error(INFO, "conflicting page: zone %lld, "
                                       "page %lld: %lld, %lld != %lld\n",
                                       (unsigned long long)zone,
                                       (unsigned long long)page,
                                       (unsigned long long)paddr,
                                       (unsigned long long)off,
                                       (unsigned long long) \
                                          lkcd->zones[ii].pages[page].offset);
                               abort();
                          }
                          ret = 0;
                       } else {
                          lkcd->zones[ii].pages[page].offset = off;
                          ret = 1;
                       }
                       break;
               }
       }

The printf looks a bit like my coding style, though I don't know
why (I ?)  decided to abort() in this case. I suppose the idea is
to look at the situation with gdb on the resulting core file.



The call to abort() above kills the crash session, which is both
annoying and unnecessary.

Isn't it worthwhile to look at the core file to understand why
abort() was called?

I would think so, but not by me -- the developers of this LKCD
off-shoot can debug their own stuff.



I am seeing it in a dumpfile from a customer who has their own dumping
scheme that is based upon LKCD version 7.  I understand that this may be a
problem with their LKCD port, but nonetheless, it's the only place in
the crash utility that doesn't recover gracefully from dumpfile access
errors.

Anyway, I would like to either:

1. change the error(INFO...) to error(FATAL...) so that run-time
   commands encountering this error will just fail, and the session
   will return to the crash> prompt, or
2. return 0, so that a "seek error" can be subsequently displayed
   by the readmem() command.

Number 2 is preferable, because it yields more clues as to where the
readmem() came from, but since I don't know much about the LKCD
physical memory zones stuff, is there any reason that shouldn't
be done?
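
The difference between the two options can be sketched in a few lines of C. This is only an illustration, not crash source code: every name prefixed with sketch_ is hypothetical, and it assumes that a FATAL error unwinds to the command loop with setjmp()/longjmp(), which the text above does not state.

```c
#include <setjmp.h>
#include <stdio.h>

/* Hypothetical stand-in for the point where the crash> prompt is re-entered. */
static jmp_buf sketch_prompt_env;

enum sketch_level { SKETCH_INFO, SKETCH_FATAL };

/* Print a message; a FATAL message also unwinds back to the prompt. */
static void sketch_error(enum sketch_level level, const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    if (level == SKETCH_FATAL)
        longjmp(sketch_prompt_env, 1);
}

/* Option 1: report the conflict as FATAL, so the running command just fails. */
static int sketch_lookup_fatal(int conflict)
{
    if (conflict)
        sketch_error(SKETCH_FATAL, "conflicting page");
    return 1;
}

/* Option 2: return 0 and let the caller report a seek error. */
static int sketch_lookup_return(int conflict)
{
    return conflict ? 0 : 1;
}

/*
 * Run one pretend command.
 * Returns 0 on success, 1 if it was cut short by a FATAL error,
 * 2 if the caller reported a seek error itself.
 */
static int sketch_run_command(int use_fatal, int conflict)
{
    if (setjmp(sketch_prompt_env) != 0)
        return 1;                       /* arrived here via longjmp() */

    if (use_fatal) {
        sketch_lookup_fatal(conflict);
        return 0;
    }
    if (!sketch_lookup_return(conflict)) {
        fprintf(stderr, "seek error\n"); /* caller knows which read failed */
        return 2;
    }
    return 0;
}
```

In both variants the process survives the conflict. With option 2 the failure is reported by the code that issued the read, which is why it gives more context than option 1.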


How about having a crash debug flag and only calling abort if the
debug flag is set. You might print in the error message that the
user can force a core dump by adding a '-d' flag on invocation of
crash and sending you the core file.

Regardless of the reason behind it, the whole point is that there
was no need to abort the crash session.  If the "missing" page was
crucial to the crash session being able to run, then crash would
die on its own terms.  There are no other abort() calls in the
crash sources.

In this case, however, the page was unnecessary for analysis of
the problem.  When some commands (I forget which -- certainly
"search" for example) bumped into the page, the session would
abort() and have to be started up again.

Anyway, the abort() call was removed in version 4.0-4.9:

  - Fix for LKCD dumpfile access failures that abort() the crash session
    after displaying an error message indicating a problem with physical
    memory zones in the dumpfile.  Without the patch, the crash session
    would end immediately after displaying an error message of the sort:
    "conflicting page: zone 0, page 0: 0, 177160130 != 65536".  That
    error message will now only be displayed if the crash debug mode is 1
    or more, a readmem() "seek error" will be displayed instead, and the
    session will return to the "crash>" prompt.  (anderson@xxxxxxxxxx)

This was the patch:

--- lkcd_common.c       15 Nov 2007 15:44:38 -0000      1.29
+++ lkcd_common.c       19 Nov 2007 15:48:18 -0000      1.30
@@ -708,14 +708,15 @@
                if (lkcd->zones[ii].start == zone) {
                        if (lkcd->zones[ii].pages[page].offset != 0) {
                           if (lkcd->zones[ii].pages[page].offset != off) {
-                               error(INFO, "conflicting page: zone %lld, "
+                               if (CRASHDEBUG(1))
+                                   error(INFO, "LKCD: conflicting page: zone %lld, "
                                        "page %lld: %lld, %lld != %lld\n",
                                        (unsigned long long)zone,
                                        (unsigned long long)page,
                                        (unsigned long long)paddr,
                                        (unsigned long long)off,
                                        (unsigned long long)lkcd->zones[ii].pages[page].offset);
-                               abort();
+                               return -1;
                           }
                           ret = 0;
                        } else {
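
For readers who want to try the patched logic outside of crash, here is a minimal stand-alone model of it. It is a sketch under stated assumptions, not the real save_offset(): the struct, the fixed page count, the sketch_debug variable and the handling of an unknown zone or out-of-range page are all hypothetical. Only the 1 / 0 / -1 results for the three page cases are taken from the patch.

```c
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGES 4              /* hypothetical fixed zone size */

/* Hypothetical, much simplified stand-in for an LKCD zone. */
struct sketch_zone {
    uint64_t start;
    uint64_t offset[SKETCH_PAGES];  /* 0 means "no offset recorded yet" */
};

static int sketch_debug = 0;        /* stands in for the crash debug level */

/*
 * Record the dumpfile offset of a page.
 * Returns 1 if the offset was newly recorded, 0 if the same offset was
 * already recorded, and -1 on a conflicting offset.
 * Assumption: an unknown zone or out-of-range page also yields -1.
 */
static int sketch_save_offset(struct sketch_zone *zones, int num_zones,
                              uint64_t zone, uint64_t page, uint64_t off)
{
    int ret = -1;

    for (int ii = 0; ii < num_zones; ii++) {
        if (zones[ii].start != zone)
            continue;
        if (page >= SKETCH_PAGES)
            return -1;
        if (zones[ii].offset[page] != 0) {
            if (zones[ii].offset[page] != off) {
                /* Only complain in debug mode, and never abort(). */
                if (sketch_debug >= 1)
                    fprintf(stderr,
                            "LKCD: conflicting page: zone %llu, "
                            "page %llu: %llu != %llu\n",
                            (unsigned long long)zone,
                            (unsigned long long)page,
                            (unsigned long long)off,
                            (unsigned long long)zones[ii].offset[page]);
                return -1;
            }
            ret = 0;
        } else {
            zones[ii].offset[page] = off;
            ret = 1;
        }
        break;
    }
    return ret;
}
```

Using the numbers from the changelog entry, recording offset 65536 for zone 0, page 0 and then offering 177160130 for the same page makes the second call return -1, and the stored offset stays at 65536.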

With respect to the -d flag suggestion, if you want to drop core
then you can set the internal crash "core" variable to "on", which
will force a segmentation violation after printing the next
error message:

  crash> set core
  core: off (do NOT drop core on error message)
  crash> set core on
  core: on (drop core on error message)
  crash>

And then run the command that generates the error, say for
example, reading a non-existent physical address:

  crash> rd -p deadbeef
  [./crash] error trace: 8095503 => 8095799 => 8096ab4 => 808879c
  rd: read error: physical address: deadbeef  type: "32-bit PHYSADDR"

    808879c: __error+108
    8096ab4: readmem+1328
    8095799: display_memory+657
    8095503: cmd_rd+1558

  DROP_CORE flag set: forcing a segmentation fault
  Segmentation fault (core dumped)
  $



While I've got your attention. I'm upgrading our 2.6.12-stable kernel to
2.6.16-stable and want to start supporting core dumps. Ideally I'd like to
have core dumps that are compatible with gdb and crash. Can crash
handle the ELF core files generated by KEXEC/KCORE? Last I thought
about this I recall there being incompatibilities and it getting worse
with kernels being compiled to be relocatable and kgdb having a problem
because it wasn't aware of the relocation.

By "KEXEC/KCORE" I'm presuming you mean "kexec/kdump", but I'm
not sure what incompatibility you're referring to?

Maybe the workaround for x86 kernels whose CONFIG_PHYSICAL_START
contains a value that is greater than CONFIG_PHYSICAL_ALIGN:

  http://people.redhat.com/anderson/crash.changelog.html#4_0_4_5

Or maybe you're talking about 32-bit gdb not being able to handle
kdump-generated 64-bit ELF core files for 32-bit kernels?

Dave


--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility