On Thu, 2008-02-07 at 14:40 -0500, Dave Anderson wrote:
> Andrew Hecox wrote:
> > On Thu, 2008-02-07 at 11:27 -0500, Dave Anderson wrote:
> >> Andrew Hecox wrote:
> >>> On Thu, 2008-02-07 at 10:32 -0500, Dave Anderson wrote:
> >>>> Andrew Hecox wrote:
> >>>>> hello,
> >>>>>
> >>>>> I'm looking at a customer issue where diskdumpmsg is unable to read a
> >>>>> vmcore file. It is not clear if this is a problem with the vmcore file or
> >>>>> diskdumpmsg. I can load the vmcore with crash and in my naive usage of
> >>>>> it, can see no problems. However, I'm new to the tool so that doesn't
> >>>>> give me a lot of confidence.
> >>>>>
> >>>>> Does anyone have any suggestions on how or if I can use crash to help
> >>>>> determine if there's corruption in the vmcore file? Or any other way of
> >>>>> approaching the problem?
> >>>>>
> >>>>> Thanks much,
> >>>>>
> >>>>> Andrew
> >>>>>
> >>>> I'm not sure what you expect the crash utility to do -- if it comes
> >>>> up to a prompt with no error or warning messages, it means that the
> >>>> ELF header contains what appears to be valid usable information,
> >>>> and that the minimum kernel memory contents required to set up the
> >>>> crash utility's notion of the running system are all in place. That's
> >>>> not to say that there is no chance that the vmcore contains some
> >>>> corruption that was not recognized.
> >>>>
> >>> Thanks. Any other suggestions on how to determine if a vmcore is "valid"
> >>> or is that not even a reasonable question to try and ask? The problem
> >>> I'm trying to solve is described better below:
> >>>
> >>>> With respect to diskdumpmsg, as I understand it, it was fairly recently
> >>>> changed from a perl script to a C file so that it could be run
> >>>> earlier in time so as to be able to use the swap partition. Looking
> >>>> at main() in the diskdumpmsg.c file (version 1.4.1-2), there are numerous
> >>>> error types and associated error messages.
> >>>> What do you mean when you
> >>>> say that "diskdumpmsg is unable to read a vmcore file"?
> >>> Specifically:
> >>>
> >>> - user reported a floating point exception from diskdump on startup
> >>> - the result was reproducible locally but only with their vmcore file
> >>> - fpe occurred in get_logbuf:
> >>>       log_end %= log_buf_len;
> >>> - log_buf_len had been set to 0 in read_buffer
> >>>       if (!page_is_dumpable(pfn, dump->device)) {
> >>>               memset(buf, 0, copy_len);
> >>>       } else {
> >>> - I don't know enough to say if the page really wasn't dumpable.
> >>>       static inline bool page_is_dumpable(unsigned int nr, DumpDevice *device)
> >>>       {
> >>>               return device->dumpable_bitmap[nr>>3] & (1 << (nr & 7));
> >>>       }
> >>> - I wrote a patch with one way to avoid the FPE (attached) and sent it
> >>>   to SEG.
> >>>
> >>> Now I'm trying to determine if the vmcore file should be readable by
> >>> diskdumpmsg. In other words, is this a problem in diskdumpmsg post-crash
> >>> or a problem with the vmcore file prior to it getting to diskdumpmsg.
> >>> Unfortunately, I don't understand the problem domain very well at all,
> >>> hence the probably naive questions :)
> >>>
> >>> Any suggestions are appreciated.
> >>>
> >>> -Andrew
> >> So it appears that the page containing the log_buf_len symbol is not
> >> readable or contained in the dumpfile. BTW, is this a compressed
> >> dumpfile or an ELF formatted dumpfile? And what "dump_level" did
> >> they configure?
> >>
> >
> > compressed, level is 19.
> >
> >> Anyway, back to the log_buf_len symbol read, what happens when you
> >> enter the "log" command while in a crash session? It attempts to
> >> read that symbol immediately.
> >>
> >
> > I get what appears to be a full and valid dump of the kernel message
> > buffer.
> >
>
> The crash utility has the same page_is_dumpable() function, which I presume
> looks at precisely the same bitmap data from the dumpfile. And that
> must be working, given that the "log" command works as expected.
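
For reference, the failure mode discussed above can be reproduced in
isolation. What follows is a minimal standalone sketch, not the
diskdumpmsg or crash source. It assumes a dumpable bitmap with one bit
per page frame, as in the quoted page_is_dumpable(), and shows how a
zero-filled read of log_buf_len turns into a modulo by zero unless it is
checked first. The reduced DumpDevice layout, the nr_pages field,
wrap_log_end() and its -1 error convention are all hypothetical names
introduced here for illustration; the attached patch may well take a
different approach.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the DumpDevice type quoted above. */
typedef struct {
    const unsigned char *dumpable_bitmap; /* one bit per page frame */
    size_t nr_pages;                      /* assumed: number of valid bits */
} DumpDevice;

/* Same bit test as the quoted page_is_dumpable(), plus a bounds check. */
static inline bool page_is_dumpable(unsigned int nr, const DumpDevice *device)
{
    if (nr >= device->nr_pages)
        return false;
    return (device->dumpable_bitmap[nr >> 3] & (1u << (nr & 7))) != 0;
}

/*
 * Hypothetical guard around the "log_end %= log_buf_len" step.
 * A page that is not dumpable is read back as zeros, so log_buf_len
 * can come back as 0; an integer modulo by zero then raises SIGFPE.
 * Returns 0 on success, -1 (leaving *log_end untouched) otherwise.
 */
static inline int wrap_log_end(unsigned long *log_end, int log_buf_len)
{
    if (log_buf_len <= 0)
        return -1;
    *log_end %= (unsigned long)log_buf_len;
    return 0;
}
```

With a check of this kind the caller can report that log_buf_len could
not be read instead of dying with an FPE. It does not answer the other
question, which is why the page holding log_buf_len was marked not
dumpable in the first place.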
>
> One difference is that diskdumpmsg uses /boot/System.map-<release> for
> the symbol values, whereas crash uses the vmlinux file. It might be
> of interest to determine whether the value of "log_buf_len" used by
> diskdumpmsg is the same symbol value as used by crash.
>

I get the same:

(/boot/System.map-2.6.9-67.0.1.ELhugemem)
02323bd8 d log_buf_len

(/usr/lib/debug/lib/modules/2.6.9-67.0.1.ELhugemem/vmlinux)
$1 = (int *) 0x2323bd8

-Andrew

> Dave
>
>

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility