Worth, Kevin wrote:
Tried running crash on a running kernel... it seems that 4.0-3.7 doesn't like my kernel. When I run crash 4.0-7.2 on a live system, it appears to have no problems with vmalloc'd module memory.

  crash 4.0-3.7
  ...
  GNU gdb 6.1
  ...
  This GDB was configured as "i686-pc-linux-gnu"...

  crash: /boot/System.map-2.6.20-17.39-custom2 and /dev/mem do not match!

  Usage:
    crash [-h [opt]][-v][-s][-i file][-d num] [-S] [mapfile] [namelist] [dumpfile]
  Enter "crash -h" for details.

  crash 4.0-7.2
  ...
  GNU gdb 6.1
  ...
  This GDB was configured as "i686-pc-linux-gnu"...

        KERNEL: vmlinux-2.6.20-17.39-custom2
      DUMPFILE: /dev/mem
          CPUS: 2
          DATE: Wed Oct 1 16:31:39 2008
        UPTIME: 04:57:53
  LOAD AVERAGE: 0.10, 0.09, 0.09
         TASKS: 95
      NODENAME: ProCurve-TMS-zl-Module
       RELEASE: 2.6.20-17.39-custom2
       VERSION: #3 SMP Wed Sep 24 10:11:03 PDT 2008
       MACHINE: i686  (2200 Mhz)
        MEMORY: 5 GB
           PID: 15801
       COMMAND: "crash"
          TASK: 47bd6030  [THREAD_INFO: 4a8a8000]
           CPU: 1
         STATE: TASK_RUNNING (ACTIVE)
  crash>

Since that seems OK (and I don't encounter the error), I'll run crash with -d7 on the dump file to hopefully expose what is wrong with either the dump or with crash. I've attached the output of crash with -d7... I'm not sure how the mailing list handles file attachments, but if needed I can paste the text (or if there is something specific I should look for, let me know and I can paste just that section).
Yeah, crash 4.0-3.7 is 2 years old, which is pretty ancient, and I'm only interested in helping out with the latest version. But according to the above, 4.0-7.2 works OK on the live system? You can do a "mod" command and it works OK?

Sometimes on larger-memory systems, running live using /dev/mem, you might see the "WARNING: cannot access vmalloc'd module" message because the physical memory that is backing the vmalloc'd virtual address is in highmem, and cannot be accessed by /dev/mem. In any case, it appears that the module structures have all been read successfully on your live system. And that's kind of bothersome, because for all practical purposes, the crash utility doesn't care where it's getting the physical memory from (i.e., from /dev/mem or from the dumpfile). If it works on the live system, it should work with the dumpfile.

Anyway, looking at the crash.log, here's what's happening. Everything was running fine until the module initialization step. The list of installed kernel modules is headed up by the "modules" list_head symbol at 403c63a4, which contains a pointer to the first module structure at vmalloc address f9088280:

  ...
  <readmem: 403c63a4, KVADDR, "modules", 4, (FOE), 83ff8cc>
  please wait... (gathering module symbol data)
  module: f9088280

The readmem() of that first module -- and the very first vmalloc address -- at f9088280 required a page table translation:

  <readmem: f9088280, KVADDR, "module struct", 1536, (ROE|Q), 842a5e0>
  <readmem: 4044b000, KVADDR, "pgd page", 32, (FOE), 845a308>
  <readmem: 6000, PHYSADDR, "pmd page", 4096, (FOE), 845b310>
  <readmem: 1d515000, PHYSADDR, "page table", 4096, (FOE), 845c318>

That readmem() appears to have worked, because crash believes it successfully read the module struct at that address.
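Those three reads are consistent with an x86 PAE walk: the 32-byte "pgd page" is the four 8-byte top-level entries, followed by a 4 KB pmd page and a 4 KB page table (PAE is also what you'd expect on an i686 box with 5 GB of memory). As a rough sketch of how such an address decomposes, assuming the standard 2/9/9/12-bit PAE split (this is illustrative, not crash's actual translation code):

```python
# Sketch: split an x86 PAE virtual address into its page-table indices.
# Assumes the standard 2/9/9/12-bit PAE split, which matches the 32-byte
# pgd read (4 entries x 8 bytes) and the 4 KB pmd/pte pages in the -d7 log.

def pae_indices(vaddr):
    """Return (pgd, pmd, pte, offset) indices for a 32-bit PAE address."""
    pgd = (vaddr >> 30) & 0x3        # 4-entry page directory pointer table
    pmd = (vaddr >> 21) & 0x1ff      # 512-entry page middle directory
    pte = (vaddr >> 12) & 0x1ff      # 512-entry page table
    offset = vaddr & 0xfff           # byte offset within the 4 KB page
    return pgd, pmd, pte, offset

# The first module structure's vmalloc address from the log:
print(pae_indices(0xf9088280))
```

Each index selects one entry at the corresponding level, so a wrong entry anywhere along that chain silently redirects the read to the wrong physical page -- which is exactly the kind of failure suspected below.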
But when it pulled out the address of the *next* module in the linked list, it read this:

  module: fffffffc

And when it tried to read that bogus address, it failed, which led to the WARNING message:

  <readmem: fffffffc, KVADDR, "module struct", 1536, (ROE|Q), 842a5e0>
  <readmem: 7000, PHYSADDR, "page table", 4096, (FOE), 845c318>
  crash: invalid kernel virtual address: fffffffc  type: "module struct"
  WARNING: cannot access vmalloc'd module memory
  ...

Although I cannot say for sure, I'm presuming that the initial read of the module structure at f9088280 ended up reading from the wrong location and therefore read garbage. You can verify that by bringing up a dumpfile session and doing this:

  crash> module f9088280

It *should* display something that is recognizable as a module structure. For example:

  crash> mod | grep ext3
  f8899080  ext3  123593  (not loaded)  [CONFIG_KALLSYMS]
  crash> module f8899080
  struct module {
    state = MODULE_STATE_LIVE,
    list = {
      next = 0xf8854a84,
      prev = 0xf8876984
    },
    name = "ext3"
    mkobj = {
      kobj = {
        k_name = 0xf88990cc "ext3",
        name = "ext3",
        kref = {
          refcount = {
            counter = 2
          }
        },
  ...

Your attempt will probably show the fffffffc in the list_head just after the "state" field at the top, along with a bunch of other garbage.

And as I suggested in my first reply, can you also verify that user virtual address translations fail as well? I suggested pulling a sample virtual address out of the current context's ("bash") VM, but doing that may "look" like it's working while actually doing it incorrectly, so you also need to verify the data that it finds there. One way to do that is to read the beginning of the /bin/bash text segment and look for the "ELF" string.
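To see why one bad read of the first struct derails the whole walk: each list_head.next points at the embedded list_head of the following module, and the struct base is recovered by subtracting the list_head's offset (container_of). A minimal sketch of that traversal over a fake address space -- this is illustrative Python, not crash's actual code, and the offset of 4 is an assumption on my part:

```python
# Sketch: how crash-style list_head traversal turns one garbage read
# into a cascade of bogus addresses. "memory" is a fake address space;
# LIST_OFFSET = 4 is an assumed offsetof(struct module, list), i.e.
# a single 4-byte "state" field preceding the embedded list_head.

LIST_OFFSET = 4

def walk_modules(memory, first_module, max_iter=16):
    """Follow list.next pointers through a fake 32-bit address space."""
    seen = []
    addr = first_module
    for _ in range(max_iter):
        struct = memory.get(addr)
        if struct is None:
            raise ValueError("invalid kernel virtual address: %x" % addr)
        seen.append(addr)
        # container_of(list.next, struct module, list):
        addr = (struct["list_next"] - LIST_OFFSET) & 0xffffffff
        if addr == first_module:
            break   # circular list: back at the starting module
    return seen

# If the garbage struct's list.next field happened to read back as zero,
# the computed "next module" would be 0 - 4 as a 32-bit value:
print(hex((0 - LIST_OFFSET) & 0xffffffff))
```

That value, 0xfffffffc, is exactly the bogus address in your log -- which is speculation on my part, but it's at least consistent with the first read returning zeroed or garbage data rather than a real module struct.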
For example, here I'm in a "bash" context, similar to the context that your dumpfile comes up in by default:

  crash> set
      PID: 19839
  COMMAND: "bash"
     TASK: f7b03000  [THREAD_INFO: def66000]
      CPU: 1
    STATE: TASK_INTERRUPTIBLE
  crash>

Dump the virtual memory regions, and find the first VMA that is backed by "/bin/bash":

  crash> vm
  PID: 19839   TASK: f7b03000  CPU: 1   COMMAND: "bash"
     MM       PGD      RSS    TOTAL_VM
  f6dc5740  f745c9c0  1392k    4532k
    VMA       START      END     FLAGS  FILE
  f69019bc    6fa000   703000       75  /lib/libnss_files-2.5.so
  f69013e4    703000   704000   100071  /lib/libnss_files-2.5.so
  f6901d84    704000   705000   100073  /lib/libnss_files-2.5.so
  f6901284    a7c000   a96000      875  /lib/ld-2.5.so
  f6901b74    a96000   a97000   100871  /lib/ld-2.5.so
  f6901b1c    a97000   a98000   100873  /lib/ld-2.5.so
  f69012dc    a9a000   bd7000       75  /lib/libc-2.5.so
  f690185c    bd7000   bd9000   100071  /lib/libc-2.5.so
  f6901ac4    bd9000   bda000   100073  /lib/libc-2.5.so
  f69017ac    bda000   bdd000   100073
  f6901e8c    bdf000   be1000       75  /lib/libdl-2.5.so
  f6901a6c    be1000   be2000   100071  /lib/libdl-2.5.so
  f6901754    be2000   be3000   100073  /lib/libdl-2.5.so
  f6901f94    c89000   c8c000       75  /lib/libtermcap.so.2.0.8
  f69016fc    c8c000   c8d000   100073  /lib/libtermcap.so.2.0.8
  f6901d2c    fd1000   fd2000  8000075
  f6901124   8047000  80f5000     1875  /bin/bash
  f69018b4   80f5000  80fa000   101873  /bin/bash
  f6901964   80fa000  80ff000   100073
  f690122c   9a75000  9a96000   100073
  f680890c  b7d7f000 b7f7f000       71  /usr/lib/locale/locale-archive
  f6901f3c  b7f7f000 b7f81000   100073
  f68cfb74  b7f82000 b7f84000   100073
  f6dd69bc  b7f84000 b7f8b000       d1  /usr/lib/gconv/gconv-modules.cache
  f69014ec  bf86e000 bf884000   100173
  crash>

You can see above that, in my case, the text region starts at user virtual address 8047000.
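If you'd rather script that lookup than eyeball it, the FILE column makes it a one-liner to filter. A small sketch, assuming the five-column VMA/START/END/FLAGS/FILE layout shown above (the parsing here is my own illustration, not a crash feature):

```python
# Sketch: pick the first VMA backed by a given file from "vm" output.
# Assumes the five-column VMA/START/END/FLAGS/FILE layout; anonymous
# VMAs (no FILE column) and header lines are skipped automatically
# because they don't split into exactly five fields ending in the path.

def first_vma(vm_output, path):
    """Return the START address of the first VMA mapping `path`, or None."""
    for line in vm_output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[4] == path:
            return int(fields[1], 16)   # START column is hex
    return None

# Abbreviated sample of the output above:
sample = """\
f6901d2c    fd1000   fd2000  8000075
f6901124   8047000  80f5000     1875  /bin/bash
f69018b4   80f5000  80fa000   101873  /bin/bash
"""
print(hex(first_vma(sample, "/bin/bash")))
```

On the full output above, this picks out 8047000, the base of the text segment that the ELF check below is aimed at.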
That address actually points to the ELF header at the beginning of the "/bin/bash" file, which starts with a 0x7f byte followed by the ascii "ELF" characters:

  crash> rd 8047000
  8047000:  464c457f    .ELF
  crash>

(The 464c457f shown there is just the magic bytes 7f 45 4c 46 -- "\x7fELF" -- displayed as a little-endian 32-bit word.)

You might want to use "rd -u <address>" to ensure that crash presumes the address is a user address, just in case that's an issue with your setup.

Anyway, try the above, and also dump out and save the output of these debug commands:

  crash> help -m > help.m
  crash> help -k > help.k
  crash> help -v > help.v

But again, given that you seem to be saying that everything works just fine on the live system, the debugging of this issue will most likely end up requiring that you determine where exactly things "go wrong" with the dumpfile, in comparison to the same things working correctly on the live system.

Thanks,
  Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility