Hi, Dave

On Wed, Aug 21, 2013 at 12:16 PM, Dave Anderson <anderson@xxxxxxxxxx> wrote:
>
> ----- Original Message -----
>> Hi,
>>
>> It is not clear whether this is a 3.11 issue or just general memory
>> corruption, but I clearly cannot load slab information from any of my
>> 3.11 dumps. The slab info contains an incorrect pointer and "crash"
>> just drops all slab information.
>
> Did all of your 3.11 dumps fail in a similar manner?

Initially I saw the same issue on at least 3 different crashes, so I
thought that it might be 3.11-specific. But with a new dump that I got
just now, I no longer see the "invalid kernel virtual address" message.
Instead, when I run "kmem -S" I get the following output:

======================================================
kmem: invalid structure member offset: kmem_cache_s_lists
      FILE: memory.c  LINE: 8955  FUNCTION: do_slab_chain_percpu_v2()

[/usr/local/google/home/anatol/sources/opensource/crash/crash] error trace: 493f1d => 47eca0 => 517642 => 460b22
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE

  460b22: OFFSET_verify.part.28+71
  517642: OFFSET_verify+50
  47eca0: do_slab_chain_percpu_v2+96
  493f1d: dump_kmem_cache_percpu_v2+2205

kmem: invalid structure member offset: kmem_cache_s_lists
      FILE: memory.c  LINE: 8955  FUNCTION: do_slab_chain_percpu_v2()
======================================================

I have no idea what it means. I'll try to find more time later this week
and look into the problem more deeply.

> When you run crash on the live system (on the system that crashed), do you
> see the same "array cache limit" error message each time?
>
>> Crash expects that kmem_cache.array contains either a valid pointer or
>> NULL, and most slabs indeed have valid data. But there is also a struct
>> like the one below, for which crash says:
>>
>> crash: invalid kernel virtual address: 28  type: "array cache limit"
>>
>> Dave, do you have any pointers as to what it could be?
>
> I'm trying to understand whether this is something new in the slab code,
> or whether it is in fact memory corruption.
> That's why I'm interested
> in whether you see the same thing on a live system.
>
>> If this is memory corruption in one of the slabs, would it be better
>> if "crash" just skipped that slab (or marked it as 'failed to load')
>> and showed info for the rest of the caches?
>
> I suppose that could be done, but there is no recording of the kmem_cache
> slab list entries during initialization, so there's currently nothing that
> could be "marked". But perhaps a list of "bad" slab caches could be created
> by kmem_cache_init() which could be consulted later when "kmem -[sS]"
> cycles through the slab list.
>
> Anyway, with respect to possible memory corruption:
>
>> struct kmem_cache {
>>   batchcount = 16,
>>   limit = 32,
>>   shared = 8,
>>   size = 64,
>>   reciprocal_buffer_size = 67108864,
>>   flags = 10240,
>>   num = 59,
>>   gfporder = 0,
>>   allocflags = 0,
>>   colour = 0,
>>   colour_off = 64,
>>   slabp_cache = 0x0 <irq_stack_union>,
>>   slab_size = 320,
>>   ctor = 0x0 <irq_stack_union>,
>>   name = 0xffff88065786ea60 "fib6_nodes",
>>   list = {
>>     next = 0xffff880654d9d618,
>>     prev = 0xffff880c50f24f98
>>   },
>>   refcount = 1,
>>   object_size = 48,
>>   align = 64,
>>   num_active = 5,
>>   num_allocations = 32,
>>   high_mark = 20,
>>   grown = 1,
>>   reaped = 0,
>>   errors = 0,
>>   max_freeable = 0,
>>   node_allocs = 0,
>>   node_frees = 0,
>>   node_overflow = 0,
>>   allochit = {
>>     counter = 3
>>   },
>>   allocmiss = {
>>     counter = 2
>>   },
>>   freehit = {
>>     counter = 0
>>   },
>>   freemiss = {
>>     counter = 0
>>   },
>>   obj_offset = 0,
>>   memcg_params = 0x0 <irq_stack_union>,
>>   node = 0xffff880654da37b0,
>>   array = {0xffff880654d68e00, 0xffff880654d69e00, 0xffff880654d6ae00,
>>     0xffff880654da4e00, 0xffff880654da5e00, 0xffff880654da6e00,
>>     0xffff880c50faee00, 0xffff880c50fb2e00, 0xffff880c51722e00,
>>     0xffff880c51723e00, 0xffff880c4e83de00, 0xffff880c5172be00,
>>     0xffff880654da7e00, 0xffff880654da8e00, 0xffff880654da9e00,
>>     0xffff880654daae00, 0xffff880654dabe00, 0xffff880654dade00,
>>     0xffff880c4e83ee00,
>>     0xffff880c4e83fe00, 0xffff880c5172ae00, 0xffff880c51724e00,
>>     0xffff880c50e69e00, 0xffff880c510b3e00, 0xffff880654e07a80,
>>     0xffff880c4e903940, 0x24 <irq_stack_union+36>, 0xc000100003c1b,
>>     0xe7e0 <ftrace_stack+2016>, 0x19 <irq_stack_union+25>,
>>     0x1000200003c2a, 0x54670, 0x1fe <irq_stack_union+510>,
>>     0xc000100003c48, 0xe8d7 <ftrace_stack+2263>, 0xf <irq_stack_union+15>,
>>     0xc000100003c57, 0xe8c0 <ftrace_stack+2240>, 0x17 <irq_stack_union+23>,
>>     0x1000200003c66, 0x54870, 0x2a3 <irq_stack_union+675>,
>>     0xc000100003c79, 0xe8a0 <ftrace_stack+2208>, 0x13 <irq_stack_union+19>,
>>     0x19000100003c88, 0x0 <irq_stack_union>, 0x28 <irq_stack_union+40>,
>>     0x1000200003c99, 0x54b20, 0x917 <irq_stack_union+2327>,
>>     0xc000100003cab, 0xe8f0 <ftrace_stack+2288>, 0x16 <irq_stack_union+22>,
>>     0x1000200003cba, 0x55440, 0x30 <irq_stack_union+48>,
>>     0x1000200003cce, 0x55470, 0x2bb <irq_stack_union+699>,
>>     0xc000100003ce5, 0xe930 <ftrace_stack+2352>, 0x17 <irq_stack_union+23>,
>>     0x1000200003cf4, 0x55730, 0x38 <irq_stack_union+56>,
>>     0x1000200003d0a, 0x55770}
>> }
>>
>> "crash" fails on array[26], which has the value "0x24 <irq_stack_union+36>".
>
> The most recent CONFIG_SLAB kernel I have is 3.6-era, but there should
> be at least one array_cache pointer for each cpu. The output above
> shows 26 legitimate-looking pointers, and then a bunch of nonsensical
> data. How many cpus does the system have?
>
> You did say that some of the slab caches have valid data.
> So if you
> bring the system up with "crash -d3 ...", you will see the CRASHDEBUG(3)
> output from this part of max_cpudata_limit():
>
>         if (CRASHDEBUG(3))
>                 fprintf(fp, "kmem_cache: %lx\n", cache);
>
>         if (!readmem(cache+OFFSET(kmem_cache_s_array),
>             KVADDR, &cpudata[0],
>             sizeof(ulong) * ARRAY_LENGTH(kmem_cache_s_array),
>             "array cache array", RETURN_ON_ERROR))
>                 goto bail_out;
>
>         for (i = max_limit = 0; (i < kt->cpus) && cpudata[i]; i++) {
>                 if (!readmem(cpudata[i]+OFFSET(array_cache_limit),
>                     KVADDR, &limit, sizeof(int),
>                     "array cache limit", RETURN_ON_ERROR))
>                         goto bail_out;
>                 if (CRASHDEBUG(3))
>                         fprintf(fp, "  array limit[%d]: %d\n", i, limit);
>                 if (limit > max_limit)
>                         max_limit = limit;
>         }
>
> which on a Linux 3.6-era, 4-cpu dumpfile I have on hand, while cycling
> through the kmem_cache slab list during initialization, looks like this:
>
>   $ crash vmlinux vmcore
>   ... [ cut ] ...
>   please wait... (gathering kmem slab cache data)
>   kmem_cache_downsize: 32896 to 160
>   kmem_cache: ffff88007a076540
>     array limit[0]: 120
>     array limit[1]: 120
>     array limit[2]: 120
>     array limit[3]: 120
>     shared node limit[0]: 480
>   kmem_cache: ffff88007a076480
>     array limit[0]: 54
>     array limit[1]: 54
>     array limit[2]: 54
>     array limit[3]: 54
>     shared node limit[0]: 216
>   kmem_cache: ffff88007a076780
>     array limit[0]: 120
>     array limit[1]: 120
>     array limit[2]: 120
>     array limit[3]: 120
>     shared node limit[0]: 480
>   ...
>
> Do you see one or more caches that are OK, and then the one
> that generates the "array cache limit" read error? Or does
> the very first one fail?
>
> Anyway, presuming that it's not a problem with all slabs, for now you
> could try having max_cpudata_limit() just return whatever the "max_limit"
> is for that slab cache, i.e., something like this:
>
>         for (i = max_limit = 0; (i < kt->cpus) && cpudata[i]; i++) {
>                 if (!readmem(cpudata[i]+OFFSET(array_cache_limit),
>                     KVADDR, &limit, sizeof(int),
>                     "array cache limit", RETURN_ON_ERROR))
> -                       goto bail_out;
> +                       return max_limit;
>                 if (CRASHDEBUG(3))
>                         fprintf(fp, "  array limit[%d]: %d\n", i, limit);
>                 if (limit > max_limit)
>                         max_limit = limit;
>         }
>
> Later on, if you run "kmem -[sS]", it will presumably go off into
> the weeds when it tries to walk through the suspect kmem_cache(s).
> And if so, you can then use the "kmem -I" option in conjunction with
> "kmem -[sS]" to ignore the suspect cache(s).
>
> Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility