----- Original Message -----

> Hi,
>
> Not clear if it is a 3.11 issue or just general memory corruption. But
> I clearly cannot load slab information from any of my 3.11 dumps. Slab
> info contains an incorrect pointer and "crash" just drops all slab
> information.

Did all of your 3.11 dumps fail in a similar manner?  When you run crash
on the live system (on the system that crashed), do you see the same
"array cache limit" error message each time?

> Crash expects that kmem_cache.array contains either a valid pointer or
> NULL. And most slabs indeed have valid data. But there is also a struct
> like the one below, and crash says:
>
>   crash: invalid kernel virtual address: 28  type: "array cache limit"
>
> Dave, do you have any pointer as to what it can be?

I'm trying to understand if this is something new in the slab code, or
whether it is in fact memory corruption.  That's why I'm interested in
whether you see the same thing on a live system.

> If this is memory corruption in one of the slabs, would it be better
> if "crash" just skipped this slab (or marked it as 'failed to load')
> and showed info for the rest of the caches?

I suppose that could be done, but there is no recording of the
kmem_cache slab list entries during initialization, so there's currently
nothing that could be "marked".  But perhaps a list of "bad" slab caches
could be created by kmem_cache_init(), which could be consulted later
when "kmem -[sS]" cycles through the slab list.
Anyway, with respect to possible memory corruption:

> struct kmem_cache {
>   batchcount = 16,
>   limit = 32,
>   shared = 8,
>   size = 64,
>   reciprocal_buffer_size = 67108864,
>   flags = 10240,
>   num = 59,
>   gfporder = 0,
>   allocflags = 0,
>   colour = 0,
>   colour_off = 64,
>   slabp_cache = 0x0 <irq_stack_union>,
>   slab_size = 320,
>   ctor = 0x0 <irq_stack_union>,
>   name = 0xffff88065786ea60 "fib6_nodes",
>   list = {
>     next = 0xffff880654d9d618,
>     prev = 0xffff880c50f24f98
>   },
>   refcount = 1,
>   object_size = 48,
>   align = 64,
>   num_active = 5,
>   num_allocations = 32,
>   high_mark = 20,
>   grown = 1,
>   reaped = 0,
>   errors = 0,
>   max_freeable = 0,
>   node_allocs = 0,
>   node_frees = 0,
>   node_overflow = 0,
>   allochit = {
>     counter = 3
>   },
>   allocmiss = {
>     counter = 2
>   },
>   freehit = {
>     counter = 0
>   },
>   freemiss = {
>     counter = 0
>   },
>   obj_offset = 0,
>   memcg_params = 0x0 <irq_stack_union>,
>   node = 0xffff880654da37b0,
>   array = {0xffff880654d68e00, 0xffff880654d69e00, 0xffff880654d6ae00,
>     0xffff880654da4e00, 0xffff880654da5e00, 0xffff880654da6e00,
>     0xffff880c50faee00, 0xffff880c50fb2e00, 0xffff880c51722e00,
>     0xffff880c51723e00, 0xffff880c4e83de00, 0xffff880c5172be00,
>     0xffff880654da7e00, 0xffff880654da8e00, 0xffff880654da9e00,
>     0xffff880654daae00, 0xffff880654dabe00, 0xffff880654dade00,
>     0xffff880c4e83ee00, 0xffff880c4e83fe00, 0xffff880c5172ae00,
>     0xffff880c51724e00, 0xffff880c50e69e00, 0xffff880c510b3e00,
>     0xffff880654e07a80, 0xffff880c4e903940, 0x24 <irq_stack_union+36>,
>     0xc000100003c1b, 0xe7e0 <ftrace_stack+2016>, 0x19 <irq_stack_union+25>,
>     0x1000200003c2a, 0x54670, 0x1fe <irq_stack_union+510>,
>     0xc000100003c48, 0xe8d7 <ftrace_stack+2263>, 0xf <irq_stack_union+15>,
>     0xc000100003c57, 0xe8c0 <ftrace_stack+2240>, 0x17 <irq_stack_union+23>,
>     0x1000200003c66, 0x54870, 0x2a3 <irq_stack_union+675>,
>     0xc000100003c79, 0xe8a0 <ftrace_stack+2208>, 0x13 <irq_stack_union+19>,
>     0x19000100003c88, 0x0 <irq_stack_union>, 0x28 <irq_stack_union+40>,
>     0x1000200003c99,
>     0x54b20, 0x917 <irq_stack_union+2327>, 0xc000100003cab,
>     0xe8f0 <ftrace_stack+2288>, 0x16 <irq_stack_union+22>,
>     0x1000200003cba, 0x55440, 0x30 <irq_stack_union+48>,
>     0x1000200003cce, 0x55470, 0x2bb <irq_stack_union+699>,
>     0xc000100003ce5, 0xe930 <ftrace_stack+2352>, 0x17 <irq_stack_union+23>,
>     0x1000200003cf4, 0x55730, 0x38 <irq_stack_union+56>,
>     0x1000200003d0a, 0x55770}
> }
>
> "crash" fails on array[26], which has the value "0x24 <irq_stack_union+36>"

The most recent CONFIG_SLAB kernel I have is 3.6-era, but there should
be at least one array_cache pointer for each cpu.  The output above
shows 26 legitimate-looking pointers followed by a bunch of nonsensical
data.  How many cpus does the system have?

You did say that some of the slab caches have valid data.  So if you
bring the system up with "crash -d3 ...", you will see the CRASHDEBUG(3)
output from this part of max_cpudata_limit():

        if (CRASHDEBUG(3))
                fprintf(fp, "kmem_cache: %lx\n", cache);

        if (!readmem(cache+OFFSET(kmem_cache_s_array),
            KVADDR, &cpudata[0],
            sizeof(ulong) * ARRAY_LENGTH(kmem_cache_s_array),
            "array cache array", RETURN_ON_ERROR))
                goto bail_out;

        for (i = max_limit = 0; (i < kt->cpus) && cpudata[i]; i++) {
                if (!readmem(cpudata[i]+OFFSET(array_cache_limit),
                    KVADDR, &limit, sizeof(int),
                    "array cache limit", RETURN_ON_ERROR))
                        goto bail_out;

                if (CRASHDEBUG(3))
                        fprintf(fp, "  array limit[%d]: %d\n", i, limit);

                if (limit > max_limit)
                        max_limit = limit;
        }

On a Linux 3.6-era, 4-cpu dumpfile I have on hand, as crash cycles
through the kmem_cache slab list during initialization, it looks like
this:

  $ crash vmlinux vmcore
  ...
  [ cut ]
  ...
  please wait...
  (gathering kmem slab cache data)
  kmem_cache_downsize: 32896 to 160
  kmem_cache: ffff88007a076540
    array limit[0]: 120
    array limit[1]: 120
    array limit[2]: 120
    array limit[3]: 120
    shared node limit[0]: 480
  kmem_cache: ffff88007a076480
    array limit[0]: 54
    array limit[1]: 54
    array limit[2]: 54
    array limit[3]: 54
    shared node limit[0]: 216
  kmem_cache: ffff88007a076780
    array limit[0]: 120
    array limit[1]: 120
    array limit[2]: 120
    array limit[3]: 120
    shared node limit[0]: 480
  ...

Do you see one or more caches that are OK, and then the one that
generates the "array cache limit" read error?  Or does the very first
one fail?

Anyway, presuming it's not a problem with all slabs, for now you could
try having max_cpudata_limit() just return whatever "max_limit" has
accumulated for that slab cache at the point of failure, i.e.,
something like this:

        for (i = max_limit = 0; (i < kt->cpus) && cpudata[i]; i++) {
                if (!readmem(cpudata[i]+OFFSET(array_cache_limit),
                    KVADDR, &limit, sizeof(int),
                    "array cache limit", RETURN_ON_ERROR))
-                       goto bail_out;
+                       return max_limit;

                if (CRASHDEBUG(3))
                        fprintf(fp, "  array limit[%d]: %d\n", i, limit);

                if (limit > max_limit)
                        max_limit = limit;
        }

Later on, if you run "kmem -[sS]", it will presumably go off into the
weeds when it tries to walk through the suspect kmem_cache(s).  And if
so, you can then use the "kmem -I" option in conjunction with
"kmem -[sS]" to ignore the suspect cache(s).

Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility