Chandru wrote:
Look at the crash function get_idle_threads() in task.c, which is where
you're failing. It runs through the history of the symbols that Linux
has used over the years for the run queues. For the most recent kernels,
it looks for the "per_cpu__runqueues" symbol. At least on 2.6.25-rc2,
the kernel still defines them in kernel/sched.c like this:
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
So if you do an "nm -Bn vmlinux | grep runqueues", you should see:
# nm -Bn vmlinux-2.6.25-rc1-ext4-1 | grep runqueues
ffffffff8082b700 d per_cpu__runqueues
#
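(The per_cpu__ prefix comes from the per-cpu definition machinery of
that era -- roughly speaking, and simplified here rather than the
exact 2.6.25 macro text, DEFINE_PER_CPU token-pastes the prefix onto
the variable name:)

        /* Simplified sketch of the old per-cpu machinery (not the
         * exact 2.6.25 macro text): the prefix is token-pasted onto
         * the variable name, so "runqueues" becomes the vmlinux
         * symbol "per_cpu__runqueues".  DEFINE_PER_CPU_SHARED_ALIGNED
         * does the same with a different section and alignment. */
        #define DEFINE_PER_CPU(type, name) \
                __attribute__((__section__(".data.percpu"))) \
                __typeof__(type) per_cpu__##name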
I'm guessing that's not the problem -- so presuming that the symbol
*does* exist, find out why it's failing to increment "cnt" in this
part of get_idle_threads():
        if (symbol_exists("per_cpu__runqueues") &&
            VALID_MEMBER(runqueue_idle)) {
                runqbuf = GETBUF(SIZE(runqueue));
                for (i = 0; i < nr_cpus; i++) {
                        if ((kt->flags & SMP) && (kt->flags & PER_CPU_OFF)) {
                                runq = symbol_value("per_cpu__runqueues") +
                                        kt->__per_cpu_offset[i];
                        } else
                                runq = symbol_value("per_cpu__runqueues");

                        readmem(runq, KVADDR, runqbuf, SIZE(runqueue),
                                "runqueues entry (per_cpu)", FAULT_ON_ERROR);
                        tasklist[i] = ULONG(runqbuf + OFFSET(runqueue_idle));
                        if (IS_KVADDR(tasklist[i]))
                                cnt++;
                }
        }
Determine whether it even makes it to the inner for loop, whether the
predetermined nr_cpus value makes sense, whether the SMP flag reflects
a kernel compiled for SMP, whether the PER_CPU_OFF flag was set, what
address was calculated, and so on.
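If you can rebuild crash, the quickest way to gather all of that is a
couple of temporary fprintf's in get_idle_threads() -- a debugging
sketch, not something in the crash sources:

        /* Temporary debugging sketch: dump the values in question.
         * Put the first fprintf just before the for loop, and the
         * second at the bottom of the loop body, after tasklist[i]
         * has been set. */
        fprintf(stderr, "nr_cpus=%d SMP=%d PER_CPU_OFF=%d\n",
                nr_cpus,
                (kt->flags & SMP) ? 1 : 0,
                (kt->flags & PER_CPU_OFF) ? 1 : 0);

        fprintf(stderr, "cpu %d: offset=%lx runq=%lx idle=%lx\n",
                i, kt->__per_cpu_offset[i], runq, tasklist[i]);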
Dave
Thanks for the reply, Dave. The code does make it to the inner for
loop, but the condition if (IS_KVADDR(tasklist[i])) fails, which is
why 'cnt' doesn't get incremented. tasklist[i] ends up with a value
like 0x3d60657870722024.
I ran gdb on the vmcore file and printed the memory contents:
(gdb) print per_cpu__runqueues
$1 = {lock = {raw_lock = {slock = 1431524419}},
  nr_running = 5283422954284598606,
  raw_weighted_load = 5064663116585906736,
  cpu_load = {2316051155752670036, 5929356451801411872, 2613857225664584019},
  nr_switches = 5644502509443686462,
  nr_uninterruptible = 2316072106569976142,
  expired_timestamp = 5142904381182533935,
  timestamp_last_tick = 7235439831918129227,
  curr = 0x5f66696c650a5243,
  idle = 0x3d60657870722024,   <<<-----
  prev_mm = 0x5243202b20243f60,
  active = 0xa247b4155535443,
  expired = 0x5352434449527d2f,
Does this mean the kernel data was corrupted when the vmcore was
collected?
I don't know.
You cannot expect gdb to be able to handle it at all, unless
the kernel was configured without CONFIG_SMP. In that case,
the per_cpu__runqueues symbol points to the singular instance
of an rq.
However, more likely your kernel is configured with CONFIG_SMP.
In that case, a per-cpu offset has to be applied to the symbol
value of per_cpu__runqueues to calculate where each cpu's instance
of its rq structure is located. I can guarantee you that gdb
cannot do that, and that's probably why you're seeing "garbage"
data above.
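Incidentally, those "garbage" values decode to printable ASCII, which
backs that up: the bytes of the bogus "idle" pointer spell "=`expr $"
and the "expired" value spells "SRCDIR}/" -- script-like text, i.e.
whatever unrelated data happens to sit at the raw symbol address. A
throwaway decoder (my own sketch, nothing from crash or gdb) shows it:

        #include <ctype.h>
        #include <stdio.h>

        /* Throwaway sketch: print a 64-bit value's bytes as
         * characters, most-significant byte first. */
        static void decode(unsigned long long v)
        {
                int i;

                for (i = 7; i >= 0; i--) {
                        int c = (int)(v >> (i * 8)) & 0xff;
                        putchar(isprint(c) ? c : '.');
                }
                putchar('\n');
        }

        int main(void)
        {
                decode(0x3d60657870722024ULL);  /* idle    -> "=`expr $" */
                decode(0x5352434449527d2fULL);  /* expired -> "SRCDIR}/" */
                return 0;
        }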
You can see that being handled in the get_idle_threads() function
above, where the "runq" address is calculated each time through the
loop: if the kernel is configured CONFIG_SMP, it adds the per-cpu
offset value; otherwise it uses the symbol value of
"per_cpu__runqueues" as is.
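To make the arithmetic concrete, here's a standalone sketch using the
symbol address from the nm output above and made-up per-cpu offsets
(the real offsets come from the dump, not from these numbers):

        #include <stdio.h>

        /* Standalone sketch of the per-cpu address arithmetic.  The
         * symbol address is the one from the nm output above; the
         * offsets are hypothetical placeholders. */
        int main(void)
        {
                unsigned long long sym = 0xffffffff8082b700ULL; /* per_cpu__runqueues */
                unsigned long long per_cpu_offset[] = {         /* made-up values */
                        0x10000ULL, 0x18000ULL,
                };
                int cpu;

                for (cpu = 0; cpu < 2; cpu++)
                        printf("cpu %d rq @ %#llx\n",
                               cpu, sym + per_cpu_offset[cpu]);
                return 0;
        }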
As I suggested before, you're going to have to determine why
the tasklist[i] is bogus. The first things to determine are:
(1) what "nr_cpus" was calculated to be, and
(2) whether the SMP and PER_CPU_OFF flags are set in kt->flags.
If those variables/settings make sense, then presumably the
problem is in the determination of the per-cpu offset values.
That's done in a machine-specific way, so I can't help you
without knowing what architecture you're dealing with, not
to mention what kernel version, or whether it's configured
CONFIG_SMP or not, and whether you can run crash on the live
system that generated the dumpfile.
Dave
--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility