(Was Re: mount cmd crashes crash)

On Thu, 2010-08-19 at 12:45 +0000, Dave Anderson wrote:
> ----- "Bob Montgomery" <bob.montgomery@xxxxxx> wrote:
>
> > > Yeah, it's not important to use the context of pid 1, but it just needs
> > > some context, and I had presumed that init would always exist. I thought
> > > that the panic("Attempted to kill the idle task!") in do_exit() would
> > > prevent pid 1 from ever going away -- but apparently your kernel figured
> > > out how to do it elsewhere... ;-)
> >
> > That test is for PID 0, not PID 1 (at least on the kernel I'm
> > debugging.) However, there is this also:
> >
> >         if (unlikely(tsk == child_reaper))
> >                 panic("Attempted to kill init!");
>
> That's the one I *meant*... ;-)
>
> >
> > And child_reaper in the dump points to a task struct for init that isn't
> > in the ps listing. Hmmm. Maybe that part *is* interesting in this dump...

Well, I've been picking at this some more. PID 1 is in the system, but
crash misses it when it's building its table of tasks in
refresh_hlist_task_table_v2(). In fact, on my particular dump, it loses
track of at least 3 processes.

The attached patch changes that behavior. It has to do with collisions
on the pid_hash table where an early item on the chain has a NULL task
pointer, which causes the code to ignore subsequent items on that
collision chain. I'm not sure what it means when the tasks[0].first
pointer in the struct pid is NULL, but that's what triggers the problem
and keeps crash from following the pid_chain pointer to the next struct
pid.

I am not confident that this whole area is correct yet, just closer to
correct than it was.
These now appear in the ps output:

crash-5.0.6-fix2> ps 1 8144 998
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
   8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
    998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]

where before:

crash-5.0.6-fix> ps 1 8144 998
ps: invalid task or pid value: 1
ps: invalid task or pid value: 8144
ps: invalid task or pid value: 998

This might have been some transition behavior of the pid hash design in
the kernel, because I've got two dumps based on 2.6.18 kernels that show
missing processes (this one had 3 out of 532, the other had 1 out of
146), but my new patched crash doesn't reveal any missing processes in
2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
from 362 to 926).

Only my recent 2.6.18 dump was lucky enough to be missing PID 1, with me
being lucky enough to try crash's mount command, or we'd still not know
about it :-)

The patch is simple, but has lots of lines because I moved the indent.

Bob Montgomery
Working at HP
--- task.c.orig	2010-08-25 15:38:12.000000000 -0600
+++ task.c	2010-08-25 15:45:37.000000000 -0600
@@ -1747,30 +1747,32 @@ retry_pid_hash:
 			console("pid_hash[%d]: %lx task: %lx (node: %lx) next: %lx pprev: %lx\n",
 				i, pid_hash[i], next, kpp, pnext, pprev);
 
-		while (next) {
-			if (!IS_TASK_ADDR(next)) {
-				error(INFO,
-				    "%sinvalid task address in pid_hash: %lx\n",
-					DUMPFILE() ? "\n" : "", next);
-				if (DUMPFILE())
-					break;
-				hq_close();
-				retries++;
-				goto retry_pid_hash;
+		while (1) {
+			if (next) {
+				if (!IS_TASK_ADDR(next)) {
+					error(INFO,
+					    "%sinvalid task address in pid_hash: %lx\n",
+						DUMPFILE() ? "\n" : "", next);
+					if (DUMPFILE())
+						break;
+					hq_close();
+					retries++;
+					goto retry_pid_hash;
-			}
+				}
-			if (!is_idle_thread(next) && !hq_enter(next)) {
-				error(INFO,
-				    "%sduplicate task in pid_hash: %lx\n",
-					DUMPFILE() ? "\n" : "", next);
-				if (DUMPFILE())
-					break;
-				hq_close();
-				retries++;
-				goto retry_pid_hash;
-			}
+				if (!is_idle_thread(next) && !hq_enter(next)) {
+					error(INFO,
+					    "%sduplicate task in pid_hash: %lx\n",
+						DUMPFILE() ? "\n" : "", next);
+					if (DUMPFILE())
+						break;
+					hq_close();
+					retries++;
+					goto retry_pid_hash;
+				}
+			}
 			cnt++;
 			if (!pnext)
--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility