Missing PID 1 is crash problem with losing tasks

Bob Montgomery <bob.montgomery@xxxxxx> · Wed, 25 Aug 2010 17:23:32 -0600

(Was Re:  mount cmd crashes crash)

On Thu, 2010-08-19 at 12:45 +0000, Dave Anderson wrote:
> ----- "Bob Montgomery" <bob.montgomery@xxxxxx> wrote:

> > > Yeah, it's not important to use the context of pid 1, but it just needs
> > > some context, and I had presumed that init would always exist.  I thought
> > > that the panic("Attempted to kill the idle task!") in do_exit() would
> > > prevent pid 1 from ever going away -- but apparently your kernel figured
> > > out how to do it elsewhere...  ;-)
> > 
> > That test is for PID 0, not PID 1 (at least on the kernel I'm
> > debugging.)  However, there is this also:
> > 
> >         if (unlikely(tsk == child_reaper))
> >                 panic("Attempted to kill init!");
> 
> That's the one I *meant*...   ;-)
> 
> > 
> > And child_reaper in the dump points to a task struct for init that isn't
> > in the ps listing.  Hmmm.  Maybe that part *is* interesting in this dump...

Well, I've been picking at this some more.  PID 1 is in the system, but
crash misses it when it's building its table of tasks in
refresh_hlist_task_table_v2().  In fact, on my particular dump, it loses
track of at least 3 processes. 

The attached patch changes that behavior.  The problem involves
collisions in the pid_hash table: when an early item on a chain has a
NULL task pointer, the code ignores all subsequent items on that
collision chain.  I'm not sure what it means when the tasks[0].first
pointer in the struct pid is NULL, but that's what triggers the problem
and keeps crash from following the pid_chain pointer to the next struct
pid.  I am not confident that this whole area is correct yet, just
closer to correct than it was.
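
To make the failure mode concrete, here is a minimal sketch of the two
loop shapes.  It is not the code from task.c: struct fake_pid and both
function names are hypothetical, the task field only stands in for
tasks[0].first, and the next field stands in for following pid_chain.
It assumes nothing beyond what is described above, namely that a NULL
task pointer early on a chain should not end the walk.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a struct pid on one hash chain. */
struct fake_pid {
	unsigned long task;      /* stands in for tasks[0].first; 0 means NULL */
	struct fake_pid *next;   /* stands in for following pid_chain */
};

/*
 * Old loop shape: the walk is gated on the task pointer, so the first
 * entry with a NULL task ends the walk and hides everything after it.
 */
static int count_tasks_stop_on_null(const struct fake_pid *p)
{
	int found = 0;

	while (p && p->task) {
		found++;
		p = p->next;
	}
	return found;
}

/*
 * Patched loop shape: an entry with a NULL task is skipped, but the
 * chain pointer is still followed until the chain itself ends.
 */
static int count_tasks_skip_null(const struct fake_pid *p)
{
	int found = 0;

	for (; p; p = p->next) {
		if (p->task)
			found++;
	}
	return found;
}
```

On a three-entry chain whose first entry has a NULL task, the first
function finds 0 tasks and the second finds 2, which is the same kind
of loss that hid PID 1 in my dump.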

These now appear in the ps output:

crash-5.0.6-fix2> ps 1 8144 998
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
   8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
    998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]

where before:

crash-5.0.6-fix> ps 1 8144 998
ps: invalid task or pid value: 1

ps: invalid task or pid value: 8144

ps: invalid task or pid value: 998

This might have been some transition behavior of the pid hash design in
the kernel, because I've got two dumps based on 2.6.18 kernels that show
missing processes (this one had 3 out of 532, the other had 1 out of
146), but my new patched crash doesn't reveal any missing processes in
2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
from 362 to 926).  Only my recent 2.6.18 dump was lucky enough to be
missing PID 1, with me being lucky enough to try crash's mount command,
or we'd still not know about it :-)

The patch is simple, but touches a lot of lines because I changed the
indentation.

Bob Montgomery
Working at HP

--- task.c.orig	2010-08-25 15:38:12.000000000 -0600
+++ task.c	2010-08-25 15:45:37.000000000 -0600
@@ -1747,30 +1747,32 @@ retry_pid_hash:
 			console("pid_hash[%d]: %lx task: %lx (node: %lx) next: %lx pprev: %lx\n",
 				i, pid_hash[i], next, kpp, pnext, pprev);
 
-		while (next) {
-                        if (!IS_TASK_ADDR(next)) {
-                                error(INFO,
-                                    "%sinvalid task address in pid_hash: %lx\n",
-                                        DUMPFILE() ? "\n" : "", next);
-                                if (DUMPFILE())
-                                        break;
-                                hq_close();
-                                retries++;
-                                goto retry_pid_hash;
+		while (1) {
+			if (next) {
+                        	if (!IS_TASK_ADDR(next)) {
+                                	error(INFO,
+                                    	"%sinvalid task address in pid_hash: %lx\n",
+                                        	DUMPFILE() ? "\n" : "", next);
+                                	if (DUMPFILE())
+                                        	break;
+                                	hq_close();
+                                	retries++;
+                                	goto retry_pid_hash;
 
-                        }
+                        	}
 
-                        if (!is_idle_thread(next) && !hq_enter(next)) {
-                                error(INFO,
-                                    "%sduplicate task in pid_hash: %lx\n",
-                                        DUMPFILE() ? "\n" : "", next);
-                                if (DUMPFILE())
-                                        break;
-                                hq_close();
-                                retries++;
-                                goto retry_pid_hash;
-                        }
+                        	if (!is_idle_thread(next) && !hq_enter(next)) {
+                                	error(INFO,
+                                    	"%sduplicate task in pid_hash: %lx\n",
+                                        	DUMPFILE() ? "\n" : "", next);
+                                	if (DUMPFILE())
+                                        	break;
+                                	hq_close();
+                                	retries++;
+                                	goto retry_pid_hash;
+                        	}
 
+			}
                         cnt++;
 
 			if (!pnext) 