Re: Fw: [Bugme-new] [Bug 5566] New: scsi_eh_x/scsi_wq_x "zombie" processes in kernel 2.6.13+

Andrew Vasquez <andrew.vasquez@xxxxxxxxxx> · Wed, 23 Nov 2005 15:47:52 -0800

On Fri, 11 Nov 2005, Andrew Vasquez wrote:

> On Fri, 11 Nov 2005, Andrew Morton wrote:
> 
> > Begin forwarded message:
> > 
> > Date: Mon, 7 Nov 2005 14:49:17 -0800
> > From: bugme-daemon@xxxxxxxxxxxxxxxxxxx
> > To: bugme-new@xxxxxxxxxxxxxx
> > Subject: [Bugme-new] [Bug 5566] New: scsi_eh_x/scsi_wq_x "zombie" processes in kernel 2.6.13+
> > 
> > 
> > http://bugzilla.kernel.org/show_bug.cgi?id=5566
> > 
> >            Summary: scsi_eh_x/scsi_wq_x "zombie" processes in kernel 2.6.13+
> >     Kernel Version: 2.6.13+
> >             Status: NEW
> >           Severity: normal
> >              Owner: andrew.vasquez@xxxxxxxxxx
> >          Submitter: gator@xxxxxxxxxxxxxxx
> > 
> > 
> > Most recent kernel where this bug did not occur: 2.6.12
> > Starting around kernel version 2.6.13, the scsi_eh_x and scsi_wq_x
> > processes that are created per scsi host will not terminate if the
> > driver for the scsi interface is removed. I don't know whether there
> > are any serious problems involved with this, but one thing that is
> > definitely annoying, is that the process list fills very quickly when
> > modules are loaded/unloaded on demand, because 2 new processes will
> > be created every time the driver for a scsi adapter gets loaded.
> > 
> > (I guess, this happens with all scsi host modules - in my case, the
> > "culprit" is a qlogic fibre channel driver that gets loaded only when
> > needed.)
> 
> Seems there appear to be some reference-counting problems here, as the
> task trace:

There are definitely some ref-count problems with all fc_rport-aware
drivers.  Basically, rport->dev is not being torn down completely
during fc_rport_terminate().  Unfortunately, I'm going cross-eyed
following the acquisition/release model of rport->dev (so please be
patient)...

After adding some (less than impressive) debugging code to follow the
rport->dev tear-down process, I note that after the
transport_destroy_device() call in fc_rport_terminate(), the
rport->dev still holds a single ref -- the patch below 'fixes' the
problem (and tear-down occurs as it should).  But, I'd still like to
understand 'why' it's needed...

During creation (fc_rport_create()), a reference to rport->dev is
taken during device_init(), another during transport_setup_device(),
two additional refs during device_add(), and another during
transport_add_device(). [side note: refcount is 5].

Several additional refs (4 to be exact) are acquired during
instantiation of the relevant scsi_target (and support) objects.

Now during teardown, the proper number of refs are released during
scsi_remove_target() (via fc_rport_tgt_remove()).  Tear-down continues
with transport_remove_device() [refcount is now 4], then device_del()
[refcount is now 2], and finally transport_destroy_device() [refcount
is now 1].
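
The tally above can be written down as a small ledger.  This is only a
user-space sketch for checking the arithmetic, not kernel code: the
struct, the table, and refs_after() are hypothetical names, and each
delta is simply the count reported in the text for that step.

```c
#include <assert.h>

/* Hypothetical ledger entry: net change in the rport->dev refcount
 * attributed to one step, using the counts quoted in the text. */
struct ref_step {
	const char *step;
	int delta;
};

static const struct ref_step rport_dev_steps[] = {
	/* creation, fc_rport_create() */
	{ "device_init()",                   +1 },
	{ "transport_setup_device()",        +1 },
	{ "device_add()",                    +2 },
	{ "transport_add_device()",          +1 },
	/* instantiation of scsi_target and support objects */
	{ "scsi_target (and support)",       +4 },
	/* teardown, fc_rport_terminate() */
	{ "scsi_remove_target()",            -4 },
	{ "transport_remove_device()",       -1 },
	{ "device_del()",                    -2 },
	{ "transport_destroy_device()",      -1 },
};

/* Sum the first n deltas and return the modeled refcount. */
static int refs_after(const struct ref_step *steps, int n)
{
	int refs = 0;

	for (int i = 0; i < n; i++) {
		refs += steps[i].delta;
		assert(refs >= 0);	/* a count must never go negative */
	}
	return refs;
}
```

Summing the first four entries gives the refcount of 5 noted for
creation, and summing all nine leaves the single outstanding ref.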

At this point the rport is dropped from its peer list, and the
shost_gendev reference (acquired during fc_rport_create()) is dropped.
Unfortunately, rport->dev is still left dangling.
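
To illustrate why a single leftover ref matters, here is a toy model
of a refcounted object whose release step only runs when the count
reaches zero.  It is a sketch under assumptions: toy_dev and its
helpers are hypothetical and are not the kernel's struct device API,
and the 8 gets / 8 puts merely mirror the counts described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted device object. */
struct toy_dev {
	int refs;
	bool released;
};

/* Initialization hands the creator one reference (assumed here,
 * matching the ref attributed to device_init() in the text). */
static void toy_init(struct toy_dev *d)
{
	d->refs = 1;
	d->released = false;
}

static void toy_get(struct toy_dev *d)
{
	assert(d->refs > 0);
	d->refs++;
}

static void toy_put(struct toy_dev *d)
{
	assert(d->refs > 0);
	if (--d->refs == 0)
		d->released = true;	/* stands in for the release callback */
}

/* Report whether the object is released after init, then `gets`
 * acquisitions, then `puts` releases. */
static bool toy_released_after(int gets, int puts)
{
	struct toy_dev d;

	toy_init(&d);
	for (int i = 0; i < gets; i++)
		toy_get(&d);
	for (int i = 0; i < puts; i++)
		toy_put(&d);
	return d.released;
}
```

With 8 gets and 8 puts the object is never released, which is the
dangling state; a ninth put, analogous to the extra put_device(dev) in
the patch, lets the release step run.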

I've skimmed through similar transport-class tear-down code for some
hints, but am still left wondering why...

James B., James S. -- any ideas?  I know I must be missing something
basic -- please set me straight...

Thanks,
AV

---

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 6cd5931..d9f17fe 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -1570,6 +1570,8 @@ fc_rport_terminate(struct fc_rport  *rpo
 	list_del(&rport->peers);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 	put_device(&shost->shost_gendev);
+
+	put_device(dev);
 }
 
 /**