It seems like adding a kernel thread to handle this would work... if we don't want to wait in sg_release or sg_remove, and we can't do the final scsi_device_put from sg_cmd_done. I don't know if it's really worth it though, for something that's probably never going to happen anyway. I think that in later kernels than the 2.6.9 kernel I'm on, scsi_device_dev_release can't sleep (not 100% certain though). So maybe the best solution would be to just change sg_remove so that it doesn't do the scsi_device_put if there's outstanding IO, and let that happen from sg_cmd_done. I don't know if there's anything that could go wrong with that approach... Nate > -----Original Message----- > From: Douglas Gilbert [mailto:dougg@xxxxxxxxxx] > Sent: Tuesday, October 18, 2005 12:55 AM > To: Dailey, Nate > Cc: linux-scsi@xxxxxxxxxxxxxxx > Subject: Re: requesting advice: oops when doing sg_dd IO and > device is removed > > > Nate, > I always suspected that if one really tried hard enough > then unexpectedly closing a file descriptor (e.g. via > control-C on an app) at the same time as there are > outstanding commands could cause problems. Obviously > you have succeeded demonstrating that with your experiment. > > When I fought with this problem in the lk 2.4 series, > someone skilled at finding flaws in locking logic > suggested there was no solution (at least the way > I was doing it) ... > > Waiting a close() call (i.e. sg_release() ), potentially > indefinitely, was very unpopular with application clients. > The sg driver did that in the old days, and I was abused > from several quarters (one name springs to mind). So I > tried to design sg_release() so it wouldn't ever hang > the application client. > > On reflection, I think to allow sg_release() to go > through quickly, another kernel thread is needed that > is passed ownership of unfinished commands. What do > you think? > > > Doug Gilbert > > > Dailey, Nate wrote: > > Doug (and anyone else who might be interested), I'm looking for some > > advice as to how best to fix a problem I hit in sg. > > > > I was doing the following: > > > > - start up several processes reading from an sg device with sg_dd > > - every other iteration, kill the processes > > - remove the device (echo 1 > /sys/bus/scsi/devices/1:0:1:0/delete) > > - add the device (echo "0 1 0" > /sys/class/scsi_host/host1/scan) > > > > This was on a 2.6.9 kernel, with a patch: > > > > http://marc.theaimsgroup.com/?l=linux-scsi&m=112749860222533&w=2 > > > > That I've submitted to the LSML... patch was intended to > fix a different > > oops when doing the above test. > > > > Here's what I saw, after the device was removed: > > > > Debug: sleeping function called from invalid context at > > include/linux/rwsem.h:6in_atomic():1[expected: 0], irqs_disabled():0 > > [<c011fcd1>] __might_sleep+0x7d/0x88 > > [<c021cb5e>] device_del+0x1e/0x90 > > [<f8840ccc>] scsi_device_dev_release+0xdf/0x13a [scsi_mod] > > [<c021c8e9>] device_release+0x11/0x40 > > [<c01bf03f>] kobject_cleanup+0x40/0x60 > > [<c01bf05f>] kobject_release+0x0/0x8 > > [<c01bf305>] kref_put+0x42/0x45 > > [<f883e61b>] scsi_next_command+0xc/0x14 [scsi_mod] > > [<f883a0a8>] scsi_release_request+0x8/0x10 [scsi_mod] > > [<f883ab15>] scsi_finish_command+0xad/0xb1 [scsi_mod] > > [<f883aa3a>] scsi_softirq+0xb6/0xbe [scsi_mod] > > [<c0126424>] __do_softirq+0x4c/0xb1 > > [<c010812f>] do_softirq+0x4f/0x56 > > ======================= > > [<c0107a44>] do_IRQ+0x1a2/0x1ae > > [<c02d1a8c>] common_interrupt+0x18/0x20 > > [<c0104018>] default_idle+0x0/0x2c > > [<c0104041>] default_idle+0x29/0x2c > > [<c010409d>] cpu_idle+0x26/0x3b > > [<c0390786>] start_kernel+0x199/0x19d > > > > (It looks like a command comes back, we drop the last > reference on the > > device, the device is freed... because we're in softirq, > and device_del > > may eventually sleep (in this kernel version, anyway), we > get this debug > > message. I don't think this should normally happen when a > command comes > > back, for reasons I'll discuss below) > > > > Continuing on... > > > > Badness in kref_get at lib/kref.c:32 > > [<c01bf2bb>] kref_get+0x24/0x2c > > [<c01beffb>] kobject_get+0xf/0x13 > > [<c021cb32>] get_device+0xe/0x14 > > [<f883ef34>] scsi_request_fn+0x19/0x319 [scsi_mod] > > [<c022213e>] blk_run_queue+0x20/0x2f > > [<f883a0a8>] scsi_release_request+0x8/0x10 [scsi_mod] > > [<f883ab15>] scsi_finish_command+0xad/0xb1 [scsi_mod] > > [<f883aa3a>] scsi_softirq+0xb6/0xbe [scsi_mod] > > [<c0126424>] __do_softirq+0x4c/0xb1 > > [<c010812f>] do_softirq+0x4f/0x56 > > ======================= > > [<c0107a44>] do_IRQ+0x1a2/0x1ae > > [<c02d1a8c>] common_interrupt+0x18/0x20 > > [<c0104018>] default_idle+0x0/0x2c > > [<c0104041>] default_idle+0x29/0x2c > > [<c010409d>] cpu_idle+0x26/0x3b > > [<c0390786>] start_kernel+0x199/0x19d > > > > (So, there were more commands on the device's queue... > presumably for > > sg, since I wasn't doing any other IO to this disk. We're > trying to get > > a reference on the device, but the ref count has already > gone to 0 so > > this "Badness" message results.) > > > > Continuing on... > > > > Unable to handle kernel NULL pointer dereference at virtual address > > 00000008 > > printing eip: > > c0229678 > > *pde = 00004001 > > Oops: 0000 [#1] > > SMP > > Modules linked in: sg(U) parport_pc lp parport autofs4 nfs > lockd sunrpc > > md5 > > ip) > > CPU: 0 > > EIP: 0060:[<c0229678>] Tainted: PF VLI > > EFLAGS: 00010046 (2.6.9-20.ELsmp) > > EIP is at cfq_next_request+0x7/0x35 > > eax: f7b6f5e0 ebx: 00000000 ecx: c03249bc edx: f243a594 > > esi: f7b6f5e0 edi: f7f5b000 ebp: f7b6f5e0 esp: c03c7f80 > > ds: 007b es: 007b ss: 0068 > > Process swapper (pid: 0, threadinfo=c03c7000 task=c031ea80) > > Stack: f7b6f5e0 f7b6f5e0 c02209dc f7b6f5e0 f243a400 > f7f5b000 f883ef51 > > f7b6f5e0 > > 00000246 f243a400 00000000 c022213e f7d62800 > f3872800 f883a0a8 > > f7f5b000 > > f883ab15 00002002 f3872800 c03c7fd4 f883aa3a > c03c7fd4 c03c7fd4 > > 00000003 > > Call Trace: > > [<c02209dc>] elv_next_request+0xbe/0xce > > [<f883ef51>] scsi_request_fn+0x36/0x319 [scsi_mod] > > [<c022213e>] blk_run_queue+0x20/0x2f > > [<f883a0a8>] scsi_release_request+0x8/0x10 [scsi_mod] > > [<f883ab15>] scsi_finish_command+0xad/0xb1 [scsi_mod] > > [<f883aa3a>] scsi_softirq+0xb6/0xbe [scsi_mod] > > [<c0126424>] __do_softirq+0x4c/0xb1 > > [<c010812f>] do_softirq+0x4f/0x56 > > ======================= > > [<c0107a44>] do_IRQ+0x1a2/0x1ae > > [<c02d1a8c>] common_interrupt+0x18/0x20 > > [<c0104018>] default_idle+0x0/0x2c > > [<c0104041>] default_idle+0x29/0x2c > > [<c010409d>] cpu_idle+0x26/0x3b > > [<c0390786>] start_kernel+0x199/0x19d > > Code: 01 00 00 00 89 e8 eb b6 8b 04 24 3b 43 24 0f 92 c0 31 > d2 85 ff 0f > > 95 c2 > > > > (It was inevitable that something bad would happen when > trying to run > > the queue on a device which had been freed) > > > > > > My theory here is that sg should be holding a reference on the SCSI > > device until all IO comes back. With or without the patch I > mentioned > > above, this is not the case. Although sg_release does take > out an extra > > reference if there's any IO still outstanding, if sg_remove > is called, > > this reference is dropped. So after sg_remove, there could > still be IO > > to the device outstanding, but no reference on the device (from sg). > > This means the device could get freed even though there's > still some sg > > IO on the queue. > > > > Does this analysis sound reasonable? I'm a bit puzzled by > the fact that > > sd (for example) doesn't appear to have this problem (maybe > sd_release > > just never gets called until all IO comes back?). > > > > > > So, how to fix this? > > > > The first thing I tried was adding a kref to the Sg_device struct, > > patterned after the reference counting in sd, sr, st. In > sg_release, I > > would drop a reference only if there was no IO outstanding. If there > > _was_ IO outstanding, I would drop the reference in > sg_cmd_done when the > > last IO came back. In theory, this would ensure that sg > kept a reference > > on the device until all IO came back. > > > > First problem: I was protecting the kref with a mutex, as > sd, sr, etc. > > do. Trying to down the mutex from sg_cmd_done didn't work > (as this is > > done from softirq). Okay, so I took out the mutex, just to see how > > things would work. Still there was a problem... doing the final > > scsi_device_put in sg_cmd_done won't work, because > > scsi_device_dev_release calls device_del, which may sleep (in this > > kernel version, anyway... I think this has changed in later > kernels). So > > it seems trying to drop a reference when the last command > comes back is > > out of the question. > > > > The next thing I tried was simply sleeping in sg_release until all > > outstanding IO comes back (changing sg_remove_sfp to return > a count of > > outstanding IOs), and only then dropping the device reference (patch > > below). This seems to be working fine. But I'm wondering if this is > > really a good solution? I'm not sure I can think of > anything else... any > > ideas? > > > > Thanks! > > > > Nate Dailey > > Stratus Technologies > > > > > > Here's the "wait in sg_release" patch (which I think is > still useful, > > even if it's not strictly required). This is against a > 2.6.9 kernel, but > > I can resubmit a patch against a later kernel if desired... > > > > --- sg.c.orig 2005-10-13 16:50:00.806798387 -0400 > > +++ sg.c 2005-10-13 16:07:06.365733548 -0400 > > @@ -49,6 +49,7 @@ static int sg_version_num = 30531; /* 2 > > #include <linux/seq_file.h> > > #include <linux/blkdev.h> > > #include <linux/delay.h> > > +#include <linux/kref.h> > > > > #include "scsi.h" > > #include <scsi/scsi_host.h> > > @@ -173,6 +174,7 @@ typedef struct sg_device { /* holds the > > char sgdebug; /* 0->off, 1->sense, > 9->dump dev, 10-> > > all devs */ > > struct gendisk *disk; > > struct cdev * cdev; /* char_dev [sysfs: > > /sys/cdev/major/sg<n>] */ > > + struct kref kref; > > } Sg_device; > > > > static int sg_fasync(int fd, struct file *filp, int mode); > > @@ -218,11 +220,61 @@ static Sg_device **sg_dev_arr = NULL; > > static int sg_dev_max; > > static int sg_nr_dev; > > > > +/* This semaphore is used to mediate the 0->1 reference get in the > > + * face of object destruction (i.e. we can't allow a get on an > > + * object after last put) */ > > +static DECLARE_MUTEX(sg_ref_sem); > > + > > +static void scsi_generic_release(struct kref *kref); > > +#define to_scsi_generic(obj) container_of(obj,Sg_device,kref) > > + > > #define SZ_SG_HEADER sizeof(struct sg_header) > > #define SZ_SG_IO_HDR sizeof(sg_io_hdr_t) > > #define SZ_SG_IOVEC sizeof(sg_iovec_t) > > #define SZ_SG_REQ_INFO sizeof(sg_req_info_t) > > > > +static Sg_device *scsi_generic_get(int dev) > > +{ > > + Sg_device *sdp = NULL; > > + > > + down(&sg_ref_sem); > > + > > + sdp = sg_get_dev (dev); > > + if (!sdp) goto out; > > + > > + if (sdp->detached) { > > + sdp = NULL; > > + goto out; > > + } > > + > > + kref_get(&sdp->kref); > > + > > + if (!sdp->device) > > + goto out_put; > > + > > + if (scsi_device_get(sdp->device)) > > + goto out_put; > > + > > + goto out; > > + > > +out_put: > > + kref_put(&sdp->kref, scsi_generic_release); > > + sdp = NULL; > > +out: > > + up(&sg_ref_sem); > > + return sdp; > > +} > > + > > +static void scsi_generic_put(Sg_device *sdp) > > +{ > > + struct scsi_device *sdev = sdp->device; > > + > > + down(&sg_ref_sem); > > + kref_put(&sdp->kref, scsi_generic_release); > > + scsi_device_put(sdev); > > + up(&sg_ref_sem); > > +} > > + > > static int > > sg_open(struct inode *inode, struct file *filp) > > { > > @@ -235,17 +287,8 @@ sg_open(struct inode *inode, struct file > > > > nonseekable_open(inode, filp); > > SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, > flags=0x%x\n", dev, > > flags)); > > - sdp = sg_get_dev(dev); > > - if ((!sdp) || (!sdp->device)) > > + if (!(sdp = scsi_generic_get(dev))) > > return -ENXIO; > > - if (sdp->detached) > > - return -ENODEV; > > - > > - /* This driver's module count bumped by fops_get in > <linux/fs.h> > > */ > > - /* Prevent the device driver from vanishing while > we sleep */ > > - retval = scsi_device_get(sdp->device); > > - if (retval) > > - return retval; > > > > if (!((flags & O_NONBLOCK) || > > scsi_block_when_processing_errors(sdp->device))) { > > @@ -302,7 +345,7 @@ sg_open(struct inode *inode, struct file > > return 0; > > > > error_out: > > - scsi_device_put(sdp->device); > > + scsi_generic_put(sdp); > > return retval; > > } > > > > @@ -317,13 +360,14 @@ sg_release(struct inode *inode, struct f > > return -ENXIO; > > SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", > > sdp->disk->disk_name)); > > sg_fasync(-1, filp, 0); /* remove filp from async > notification > > list */ > > - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns > 1 when sdp > > gone */ > > - if (!sdp->detached) { > > - scsi_device_put(sdp->device); > > - } > > - sdp->exclude = 0; > > - wake_up_interruptible(&sdp->o_excl_wait); > > - } > > + > > + /* Don't release the device until all IO comes back. */ > > + while (sg_remove_sfp(sdp, sfp)){ > > + msleep(10); > > + } > > + sdp->exclude = 0; > > + wake_up_interruptible(&sdp->o_excl_wait); > > + scsi_generic_put(sdp); > > return 0; > > } > > > > @@ -1238,7 +1285,7 @@ sg_cmd_done(Scsi_Cmnd * SCpnt) > > sfp = srp->parentfp; > > if (sfp) > > sdp = sfp->parentdp; > > - if ((NULL == sdp) || sdp->detached) { > > + if (NULL == sdp) { > > printk(KERN_INFO "sg_cmd_done: device detached\n"); > > scsi_release_request(SRpnt); > > return; > > @@ -1292,18 +1339,8 @@ sg_cmd_done(Scsi_Cmnd * SCpnt) > > scsi_release_request(SRpnt); > > SRpnt = NULL; > > - if (sfp->closed) { /* whoops this fd already released, > > cleanup */ > > - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: > already closed, > > freeing ...\n")); > > - sg_finish_rem_req(srp); > > - srp = NULL; > > - if (NULL == sfp->headrp) { > > - SCSI_LOG_TIMEOUT(1, printk("sg...bh: already > > closed, final cleanup\n")); > > - if (0 == sg_remove_sfp(sdp, sfp)) { /* > > device still present */ > > - scsi_device_put(sdp->device); > > - } > > - sfp = NULL; > > - } > > - } else if (srp && srp->orphan) { > > + > > + if (srp && srp->orphan) { > > if (sfp->keep_orphan) > > srp->sg_io_owned = 0; > > else { > > @@ -1441,6 +1478,7 @@ sg_add(struct class_device *cl_dev) > > } > > k = error; > > sdp = sg_dev_arr[k]; > > + kref_init(&sdp->kref); > > > > devfs_mk_cdev(MKDEV(SCSI_GENERIC_MAJOR, k), > > S_IFCHR | S_IRUSR | S_IWUSR | S_IRGRP, > > @@ -1493,45 +1531,33 @@ sg_remove(struct class_device *cl_dev) > > unsigned long iflags; > > Sg_fd *sfp; > > Sg_fd *tsfp; > > - Sg_request *srp; > > - Sg_request *tsrp; > > - int k, delay; > > + int k; > > > > if (NULL == sg_dev_arr) > > return; > > - delay = 0; > > write_lock_irqsave(&sg_dev_arr_lock, iflags); > > for (k = 0; k < sg_dev_max; k++) { > > - sdp = sg_dev_arr[k]; > > - if ((NULL == sdp) || (sdp->device != scsidp)) > > + if ((NULL == sg_dev_arr[k]) || > > + (sg_dev_arr[k]->device != scsidp)) > > continue; /* dirty but lowers > nesting */ > > + sdp = sg_dev_arr[k]; > > + sdp->detached = 1; > > + > > if (sdp->headfp) { > > - sdp->detached = 1; > > for (sfp = sdp->headfp; sfp; sfp = tsfp) { > > tsfp = sfp->nextfp; > > - for (srp = sfp->headrp; > srp; srp = tsrp) > > { > > - tsrp = srp->nextrp; > > - if (sfp->closed || (0 == > > sg_srp_done(srp, sfp))) > > - > sg_finish_rem_req(srp); > > - } > > - if (sfp->closed) { > > - > scsi_device_put(sdp->device); > > - __sg_remove_sfp(sdp, sfp); > > - } else { > > - delay = 1; > > + if (!(sfp->closed)) { > > > > wake_up_interruptible(&sfp->read_wait); > > kill_fasync(&sfp->async_qp, > > SIGPOLL, > > POLL_HUP); > > } > > } > > SCSI_LOG_TIMEOUT(3, > printk("sg_detach: dev=%d, > > dirty\n", k)); > > - if (NULL == sdp->headfp) { > > - sg_dev_arr[k] = NULL; > > - } > > } else { /* nothing active, simple case */ > > SCSI_LOG_TIMEOUT(3, printk("sg_detach: > > dev=%d\n", k)); > > - sg_dev_arr[k] = NULL; > > } > > + > > + sg_dev_arr[k] = NULL; > > sg_nr_dev--; > > break; > > } > > @@ -1543,14 +1569,32 @@ sg_remove(struct class_device *cl_dev) > > cdev_del(sdp->cdev); > > sdp->cdev = NULL; > > devfs_remove("%s/generic", scsidp->devfs_name); > > - put_disk(sdp->disk); > > - sdp->disk = NULL; > > - if (NULL == sdp->headfp) > > - kfree((char *) sdp); > > + > > + down(&sg_ref_sem); > > + kref_put(&sdp->kref, scsi_generic_release); > > + up(&sg_ref_sem); > > } > > +} > > + > > +/** > > + * scsi_generic_release - Called to free the > Sg_device structure > > + * @kref: pointer to embedded kref > > + * > > + * sg_ref_sem must be held entering this routine. > Because it is > > + * called on last put, you should always use the > > scsi_generic_get() > > + * and scsi_generic_put() helpers which manipulate > the semaphore > > + * directly and never do a direct kref_put(). > > + **/ > > +static void scsi_generic_release(struct kref *kref) > > +{ > > + Sg_device *sdp = to_scsi_generic(kref); > > + struct gendisk *disk = sdp->disk; > > + > > + disk->private_data = NULL; > > + put_disk(disk); > > > > - if (delay) > > - msleep(10); /* dirty detach so delay device > > destruction */ > > + kfree((char *) sdp); > > + return; > > } > > > > /* Set 'perm' (4th argument) to 0 to disable > module_param's definition > > @@ -2484,14 +2528,13 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * > > sg_page_free((char *) sfp, sizeof (Sg_fd)); > > } > > > > -/* Returns 0 in normal case, 1 when detached and sdp > object removed */ > > +/* Returns a count of IOs still in progress. */ > > static int > > sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) > > { > > Sg_request *srp; > > Sg_request *tsrp; > > int dirty = 0; > > - int res = 0; > > > > for (srp = sfp->headrp; srp; srp = tsrp) { > > tsrp = srp->nextrp; > > @@ -2505,30 +2548,16 @@ sg_remove_sfp(Sg_device * sdp, Sg_fd * s > > > > write_lock_irqsave(&sg_dev_arr_lock, iflags); > > __sg_remove_sfp(sdp, sfp); > > - if (sdp->detached && (NULL == sdp->headfp)) { > > - int k, maxd; > > - > > - maxd = sg_dev_max; > > - for (k = 0; k < maxd; ++k) { > > - if (sdp == sg_dev_arr[k]) > > - break; > > - } > > - if (k < maxd) > > - sg_dev_arr[k] = NULL; > > - kfree((char *) sdp); > > - res = 1; > > - } > > write_unlock_irqrestore(&sg_dev_arr_lock, iflags); > > } else { > > - /* MOD_INC's to inhibit unloading sg and associated > > adapter driver */ > > - /* only bump the access_count if we > actually succeeded > > in > > - * throwing another counter on the host module */ > > - scsi_device_get(sdp->device); /* XXX: > retval ignored? > > */ > > + /* The non-zero dirty count will be > returned. This will > > tell > > + the caller not to drop the reference on > this device. > > */ > > + > > sfp->closed = 1; /* flag dirty state > on this fd > > */ > > SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: > worrisome, %d > > writes pending\n", > > dirty)); > > } > > - return res; > > + return dirty; > > } > > > > static int > > - > > : send the line "unsubscribe > linux-scsi" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > - : send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html