On Wed, 26 Mar 2008 20:32:13 -0500
Mike Christie <michaelc@xxxxxxxxxxx> wrote:

> FUJITA Tomonori wrote:
> > On Wed, 26 Mar 2008 07:36:26 -0700
> > James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On Wed, 2008-03-26 at 23:22 +0900, FUJITA Tomonori wrote:
> >>> On Sat, 22 Mar 2008 11:06:00 -0500
> >>> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>
> >>>> On Tue, 2008-03-11 at 00:36 -0500, Mike Christie wrote:
> >>>>> Mike Christie wrote:
> >>>>>> Pete Wyckoff wrote:
> >>>>>>> I think this used not to happen; not sure. But I changed two things
> >>>>>> This most likely did not happen before 2.6.25-rc* or it broke in
> >>>>>> slightly different ways, because iscsi used to try and do
> >>>>>>
> >>>>>> echo 1 > /sys/block/sdX/device/delete
> >>>>>>
> >>>>>> from userspace instead of calling scsi_remove_target from the kernel.
> >>>>>>
> >>>>>> As you know around 2.6.21, the behavior of doing the echo to the delete
> >>>>>> file changed due to a driver model and scsi change and that broke the
> >>>>>> iscsi tools. The iscsi tools userspace removal was sort of hack in the
> >>>>>> first place and was racey, so we switched to removing devices/target
> >>>>>> like the FC class.
> >>>>>>
> >>>>>>
> >>>>>>> lately. 2.6.25-rc1 to -rc4 and fedora 8 iscsi-initiator-utils (865) to
> >>>>>>> fedora devel (868). Bidi and varlen patches always too.
> >>>>>>>
> >>>>>>> I'll follow with some more variations on this theme. Looks like bsg
> >>>>>>> needs to protect more carefully against the device going away. Any
> >>>>>>> ideas how best to do this? What was the approach in sg?
> >>>>>>>
> >>>>>> I think sg is broken in similar ways. The iser guys have some tests
> >>>>>> cases that have broken sg while IO is outstanding. I am ccing Erez.
> >>>>> Actually one of the problems looks a little different than some of the
> >>>>> problems hit with sg and are caused because we remove the bsg device too
> >>>>> soon.
> >>>>> I think we want to wait until all the references from the
> >>>>> commands/requests are released. The attached patch (untested) moves the
> >>>>> bsg unreg call to the scsi device release fn.
> >>>> Well, this fix is now upstream. However, it's causing all our
> >>>> scsi_devices never to get released, which is a serious regression.
> >>>> We're also doing spurious bsg_unregister_queue() for things that never
> >>>> actually registered one (all scan devices that return DID_NO_CONNECT),
> >>>> but bsg doesn't seem to be complaining about this.
> >>>>
> >>>> The essence of the problem is that bsg_register_queue() takes a ref to
> >>>> the sdev_gendev, so you can't move bsg_unregister_queue() into the
> >>>> release function because nothing ever puts bsg's device ref and so
> >>>> release is never called.
> >>>>
> >>>> Options for fixing this before 2.6.25 are
> >>>>
> >>>>     1. revert the patch
> >>>>     2. Do an additional put for the bsg reference in
> >>>>        __scsi_remove_device (patch below). It's nasty but it preserves
> >>>>        the semantics and does what you want
> >>> After some investigation, this patch doesn't fix the bug that Pete
> >>> reported (I'll send a new patch shortly).
> >>>
> >>> Can you revert the commit 4b6f5b3a993cbe34b4280f252bccc76967c185c8
> >>> instead of merging this?
> >> Sure ... I didn't like the hack either. As long as iSCSI is fine with
> >> the reversion it's the quickest way to fix the problem.
> >
> > How about this? With the commit reversion, I confirmed that this patch
> > fixes the first bug that Pete reported:
> >
> > http://marc.info/?l=linux-scsi&m=120508166505141&w=2
> >
> > I suspect that this could fix the rest too.
> >
> > =
> > From: FUJITA Tomonori <fujita.tomonori@xxxxxxxxxxxxx>
> > Subject: [PATCH] bsg: takes a ref to struct device in fops->open
> >
> > bsg_register_queue() takes a ref to struct device that a caller
> > passes. For example, it takes a ref to the sdev_gendev with scsi
> > devices.
> > However, bsg doesn't takes a ref to it in fops->open. So
> > while an application opens a bsg device, the scsi device that the bsg
> > device holds can go away (bsg also takes a ref to a queue, but it
> > doesn't prevent the device from going away).
> >
> > With this, bsg takes a ref to struct device in fops->open and frees it
> > in fops->release.
> >
>
> It looks like it fixes the life time problem.

With the revert and my patch, all the problems (#1, #2, and #3) seem to
be gone for me.

> My patch was actually supposed to fix #3 and fixing #1 was a side
> affect. Will bsg_release still be called when the device is closed. If
> so then it may not fix #3 because the bsg_release function still needs
> to grab the mutex. Maybe bsg_complete_all_commands just needs to drop
> the mutex while it waits for IO to complete.

I don't hit problem #3. A process holds the mutex while it waits for
I/O completion. But fail_all_commands() makes all the commands fail, so
the process releases the mutex and then bsg_unregister_queue is called.

But yeah, I think we don't need to hold the mutex while waiting for
I/O completion here.
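
To make the lifetime argument concrete, here is a minimal userspace sketch of the refcounting described in the changelog above. It is only a toy model: every toy_* name is hypothetical, it is not the bsg code or the patch itself, and the plain int counter stands in for the kernel's real device refcounting. It shows that a ref taken at open keeps the parent device alive after unregistration until the last close.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: a refcount and a "released" flag. */
struct toy_device {
	int refcount;
	bool released;
};

static void toy_get(struct toy_device *dev)
{
	dev->refcount++;
}

static void toy_put(struct toy_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->released = true;	/* the kernel would run the release fn here */
}

/* Hypothetical stand-in for a bsg class device bound to a parent device. */
struct toy_bsg {
	struct toy_device *parent;
	bool registered;
};

static void toy_bsg_register(struct toy_bsg *bsg, struct toy_device *parent)
{
	toy_get(parent);		/* registration holds one ref on the parent */
	bsg->parent = parent;
	bsg->registered = true;
}

static void toy_bsg_unregister(struct toy_bsg *bsg)
{
	if (!bsg->registered)
		return;			/* spurious unregister: nothing to drop */
	bsg->registered = false;
	toy_put(bsg->parent);		/* drop the registration ref */
}

/* open: take an extra ref so the parent outlives unregistration. */
static int toy_bsg_open(struct toy_bsg *bsg)
{
	if (!bsg->registered)
		return -1;		/* stand-in for an error such as -ENODEV */
	toy_get(bsg->parent);
	return 0;
}

/* release (last close): drop the ref taken in open. */
static void toy_bsg_release(struct toy_bsg *bsg)
{
	toy_put(bsg->parent);
}
```

With one owner ref on the device, the sequence register, open, unregister, owner put leaves the count at 1, so the device is released only when toy_bsg_release() runs. Without the get in toy_bsg_open(), the owner's put would already have released it while the file was still open.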