Sent out a new patch with "bsg referencing bus driver module" in subj. On Wed, Apr 11, 2018 at 4:37 PM, Anatoliy Glagolev <glagolig@xxxxxxxxx> wrote: > On "what was the actual error": it is deref of an invalid address, not > NULL. Attaching crash dump analysis for the reference. > > On module reference count: good point. I decided against it at first, > but I can reconsider. "modprobe -r qla2xxx" will fail if there is an > extra reference to the module, and the module_exit function will not > even run, right? Waiting for references to go away would be more > convenient for me. But I can see why the module reference count is a > better approach in general. I can work around and retry "modprobe -r > qla2xxx" multiple times in my scripts. > > I think that it is still a SCSI mid-layer job to do the references. > There is no way qla2xxx can reference itself and then dereference at > the right time. > > qla2xxx (or any other driver) provides a pointer to its module in > scsi_host_template when it requests Scsi_Host creation. As far as I > can see, no one ever takes a reference on that module. SCSI mid-layer > just relies on the module to be around. Scsi_Host is a device itself; > that is the device that is referenced on open/close from user mode, > and not the bus driver that triggered the Scsi_Host creation. > > SCSI mid layer taking a reference on the template's module at > Scsi_Host creation in scsi_host_alloc(..) and dropping it in > scsi_host_dev_release (called when the last reference to Scsi_Host is > gone) will not work. Assuming that the module_exit function does not > run at an attempt to unload a referenced module, qla2xxx's Scsi_Host-s > corresponding to the adapter's ports will stay forever. > > Let me think more about it; the idea is to intercept open/close at > Scsi_Host and increment/decrement module reference at that time. > > Thanks a lot for the input! > > > On Wed, Apr 11, 2018 at 1:12 PM, James Bottomley > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote: >> On Wed, 2018-04-11 at 12:22 -0700, Anatoliy Glagolev wrote: >>> Hannes, James, thanks a lot for taking a look! >>> >>> On the problem the patch is solving: it is in the "Description" part >>> of my initial e-mail. If you agree that a Scsi_Host may be around >>> after a driver has unloaded, the problem applies to any driver >>> creating a new Scsi_Host. >> >> No, I don't agree: as I said, the template is part of the module and >> the module should be reference counted. Any use after free of the >> template means there's a refcounting bug somewhere. >> >>> I fixed it in qla2xxx to illustrate the usage of the new function >>> and scsi_host_template's flag; also, qla2xxx is where I actually >>> observe crashes. Other drivers may do the same if they want to >>> address the problem. >>> >>> Here are details on the qla2xxx crash repro, if that is what you were >>> asking about. If I run "qaucli" utility that retrieves some info from >>> the driver via SCSI mid-layer, and unload the driver in parallel, the >>> kernel crashes with the following stack: >>> >>> [16834.636216,07] Call Trace: >>> ... >>> scsi_proc_hostdir_rm >>> [16834.641944,07] [<ffffffff8141723f>] >>> scsi_host_dev_release+0x3f/0x130 >>> [16834.647740,07] [<ffffffff813e4f82>] device_release+0x32/0xa0 >>> [16834.653423,07] [<ffffffff812dc6c7>] kobject_cleanup+0x77/0x190 >>> [16834.659002,07] [<ffffffff812dc585>] kobject_put+0x25/0x50 >>> [16834.664430,07] [<ffffffff813e5277>] put_device+0x17/0x20 >>> [16834.669740,07] [<ffffffff812d0334>] >>> bsg_kref_release_function+0x24/0x30 >>> [16834.675007,07] [<ffffffff812d14a6>] bsg_release+0x166/0x1d0 >>> [16834.680148,07] [<ffffffff8119ba2b>] __fput+0xcb/0x1d0 >>> [16834.685156,07] [<ffffffff8119bb6e>] ____fput+0xe/0x10 >>> [16834.690017,07] [<ffffffff81077476>] task_work_run+0x86/0xb0 >>> [16834.694781,07] [<ffffffff81057043>] >>> exit_to_usermode_loop+0x6b/0x9a >>> [16834.699466,07] [<ffffffff81002875>] >>> syscall_return_slowpath+0x55/0x60 >>> [16834.704110,07] [<ffffffff8172d615>] >>> int_ret_from_sys_call+0x25/0x9f >> >> This one's a bit baffling: open of the bsg device should have already >> taken the module reference. What was the actual error: NULL deref? >> >> The thing which is supposed to hold the module is the device open/close >> which does scsi_device_put on sd_release ... unless this is some sort >> of non-scsi device and qlogic forgot how to refcount? >> >>> On refcount for scsi_host_template: valid point, I did consider it. >>> Existing drivers allocate scsi_host_template statically. We cannot >>> change them all at once. So we have to allow 2 ways of allocating >>> scsi_host_template: the dynamic one with refcounts and the static one >>> for legacy driver support. That is kind of ugly, too. In addition, >>> having a refcounted scsi_host_template after driver unload is >>> confusing: the memory of scsi_host_template is OK, but any attempt to >>> call a method from the template still causes a crash. >> >> No, the static template already is part of the module so it should be >> refcounted as a module reference. >> >> James >>