On Wed, 2018-04-11 at 12:22 -0700, Anatoliy Glagolev wrote: > Hannes, James, thanks a lot for taking a look! > > On the problem the patch is solving: it is in the "Description" part > of my initial e-mail. If you agree that a Scsi_Host may be around > after a driver has unloaded, the problem applies to any driver > creating a new Scsi_Host. No, I don't agree: as I said, the template is part of the module and the module should be reference counted. Any use after free of the template means there's a refcounting bug somewhere. > I fixed it in qla2xxx to illustrate the usage of the new function > and scsi_host_template's flag; also, qla2xxx is where I actually > observe crashes. Other drivers may do the same if they want to > address the problem. > > Here are details on the qla2xxx crash repro, if that is what you were > asking about. If I run "qaucli" utility that retrieves some info from > the driver via SCSI mid-layer, and unload the driver in parallel, the > kernel crashes with the following stack: > > [16834.636216,07] Call Trace: > ... > scsi_proc_hostdir_rm > [16834.641944,07] [<ffffffff8141723f>] > scsi_host_dev_release+0x3f/0x130 > [16834.647740,07] [<ffffffff813e4f82>] device_release+0x32/0xa0 > [16834.653423,07] [<ffffffff812dc6c7>] kobject_cleanup+0x77/0x190 > [16834.659002,07] [<ffffffff812dc585>] kobject_put+0x25/0x50 > [16834.664430,07] [<ffffffff813e5277>] put_device+0x17/0x20 > [16834.669740,07] [<ffffffff812d0334>] > bsg_kref_release_function+0x24/0x30 > [16834.675007,07] [<ffffffff812d14a6>] bsg_release+0x166/0x1d0 > [16834.680148,07] [<ffffffff8119ba2b>] __fput+0xcb/0x1d0 > [16834.685156,07] [<ffffffff8119bb6e>] ____fput+0xe/0x10 > [16834.690017,07] [<ffffffff81077476>] task_work_run+0x86/0xb0 > [16834.694781,07] [<ffffffff81057043>] > exit_to_usermode_loop+0x6b/0x9a > [16834.699466,07] [<ffffffff81002875>] > syscall_return_slowpath+0x55/0x60 > [16834.704110,07] [<ffffffff8172d615>] > int_ret_from_sys_call+0x25/0x9f This one's a bit baffling: open of the bsg device should have already taken the module reference. What was the actual error: NULL deref? The thing which is supposed to hold the module is the device open/close which does scsi_device_put on sd_release ... unless this is some sort of non-scsi device and qlogic forgot how to refcount? > On refcount for scsi_host_template: valid point, I did consider it. > Existing drivers allocate scsi_host_template statically. We cannot > change them all at once. So we have to allow 2 ways of allocating > scsi_host_template: the dynamic one with refcounts and the static one > for legacy driver support. That is kind of ugly, too. In addition, > having a refcounted scsi_host_template after driver unload is > confusing: the memory of scsi_host_template is OK, but any attempt to > call a method from the template still causes a crash. No, the static template already is part of the module so it should be refcounted as a module reference. James