On Mon, 2020-09-07 at 17:32 -0400, Douglas Gilbert wrote:
> On 2020-09-07 1:46 p.m., James Bottomley wrote:
> > On Mon, 2020-09-07 at 17:47 +0200, Tomas Henzl wrote:
> > > During an async scan the driver shost->hostt structures are used,
> > > that may cause issues when the driver is removed at that time.
> > > As protection take the module reference.
> >
> > Can I just ask what issues? Today, our module model is that
> > scsi_device_get() bumps the module refcount and therefore makes the
> > module ineligible to be removed. scsi_host_get() doesn't do this
> > because the way the host model is supposed to be coded, we can call
> > remove at any time but the module won't get freed until the last
> > put of the host. I can see we have a potential problem with
> > scsi_forget_host() racing with the async scan thread ... is that
> > what you see? What's supposed to happen is that scsi_device_get()
> > starts failing as soon as the module begins it's exit routine, so
> > if a scan is in progress, it can't add any new devices ... in
> > theory this means that the list is stable for scsi_forget_host(),
> > so knowing how that assumption is breaking would be useful.
>
> James,
> If you think it is bullet-proof try using

I'm not saying it's got no bugs, just that the above is the way it's
supposed to work.

> CONFIG_DEBUG_TEST_DRIVER_REMOVE=y .

The problem with this option is it basically gives you a thundering
herd of removal reinsertions ... trying to do it for a single driver
(or set of drivers) is likely a better way to get actionable debugging
information.

> John Garry reported that:
>
> # insmod scsi_debug.ko
>
> Gave errors like this:
>
> [ 140.115244] debugfs: Directory 'sde' with parent 'block' already
> present!
> [ 140.376426] debugfs: Directory 'sde' with parent 'block' already
> present!
> [ 140.420613] sd 3:0:0:0: [sde] tag#40 access beyond end of device
> [ 140.426655] blk_update_request: I/O error, dev sde, sector 15984
> op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
> [ 140.437319] sd 3:0:0:0: [sde] tag#41 access beyond end of device
> [ 140.443368] blk_update_request: I/O error, dev sde, sector 15984
> op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> ...
>
> Which wasn't the scsi_debug driver directly as it doesn't use
> debugfs. So I suspect something is rotten in the mid-level.
>
> When I tried to replicate John's config I couldn't even boot my
> Ubuntu 20.04 based system (with a MKP kernel). Seemed to fail/lockup
> before any kernel prints came out to the serial port (yes, still
> useful), perhaps in initrd. I'm guessing another, non-SCSI module
> caused the lockup. So I gave up and turned off that config setting.

If that can be distilled down to a better test case, I can look into
it.

James
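
For illustration only, the refcount model described above (device gets
pin the module and start failing once exit begins; host gets do not
touch the module refcount) can be sketched as a toy user-space C model.
This is not kernel code: every toy_* name is hypothetical, and the
model assumes nothing beyond the behaviour stated in this thread.

```c
#include <stdbool.h>

/* Toy user-space model of the intended behaviour; NOT kernel code.
 * All toy_* names are hypothetical and exist only for this sketch. */
struct toy_module {
	int refcount;	/* references that pin the module */
	bool going;	/* set once the exit routine has begun */
};

struct toy_host {
	struct toy_module *owner;
	int host_refs;	/* host lifetime, independent of module refs */
};

/* Stand-in for a "try get" on the module: fails once exit has begun. */
static bool toy_module_tryget(struct toy_module *m)
{
	if (m->going)
		return false;
	m->refcount++;
	return true;
}

static void toy_module_put(struct toy_module *m)
{
	m->refcount--;
}

/* Stand-in for scsi_device_get(): pins the owning module. */
static bool toy_device_get(struct toy_host *h)
{
	return toy_module_tryget(h->owner);
}

static void toy_device_put(struct toy_host *h)
{
	toy_module_put(h->owner);
}

/* Stand-in for scsi_host_get(): takes a host ref, no module ref. */
static void toy_host_get(struct toy_host *h)
{
	h->host_refs++;
}

/* Removal may only begin while nothing pins the module. */
static bool toy_module_begin_exit(struct toy_module *m)
{
	if (m->refcount > 0)
		return false;
	m->going = true;
	return true;
}
```

In this model a held device reference blocks toy_module_begin_exit(),
while a held host reference does not; once exit has begun, every later
toy_device_get() fails, which is the property that is supposed to keep
the device list stable for scsi_forget_host().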