Re: [PATCH] scsi_debug: address races following module load

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jul 23, 2021 at 03:49:41PM -0700, Luis Chamberlain wrote:
> On Sat, May 08, 2021 at 07:07:45PM -0400, Douglas Gilbert wrote:
> > When scsi_debug is loaded as a module with many (simulated) hosts,
> > targets, and devices (LUs), modprobe can take a long time to return.
> > Only a small amount of this time is spent in the scsi_debug_init();
> > the rest is other parts of the kernel reacting to to the appearance
> > of new storage devices. As soon as scsi_debug_init() has completed
> > the user space may call 'rmmod scsi_debug' and this was found to
> > cause race problems as outlined here:
> >     https://bugzilla.kernel.org/show_bug.cgi?id=212337
> > 
> > To reliably generate this race a sysfs parameter called rm_all_hosts
> > was added and the code was strengthened in this area. The main
> > change was to make the count of scsi_debug hosts present an atomic.
> > Then it was found that the handling of the existing add_host
> > parameter needed the same strengthening. Further:
> >    echo -9999 > /sys/bus/pseudo/drivers/scsi_debug/add_host
> > has the same effect as rm_all_hosts so rm_all_hosts was not needed.
> > 
> > To inhibit a race between two invocations of writes to add_host
> > a mutex was added.
> > 
> > The logic to remove (all) hosts is rather crude: it works backwards
> > down a linked lists of hosts. Any pending requests are terminated
> > with DID_NO_CONNECT as are any new requests. In the case where not
> > all hosts are being removed, the ones that remain may have lost
> > requests as just outlined. The lowest numbered host (id) hosts will
> > remain.
> > 
> > To distinguish between resets sent by the SCSI mid-level error
> > handling and newly introduced devices (LUs), this Unit Attention:
> >    power on, reset, or bus reset occurred [0x29,0x0]
> > has been subdivided into that UA for the reset case and this new UA:
> >    power on occurred [0x29,0x1]
> > for the new device (LU) case. This makes debug a little easier to
> > follow when it is turned on (e.g. 'echo 0x1 > opts').
> > 
> > 
> > This patch should apply to lk 5.13.0-rc1 due out tomorrow unless some
> > other patches have been applied to this driver that I'm unaware of.
> > 
> > Signed-off-by: Douglas Gilbert <dgilbert@xxxxxxxxxxxx>
> 
> This is a lot of changes and it still does not address the issues on
> the bugzilla bug report. For starts a loop on modprobe and rmmod fails.
> 
> We need udevadm settle (when availble) in between loads, and then the
> removal simply needs to be patient.

This is a curious overstatement in my testing now but in subtle ways.
The skinny is that this helps *older* kernels when the refcnt is 0.

Turns out that although for linux-next your patch really rarely makes a
difference, on older kernels (by backporting it mysef, say a 5.3
kernel), I see the benefit to your patch when the module refcnt is 0 and
after fstests generic/108 on xfs. Without your patch even if refcnt is 0
you must wait over 6 seconds before you can issue an rmmod if you want
it to suceed with a failure rate of about 1/1026.

In cooking up my fstests patches then, what would be helpful is if you
modified your patch so that it introduces a u32 debugfs file which
userspace can then query if it exits. It would use this to tell if the
quiescing effort is present. Since, this is likely to get better over time,
perhaps have it reflect a version for improvements.

This way if this patch is *not* merged, we'd have to sleep for over 6
seconds if you want a higher failure rate on certain tests.

That said, looking at test generic/108 on v5.3 with xfs with the
xfs_nocrc_512 configuration (-m crc=0,reflink=0,rmapbt=0, -i sparse=0,
-b size=512) should give you an idea of what might be left to do, so
that delays before rmmod can be trimmed further.

To be clear: the delays *are not* needed on linux-next, but on older
kernels they are. This might be telling of some races possible on older
kernels where module removal is *not* possible even though the refcnt is 0.

I'll mention this on my fstests patches.

  Luis



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux