scsi_debug in fstests and blktests (Was: Re: Fwd: [bug report][bisected] modprob -r scsi-debug take more than 3mins during blktests srp/ tests)

Luis Chamberlain <mcgrof@xxxxxxxxxx> · Thu, 21 Apr 2022 10:53:30 -0700

Moving this discussion to the lists, as we really need to think about
how fstests and blktests use scsi_debug if we want high confidence in
a baseline, without false positives on failures caused by the inability
to remove the scsi_debug module.

This should also apply to other test debug modules like null_blk,
nvme target loop drivers, etc.; it's all the same long term. But yeah,
scsi surely makes this... painful today. In any case, hopefully folks
with other test debug drivers are running tests to ensure you can
always rmmod these modules regardless of what is happening.

On Tue, Apr 12, 2022 at 06:03:40PM -0400, Douglas Gilbert wrote:
> On 2022-04-12 13:48, Luis Chamberlain wrote:
> > On Thu, Apr 07, 2022 at 10:09:54PM -0400, Douglas Gilbert wrote:
> > > Hi,
> > > Is it time to revert this patch?
> > 
> > Upstream kmod will indeed get patched soon with a --patient-remove
> > option. So the issue is that. However, it doesn't mean drivers
> > can't / shouldn't strive to avoid these issues if they can. That is
> > left to drivers to implement / resolve if they want.
> > 
> > In the meantime userspace should change to use the patient removal,
> > and if the upstream kmod doesn't yet have it (note, the code is
> > not yet merged) then tools doing module removal should open code the
> > module removal. I modified fstests to open code the patient
> > module removal in case kmod does not support it. I have a similar patch
> > for blktests but that still requires regression testing on my part. I
> > hope to finish that soon though.
> > 
> > So the answer to your question: it depends on how well you want to deal
> > with these issues for users, or punt the problems to patient removal
> > usage.
> 
> Hi,
> There is a significant amount of work bringing down a driver like scsi_debug.
> Apart from potentially consuming most of the RAM on a box, it also has the
> issue of SCSI commands that are "in flight" when rmmod is called.
> 
> So I think it is approaching impossible to make rmmod scsi_debug the equivalent
> of an atomic operation. There are just too many moving parts, potentially
> moving asynchronously to one another. This is an extremely good test for the
> SCSI/block system, roughly equivalent to losing an HBA that has a lot of disks
> behind it. Will the system stabilize and how long will that take?

I understand. But I really cannot buy "impossible". Impossible I think should
mean a design flaw somewhere.

At least for now I think we should narrow our objectives so that
this is *possible* within the context of fstests and blktests because
otherwise *we really should not be using scsi_debug* for high fidelity
in testing. One of the reasons is that we want to be able to run
fstests or blktests in a loop with confidence so that failures are
real. A failure due to the inability to remove a debug module
makes gaining confidence in a baseline a bit difficult and you'd have
to implement hacks around it.

mcgrof@fulton ~/devel/blktests (git::master)$ git grep _have_scsi_debug tests | wc -l
10

mcgrof@fulton ~/devel/xfstests-dev (git::master)$ git grep _require_scsi_debug tests | wc -l
5

Not insane, but enough for us to care. I think if we *narrow* our
scope to ensuring scsi_debug *can* be removed *at least* with the
patient module remover, we're good.
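
To make "patient removal" concrete, here is a rough shell sketch of what
open coding it could look like. This is only an illustration, not the code
in fstests, blktests or kmod: the function name patient_rmmod, the
SYSFS_MODULE_DIR / RMMOD_CMD variables and the 50 second default are all
made up for this example. It also assumes the module exposes a refcnt file
under /sys/module, which as far as I know is only there when the kernel
supports module unloading.

```shell
#!/bin/sh
# Hypothetical sketch of "patient" module removal. Not the fstests,
# blktests or kmod implementation.
# Both variables are assumptions so the loop can be dry-run in a test.
: "${SYSFS_MODULE_DIR:=/sys/module}"
: "${RMMOD_CMD:=modprobe -r}"

# patient_rmmod <module> [max_wait_seconds]
# Returns 0 once the module is gone, 1 if it is still there after max_wait.
patient_rmmod() {
    module=$1
    max_wait=${2:-50}   # arbitrary default, pick what suits your tests
    # sysfs uses underscores even if the module is named with dashes
    sysname=$(printf '%s' "$module" | tr '-' '_')
    dir="$SYSFS_MODULE_DIR/$sysname"

    elapsed=0
    while [ "$elapsed" -lt "$max_wait" ]; do
        # No sysfs directory: module is not loaded, nothing to do.
        [ -d "$dir" ] || return 0

        # A missing refcnt file (e.g. built-in) is treated as "busy",
        # so we end up timing out rather than claiming success.
        refcnt=$(cat "$dir/refcnt" 2>/dev/null || echo 1)

        # Only try removal once nobody holds a reference. The removal
        # can still fail if something grabs the module in between, so
        # we simply go around the loop again.
        if [ "$refcnt" -eq 0 ] && $RMMOD_CMD "$module"; then
            return 0
        fi

        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A test would then call something like "patient_rmmod scsi_debug 100" and
only treat a non-zero return as a real failure.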

Do you think this is a viable goal for scsi_debug?

> Setting up races between modprobe and rmmod on scsi_debug was certainly not
> top of mind for me.

Oh I get it. But the community has already embraced it for years on
fstests and blktests. So at this point I think we have no other option.

I think one thing we *can* do is *not* use scsi_debug for tests which
*really don't need scsi*.

> Storage systems such as SCSI are a lot better defined
> (and ordered) in the power-up scenario. Even with asynchronous scanning
> (discovery) of devices (even SSDs) it can take 10 plus seconds to bring up
> devices with a lot more handshaking between controller and the storage
> device. And even with SSDs, there is increased power draw during power-up
> (hard disks obviously need to accelerate the medium up to the rated speed).
> That leads to big storage arrays staggering when they apply power to
> different banks of SSDs/disks.

Sure..

> I wonder if anyone has tested building scsi_mod (the SCSI mid-level) as a
> module and tried rmmod on it while, say, a USB key is being read :-)

:)

  Luis