Re: Process for severe early stable bugs?

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Mon, 10 Dec 2018 10:51:02 +0100

On Sun, Dec 09, 2018 at 11:44:19AM -0500, Theodore Y. Ts'o wrote:
> On Sun, Dec 09, 2018 at 12:30:39PM +0100, Greg KH wrote:
> > > P.P.P.S.  If I were king, I'd be asking for a huge number of kunit
> > > tests for block-mq to be developed, and then running them under a
> > > Thread Sanitizer.
> > 
> > Isn't that what xfstests and fio are?  Aren't we running this all the time and
> > reporting those issues?  How did this bug not show up on those tests, is
> > it just because they didn't run long enough?
> > 
> > Because of those test suites, I was thinking that the block and
> > filesystem paths were one of the more well-tested things we had at the
> > moment, is this not true?
> 
> I'm pretty confident about the file system paths, and the "happy
> paths" for the block layer.
> 
> But with Kernel Bugzilla #201685, despite huge amounts of testing both
> before and after 4.19-rc1, nothing picked it up.  It turned out to be very
> configuration specific, *and* only happened when you were under heavy
> memory pressure and/or I/O pressure.
> 
> I'm starting to try to use blktests, but it's not as mature as
> xfstests.  It has portability issues, as it assumes a much newer
> userspace.  So I can't even run it under some environments at all.
> The test coverage just isn't as broad.  Compare:
> 
> ext4/4k: 441 tests, 1 failures, 42 skipped, 4387 seconds
>   Failures: generic/388
> 
> Versus:
> 
> Run: block/001 block/002 block/003 block/004 block/005 block/006
>     block/009 block/010 block/012 block/013 block/014 block/015
>     block/016 block/017 block/018 block/020 block/021 block/023
>     block/024 loop/001 loop/002 loop/003 loop/004 loop/005 loop/006
>     nvme/002 nvme/003 nvme/004 nvme/006 nvme/007 nvme/008 nvme/009
>     nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
>     nvme/017 nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024
>     nvme/025 nvme/026 nvme/027 nvme/028 scsi/001 scsi/002 scsi/003
>     scsi/004 scsi/005 scsi/006 srp/001 srp/002 srp/003 srp/004
>     srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failures: block/017 block/024 nvme/002 nvme/003 nvme/008 nvme/009
>     nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
>     nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024 nvme/025
>     nvme/026 nvme/027 nvme/028 scsi/006 srp/001 srp/002 srp/003 srp/004
>     srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failed 37 of 69 tests
> 
> (Most of the failures are test portability issues that I still need to
> work through, not real failures.  But just look at the number of
> tests....)

So you are saying quantity rules over quality?  :)

It's really hard to judge this, given that xfstests are testing a whole
range of other things (POSIX compliance and stressing the vfs api),
while blktests are there to stress the block i/o api/interface.

So it would be best to run both, as we know xfstests also hits the block
layer...

thanks,

greg k-h