On Mon, Mar 22, 2021 at 12:37:03PM -0400, Eric Whitney wrote:
> * Matthew Wilcox <willy@xxxxxxxxxxxxx>:
> > On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> > > * Matthew Wilcox <willy@xxxxxxxxxxxxx>:
> > > > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > > > failure to a patch landed in the -rc1 merge window:
> > > > >
> > > > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > > >
> > > > Thanks for letting me know. This failure is new to me.
> > >
> > > Sure - it's useful to know that it's new to you. Ted said he's also going
> > > to test XFS with a large number of generic/418 trials which would be a
> > > useful comparison. However, he's had no luck as yet reproducing what I've
> > > seen on his Google compute engine test setup running ext4.
> > >
> > > > I don't understand it; this patch changes the behaviour of buffered reads
> > > > from waiting on a page with a refcount held to waiting on a page without
> > > > the refcount held, then starting the lookup from scratch once the page
> > > > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > > > Either it makes an existing failure easier to hit, or there's a subtle
> > > > bug in the retry logic that I'm not seeing.
> > >
> > > For keeping Murphy at bay I'm rerunning the bisection from scratch just
> > > to make sure I come out at the same patch. The initial bisection looked
> > > clean, but when dealing with a failure that occurs probabilistically it's
> > > easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> > > Ordinarily I like to do that for confirmation.
> >
> > Alas, not easily. I've built a lot on top of it since then. I could
> > probably come up with a moral reversion (and will have to if we can't
> > figure out why it's causing a problem!)
> >
> > > And there's always the chance that a latent ext4 bug is being hit.
> >
> > That would also be valuable information to find out. If this
> > patch is exposing a latent bug, I can't think what it might be.
> >
> > > I'd be very happy to run whatever debugging patches you might want, though
> > > you might want to wait until I've reproduced the bisection result. The
> > > offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> > > running a file system with 4k blocks.
> >
> > As I expected, but thank you for being willing to run debug patches.
> > I'll wait for you to confirm the bisection and then work up something
> > that'll help figure out what's going on.
>
> An update:
>
> I've been able to rerun the bisection on a freshly installed test appliance
> using a kernel built from a new tree using different test hardware. It ended
> at the same patch - "mm/filemap: support readpage splitting a page".
>
> I'd thought I'd been able to reproduce the failure using ext4's 4K block size,
> but my test logs say otherwise, as do a few thousand trial runs of generic/418.
> So, at the moment I can only say definitively that I've seen the failures when
> running with 1k blocks (block size < page size).
>
> I've been able to simplify generic/418 by reducing the set of test cases it
> uses to just "-w", or direct I/O writes. When I modify the test to remove
> just that test case and leaving the remainder, I no longer see test failures.
> A beneficial side effect of the simplification is that it reduces the running
> time by an order of magnitude.
>
> When it fails, it looks like the test is writing via dio a chunk of bytes all
> of which have been set to 0x1 and then reading back via dio the same chunk with
> all those bytes equal to 0x0.
>
> I'm attempting to create a simpler reproducer. generic/418 actually has a
> number of processes simultaneously writing to and reading from disjoint
> regions of the same file, and I'd like to see if the failure is dependent
> upon that behavior, or whether it can be isolated to a single writer/reader.

I've managed to reproduce this bug here now. There are _two_ places in
xfstests that I need to specify "-b 1024" *grumble*.

You've definitely identified the correct commit. I can go back-and-forth
applying it and removing it, and the failure definitely tracks this commit.

I think what's going on is that we have a dio write vs a buffered read.
That is, we're only specifying -w to diotest and not -w -r. So we have
a read-vs-invalidate situation.

What used to happen:

- Look up page in page cache, happens to be there
- Check page is uptodate (it isn't; another thread started IO that
  hasn't finished yet)
- Wait for page to be unlocked when the read finishes
- See if page is now uptodate (it is)
- Copy data from it

What happens now:

- Look up page in page cache, happens to be there
- Check page is uptodate (it isn't; another thread started IO that
  hasn't finished yet)
- Wait for page to be unlocked when the read finishes
- Go back and look up the page again
- Allocate a new page and call readpage
- Wait for IO to complete
- Copy data from it
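
Putting those two flows into rough C-like pseudocode (this is only a
paraphrase of the lists above, not the actual mm/filemap.c code, and the
helper names are approximate), the interesting point is the window in which
a concurrent DIO write can invalidate the page:

        /* old flow (schematic) */
        page = find_get_page(mapping, index);           /* takes a reference */
        if (!PageUptodate(page)) {
                wait_on_page_locked(page);              /* reference still held across the wait */
                /* page is now uptodate */
        }
        copy_page_to_iter(page, ...);                   /* copy from the page we already had */

        /* new flow (schematic) */
retry:
        page = find_get_page(mapping, index);
        if (page && !PageUptodate(page)) {
                put_and_wait_on_page_locked(page);      /* reference dropped before waiting */
                goto retry;                             /* start the lookup from scratch */
        }
        if (!page) {
                /* the DIO write's invalidation may have removed the page meanwhile */
                page = page_cache_alloc(mapping);
                add_to_page_cache_lru(page, mapping, index, mapping_gfp_mask(mapping));
                mapping->a_ops->readpage(file, page);   /* fresh read, can overlap the DIO write */
                wait_on_page_locked(page);
        }
        copy_page_to_iter(page, ...);

In the old flow the reader reuses the page it already holds once it becomes
uptodate; in the new flow the reader can come back, find the page gone, and
issue a fresh ->readpage that overlaps the still-running DIO write.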
So if there's a subtle race in ext4 between DIO writes and ->readpage,
this might be exposing it?

I think this theory is given more credence by going to the parent commit
(480546259811) and applying this:

+++ b/mm/filemap.c
@@ -2301,6 +2301,10 @@ static struct page *filemap_update_page(struct kiocb *iocb, struct file *filp,
                put_page(page);
                return ERR_PTR(error);
        }
+       if (!page->mapping) {
+               put_page(page);
+               return NULL;
+       }
        if (PageUptodate(page))
                return page;

ie refusing to use a truncated page, even if it's still marked uptodate.
That reproduces the g/418 crash within a minute for me.

Note that if the read had come in a few instructions later, it wouldn't've
found a page in the page cache, so this is just widening the window in
which this race can happen.
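
For reference, a minimal single writer/reader version of the scenario Eric
describes might look something like the sketch below. This is purely
illustrative - it is not the test's actual source, and the path, transfer
size, and loop counts are all made up; to match the failing configuration it
would want to run against an ext4 filesystem made with 1k blocks. The idea
is that the file only ever contains 0x1 bytes, so any 0x0 seen by the
buffered reader indicates the lost data:

#define _GNU_SOURCE             /* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ      4096            /* one page; sub-page sizes may matter with 1k blocks */
#define LOOPS   100000

int main(void)
{
        const char *path = "/mnt/test/dio-vs-buffered";        /* made-up path */
        char *wbuf, rbuf[SZ];
        int dfd, bfd, i;

        if (posix_memalign((void **)&wbuf, SZ, SZ))
                return 1;
        memset(wbuf, 0x1, SZ);

        /* Seed the file with 0x1 so a correct read can only ever see 0x1. */
        dfd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (dfd < 0 || pwrite(dfd, wbuf, SZ, 0) != SZ || fsync(dfd) != 0) {
                perror("seed");
                return 1;
        }

        if (fork() == 0) {
                /* writer: keep rewriting the same 0x1 pattern via direct I/O */
                for (i = 0; i < LOOPS; i++)
                        if (pwrite(dfd, wbuf, SZ, 0) != SZ)
                                perror("pwrite");
                _exit(0);
        }

        /* reader: buffered reads of the same range, verifying every byte */
        bfd = open(path, O_RDONLY);
        if (bfd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < LOOPS; i++) {
                ssize_t n = pread(bfd, rbuf, SZ, 0), j;

                for (j = 0; j < n; j++) {
                        if (rbuf[j] != 0x1) {
                                fprintf(stderr, "bad byte 0x%x at offset %zd\n",
                                        rbuf[j], j);
                                wait(NULL);
                                return 1;
                        }
                }
        }
        wait(NULL);
        return 0;
}

Whether a simplified loop like this actually hits the window is exactly the
single writer/reader question Eric raises above.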