Re: [PATCH v2] xfs: don't zero partial page cache pages during O_DIRECT

On Wed, Aug 20, 2014 at 12:36:49PM +1000, Dave Chinner wrote:
> On Wed, Aug 20, 2014 at 12:19:50PM +1000, Dave Chinner wrote:
> > On Tue, Aug 19, 2014 at 09:54:14PM -0400, Chris Mason wrote:
> > > On 08/19/2014 06:35 PM, Dave Chinner wrote:
> > > > On Tue, Aug 19, 2014 at 03:24:48PM -0400, Chris Mason wrote:
> > > >> On 08/11/2014 09:17 PM, Dave Chinner wrote:
> > > >>> On Sat, Aug 09, 2014 at 08:57:00AM -0400, Chris Mason wrote:
> > > >>>>
> > > >>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > >>>> index 1f66779..023d575 100644
> > > >>>> --- a/fs/xfs/xfs_file.c
> > > >>>> +++ b/fs/xfs/xfs_file.c
> > > >>>> @@ -295,7 +295,8 @@ xfs_file_read_iter(
> > > >>>>  				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
> > > >>>>  				return ret;
> > > >>>>  			}
> > > >>>> -			truncate_pagecache_range(VFS_I(ip), pos, -1);
> > > >>>> +			invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> > > >>>> +					      pos >> PAGE_CACHE_SHIFT, -1);
> > > >>>>  		}
> > > >>>>  		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
> > > >>>>  	}
> > > >>>
> > > >>> I added the WARN_ON_ONCE(ret) check to this and I am seeing it fire
> > > >>> occasionally. It always fires immediately before some other ASSERT()
> > > >>> that fires with a block map/page cache inconsistency. It usually
> > > >>> fires in a test that runs fsx or fsstress. The fsx failures are new
> > > >>> regressions caused by this patch. e.g. generic/263 hasn't failed for
> > > >>> months on any of my systems and this patch causes it to fail
> > > >>> reliably on my 1k block size test config.
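> > > >>>
> > > >>> For reference, the check is of this shape -- a sketch of where
> > > >>> the WARN_ON_ONCE(ret) sits, not the exact hunk I'm running:
> > > >>>
> > > >>> 	ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> > > >>> 					    pos >> PAGE_CACHE_SHIFT, -1);
> > > >>> 	WARN_ON_ONCE(ret);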
> > > >>>
> > > >>> I'm going to assume at this point that this is uncovering some other
> > > >>> existing bug, but it means I'm not going to push this fix until I
> > > >>> understand what is actually happening here. It is possible that what
> > > >>> I'm seeing is related to Brian's collapse range bug fixes, but until
> > > >>> I applied this direct IO patch I'd never seen fsx throw ASSERTs in
> > > >>> xfs_bmap_shift_extents()....
> > > >>>
> > > >>> Either way, more testing and understanding is needed.
> > > >>
> > > >> Do you have the output from fsx and the command line args it used?  For
> > > >> my device, it picks:
> > > >>
> > > >> -r 4096 -t 512 -w 512 -Z
> > > >>
> > > >> And for a blocksize 1024 test I did mkfs.xfs -b size=1024
> > > > 
> > > > I'm running:
> > > > 
> > > > $ mkfs.xfs -f -m crc=1,finobt=1 -b size=1k /dev/vda
> > > > $ mount /dev/vda /mnt/test
> > > > $ ltp/fsx -o 128000   -l 500000 -r 4096 -t 512 -w 512 -Z -d /mnt/test/foo
> > > > 
> > > >> But I can't trigger failures with or without the invalidate_inode_pages2
> > > >> change.  I was hoping to trigger on 3.16, and then jump back to 3.10 +
> > > >> my patch to see if the patch alone was at fault.
> > > > 
> > > > I am seeing failures at operation 1192.
> > > > 
> > > > Yesterday, I found a new class of bufferhead state coherency issues
> > > > to do with EOF handling that are causing the problems. Basically,
> > > > when the page cache marks a page dirty, the generic code marks all
> > > > the buffers on the page dirty, even when they are beyond EOF.
> > > > 
> > > > As a result, when we go to invalidate the page that spans EOF, it
> > > > cannot be invalidated because there are dirty buffers on the page.
> > > > Those buffers persist in that state because they are beyond EOF,
> > > > have no blocks allocated to them, and cannot be written. And so when
> > > > we do a direct IO that spans the current EOF, it now fails to
> > > > invalidate that page and so triggers the warning.
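> > > > 
> > > > The failure path, schematically (simplified from mm/truncate.c
> > > > and fs/buffer.c -- a sketch of the call chain, not verbatim
> > > > code):
> > > > 
> > > > 	invalidate_inode_pages2_range()
> > > > 	  invalidate_complete_page2()
> > > > 	    try_to_release_page()	/* ->releasepage */
> > > > 	      try_to_free_buffers()
> > > > 	        drop_buffers()		/* refuses: a bh is dirty */
> > > > 	=> page is not freed, and the caller returns -EBUSY
> > > > 
> > > > A dirty buffer beyond EOF has no backing block and writeback can
> > > > never clean it, so that refusal is permanent.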
> > > > 
> > > > Worse, it appears that this dirty-beyond-EOF bufferhead state can
> > > > leak into blocks inside the file when the file is extended, leading
> > > > to all sorts of other ASSERT failures (which I've been seeing for a
> > > > while now).
> > > > 
> > > > I've got the above fsx command to run for somewhere between 100k and
> > > > 110k operations with the fixes I currently have, but I haven't found
> > > > the cause of the dirty buffer beyond EOF state leaking into the
> > > > interior of the file from extend operations yet.
> > > > 
> > > > Once I have something that passes a few million fsx ops....
> > > 
> > > I have to admit, I'm not sure where this leaves us in terms of safely
> > > applying my patch to our 3.10 or mainline kernel... Failing to
> > > invalidate the page and zeroing the page are really both wrong.
> > 
> > Well, yes they are, and that's one of the reasons why I wanted to
> > ensure we caught failures. As such, I think this invalidation bug
> > goes back a *long time* because we've never, ever checked for
> > invalidation failure before this patch.
> > 
> > However, I think I've got a self-contained fix that can be
> > backported for the invalidation problem now - it's well past 10
> > million fsx ops at this point with both read and write DIO
> > invalidation fixes included. I'll post my series when I've done
> > some more robust testing on it....
> 
> There are more issues. generic/247 on a 4k block size filesystem just
> triggered an invalidation failure on a direct IO write. That test is
> specifically mixing DIO overwrite with mmap overwrite on the same
> file, so there are clearly more issues in play here than I first
> thought.
> 
> So, more debugging needed to determine if this is a result of the
> usual, currently unsolvable page fault vs DIO races...

OK, it's almost certainly page fault vs writeback races in the
DIO code. The page that is failing invalidation is uptodate, dirty
and mapped, which indicates that it was dirtied again after writeback
occurred. So this is just a demonstration of why mixing mmap and DIO
on the same file is a bad idea, and having the kernel issue a warning
the first time it happens is probably a good thing.
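
For the curious, the pattern generic/247 exercises boils down to
something like the sketch below. This is illustrative only -- not the
actual test -- and the path, file size and loop counts are invented:

/* dio-vs-mmap.c: O_DIRECT overwrites racing with mmap stores. */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define FSZ	(1 << 20)	/* 1MiB, a multiple of the 4k DIO alignment */

int main(void)
{
	int fd = open("/mnt/test/foo", O_RDWR | O_CREAT | O_DIRECT, 0644);
	char *mem, *buf;
	int i;

	if (fd < 0 || ftruncate(fd, FSZ))
		return 1;
	if (posix_memalign((void **)&buf, 4096, FSZ))
		return 1;
	memset(buf, 'a', FSZ);

	mem = mmap(NULL, FSZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		return 1;

	if (fork() == 0) {
		/* child: keep dirtying pages through the mapping */
		for (i = 0; i < 100000; i++)
			mem[(i * 4096) % FSZ] = 'b';
		_exit(0);
	}

	/*
	 * parent: each O_DIRECT write must flush and invalidate the
	 * page cache over this range; a page fault in the child that
	 * lands between the flush and the invalidate re-dirties the
	 * page and makes the invalidation fail -- the race above.
	 */
	for (i = 0; i < 1000; i++)
		if (pwrite(fd, buf, FSZ, 0) != FSZ)
			break;

	wait(NULL);
	return 0;
}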

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
