Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes

Hi Steven,

On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > 
> > > > Currently we check i_size in the buffered read path after we know the
> > > > page is uptodate.  But it could return zero-filled pages to userspace
> > > > when we do append dio writes.  We could use the following code
> > > > snippet to reproduce this problem in an ext2/3/4 filesystem.
> > > > 
> > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > the underlying fs has not had a chance to get its own locks and update
> > > the inode size.
> >   Hum, this isn't really about the page being uptodate or not, is it? It
> > is more about the fact that gfs2 updates i_size only from the
> > gfs2_readpage() function when obtaining the inode lock. Or do you know of
> > another filesystem having the same problem as gfs2? E.g. ocfs2 updates
> > i_size from ocfs2_file_aio_read() before calling generic_file_aio_read();
> > can't gfs2 do something similar?
> > 
> Well we could do that, but I don't think it makes sense really. The
> point of having the page cache is to put it "in front" of the expensive
> operations where possible (such as getting cluster locks). I don't think
> it makes sense to grab and drop a cluster lock on every call to
> generic_file_aio_read() when, for cached pages, we would not even need
> to call ->readpage.
> 
> The problem appears to be that somehow we are getting a page returned
> when it should not be. So I suspect that the correct place to catch that
> is in ->readpage(s)
> 
> > > I suspect that the correct fix would be to implement ->launder_page to
> > > avoid the race that you've identified here, if I've understood it
> > > correctly,
> >   I don't understand how that would work. Can you elaborate a bit? Here
> > the problem is that i_size gets extended by a DIO write during a buffered
> > read (which is a fallback from a DIO read), so we return a zero-filled
> > page instead of either a data-filled page or a short read. I don't see
> > where launder_page() would come into the picture...
> > 
> > 								Honza
> 
> Ah, sorry. I was thinking the other way around wrt
> read/write :( However, I still don't think that generic_file_aio_read()
> is the right place to fix this. I note that XFS seems to pass the test
> and it also uses mpage_readpage and mpage_readpages as well as
> generic_file_aio_read() so maybe that is a clue as to where the answer
> lies. GFS2 seems to fail the test in the same way as ext3 at the moment.

Yes, xfs passes the test under direct IO.  I suspect the reason is that
xfs serializes these paths with its ilock/iolock.  But that doesn't mean
every file system needs to fix this issue by itself, because the root
cause is in the VFS layer.
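
To make the VFS race concrete, this is roughly what the generic buffered
read path does today (paraphrased and trimmed from do_generic_file_read()
in mm/filemap.c; the annotations are mine, and this is not a verbatim
copy):

page_ok:
	/*
	 * i_size is sampled only after the page is known to be uptodate.
	 * If ->readpage() zero-filled the tail page because it was beyond
	 * EOF at read time, and an append DIO write then extended i_size
	 * before we got here, this check passes and the zero-filled page
	 * is copied out to userspace.
	 */
	isize = i_size_read(inode);
	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
	if (unlikely(!isize || index > end_index)) {
		page_cache_release(page);
		goto out;
	}
	...
	/* otherwise copy this page into the user's buffer */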

> 
> The other question that we've not really answered is why it is useful to
> mix buffered and direct i/o to the same file at the same time anyway.
> While I agree that it ought to work, I'm not convinced that it's
> enormously useful, which is maybe why it has not been spotted before,

Yes, mixing buffered reads and direct writes on the same file does seem
of limited use.  But in our production system the issue appears even
under pure direct read/write, because the DIO read falls back to a
buffered read.  The application has a reader and a writer.  The reader
should read either nothing or content that has already been written by
the writer.  But right now the reader can get a zero-filled page.
Hence the application has to guard against this itself with a mutex or
some other flag, which makes no sense to the userspace programmer.
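
A minimal sketch of that reader/writer pattern (illustrative only: the
file name, block size, and the zero check are made up here, this is not
our production code):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

/* Appender: extends the file with a recognizable non-zero pattern. */
static void *writer(void *arg)
{
	int fd = *(int *)arg;
	void *buf;

	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return NULL;
	memset(buf, 0xaa, BLKSZ);
	while (write(fd, buf, BLKSZ) == BLKSZ)
		;
	return NULL;
}

int main(void)
{
	pthread_t tid;
	void *buf;
	off_t off = 0;
	int wfd = open("testfile", O_WRONLY | O_CREAT | O_APPEND | O_DIRECT, 0644);
	int rfd = open("testfile", O_RDONLY | O_DIRECT);

	if (wfd < 0 || rfd < 0 || posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;

	pthread_create(&tid, NULL, writer, &wfd);
	for (;;) {
		ssize_t n = pread(rfd, buf, BLKSZ, off);

		if (n <= 0)
			continue;	/* nothing written yet: that is fine */
		/* The reader must only ever see the writer's 0xaa pattern. */
		if (memchr(buf, 0, n)) {
			fprintf(stderr, "zero-filled read at offset %lld\n",
				(long long)off);
			return 1;
		}
		off += n;
	}
}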

TBH, we found this issue two years ago but did not take care of it at
the time.  So currently the application just works around it as I said
above.  On top of that, it issues fsync(2) to flush the dirty pages in
order to make sure the data has reached the disk, and the performance
of the application suffers for it.  Now we know that we don't need to
flush the dirty pages, but I think we'd better fix the underlying
problem.
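
For reference, the workaround has roughly this shape (hypothetical
names, single writer assumed; not our actual code): the writer only
publishes the readable length after fsync(2), and the reader never
reads past it.

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static off_t safe_len;			/* bytes published to readers */

void writer_append(int fd, const void *buf, size_t len)
{
	ssize_t n = pwrite(fd, buf, len, safe_len);

	fsync(fd);			/* the flush we now know is unneeded */
	pthread_mutex_lock(&lock);
	if (n > 0)
		safe_len += n;
	pthread_mutex_unlock(&lock);
}

ssize_t reader_read(int fd, void *buf, size_t len, off_t off)
{
	off_t limit;

	pthread_mutex_lock(&lock);
	limit = safe_len;
	pthread_mutex_unlock(&lock);

	if (off >= limit)
		return 0;		/* nothing published yet */
	if ((off_t)(off + len) > limit)
		len = limit - off;
	return pread(fd, buf, len, off);
}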

Regards,
                                                - Zheng