Hi,

On Mon, 2013-12-23 at 14:00 +1100, Dave Chinner wrote:
> On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> > On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:
> > [snip]
> > > Ok, buffered writes explain all those symptoms, and that means what
> > > you are seeing is a buffered write/direct IO read race condition.
> > > If the first page is missing data from the direct IO read, that
> > > implies that the inode size has been updated by the buffered write
> > > path, but a direct IO read of zeroes means the data in the page
> > > cache has not been flushed to disk. Given that the direct IO read
> > > path does a filemap_write_and_wait_range() call before issuing the
> > > direct IO, that implies the first page was not flushed to disk by
> > > the cache flush.
> > >
> > > I'm not sure why that may be the case, but that's where I'd
> > > start looking.....
> > >
> > Indeed - I've just spent the last couple of days looking at exactly
> > this :-) I'm shortly going to lose access to the machine I've been doing
> > tests on until the New Year, so progress may slow down for a little
> > while on my side.
> >
> > However there is nothing obvious in the trace. It looks like the write
> > I/O gets pushed out in two chunks, the earlier one containing the "first
> > page" missing block along with other blocks, and the second one is
> > pushed out by the filemap_write_and_wait_range added in the patch I
> > posted yesterday. Both have completed before we send out the read
> > (O_DIRECT) which results in a single I/O to disk - again exactly what
> > I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> > the first one returns short (correctly since we hit EOF) and the second
> > one returns unmapped since it is a request to map the remainder of the
> > file, which is beyond EOF.
>
> Oh, interesting that there are two bmap calls for the read. Have a
> look at do_direct_IO(), specifically the case where we hit the
> buffer_unmapped/do_holes code:
> [snip]
> So, if we hit a hole, we read the inode size to determine if we are
> beyond EOF. If we are beyond EOF, we're done. If not, the block gets
> zeroed, and we move on to the next block.
>
> Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
> holding the i_mutex across the read, which means it is probably
> racing with the buffered write (I'm not sure what locking gfs2 does
> here). Hence if the buffered write completes a page and updates the
> inode size between the point in time that the direct IO read mapped
> a hole and then checked the EOF, it will zero the page rather than
> returning a short read, and then move on to reading the next
> block. If the next block is still in the hole but is beyond EOF it
> will then abort, returning a short read of a zeroed page....
>
> Cheers,
>
> Dave.

Sorry for the delay - I have been working on this and I've now figured
out what's really going on here. I'll describe the GFS2 case first, but
it seems to be something that will affect other filesystems too, most
likely including local ones.

So we have one thread doing writes to a file, 1M at a time, using
O_DIRECT. With GFS2 that results in more or less a fallback to a
buffered write - with one important difference, however: the direct I/O
page invalidations still occur, just as if it were a direct I/O write.
Thread two is doing direct I/O reads. When those reads result in
calling ->direct_IO, all is well.
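For reference, the workload is essentially the sketch below. This is
only illustrative, not the actual test program: the mount point, file
name, chunk count and the simple progress flag are all made up for the
example. The point is just that the reader keeps probing at the
writer's current EOF, so its read position frequently lands exactly on
i_size while the writer is extending the file.

	/* dio-race.c: illustrative sketch only - build with "gcc -pthread dio-race.c" */
	#define _GNU_SOURCE			/* for O_DIRECT */
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define CHUNK    (1024 * 1024)			/* 1M per write */
	#define NCHUNKS  64				/* file grows to 64M */
	#define TESTFILE "/mnt/gfs2/dio-race-test"	/* hypothetical mount point */

	static volatile long chunks_written;		/* writer progress, seen by reader */

	static void *writer(void *arg)
	{
		void *buf;
		int fd = open(TESTFILE, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

		(void)arg;
		if (fd < 0 || posix_memalign(&buf, 4096, CHUNK))
			exit(1);
		memset(buf, 0xaa, CHUNK);		/* non-zero pattern */

		for (long i = 0; i < NCHUNKS; i++) {
			if (pwrite(fd, buf, CHUNK, i * CHUNK) != CHUNK)
				exit(1);
			chunks_written = i + 1;		/* extend EOF, then publish progress */
		}
		close(fd);
		free(buf);
		return NULL;
	}

	static void *reader(void *arg)
	{
		void *buf;
		int fd;
		long done;

		(void)arg;
		while (chunks_written == 0)		/* wait for the file to exist */
			usleep(1000);
		fd = open(TESTFILE, O_RDONLY | O_DIRECT);
		if (fd < 0 || posix_memalign(&buf, 4096, CHUNK))
			exit(1);

		while ((done = chunks_written) < NCHUNKS) {
			/* probe at the writer's current EOF, racing with the next write */
			ssize_t n = pread(fd, buf, CHUNK, done * CHUNK);

			/* any data returned should carry the 0xaa pattern, never zeroes */
			if (n > 0 && ((unsigned char *)buf)[0] == 0)
				fprintf(stderr, "zeroed data in chunk %ld\n", done);
		}
		close(fd);
		free(buf);
		return NULL;
	}

	int main(void)
	{
		pthread_t w, r;

		pthread_create(&w, NULL, writer, NULL);
		pthread_create(&r, NULL, reader, NULL);
		pthread_join(w, NULL);
		pthread_join(r, NULL);
		return 0;
	}

On GFS2 the O_DIRECT writes above go through the buffered fallback
internally, but the direct I/O page invalidation still happens, which
is what opens the window described below.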
GFS2 does all the correct locking to ensure that this is coherent with
the buffered writes. There is a problem, though - in
generic_file_aio_read() we have:

	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
	if (filp->f_flags & O_DIRECT) {
		loff_t size;
		struct address_space *mapping;
		struct inode *inode;

		mapping = filp->f_mapping;
		inode = mapping->host;
		if (!count)
			goto out; /* skip atime */
		size = i_size_read(inode);
		if (pos < size) {
			retval = filemap_write_and_wait_range(mapping, pos,
					pos + iov_length(iov, nr_segs) - 1);
			if (!retval) {
				retval = mapping->a_ops->direct_IO(READ, iocb,
							iov, pos, nr_segs);
			}

The important thing here is that it checks i_size with no locking!
That means it can check i_size, find it equal to the requested read
position (if the read gets in before the next block is written to the
file) and then fall back to a buffered read. That buffered read then
gets a page which is invalidated by the page invalidation that occurs
after the (direct I/O, but fallen back to buffered) write has
completed.

I added some trace_printk() calls to the read path and it is fairly
clear that this is the path being taken when the failure occurs. It
also explains why, despite the disk having been used and reused many
times, the page returned on failure always contains zeros, so it is
not a page that has been read from disk. It likewise explains why I've
only ever seen this for the first page of the read, while the rest of
the read is ok.

Changing the test "if (pos < size) {" to "if (1) {" results in zero
failures from the test program, since it then takes the ->direct_IO
path every time, and that path checks i_size under the glock, in a
race-free manner.

So far as I can tell, this can be just as much a problem for local
filesystems, since they may update i_size at any time too. I note that
XFS appears to wrap this bit of code in its iolock, which probably
explains why XFS doesn't suffer from this issue.

So what is the correct fix? We have several options, I think:

 a) Ignore it, on the basis that it is not important for normal
    workloads.

 b) Simply remove the "if (pos < size) {" test and leave the fs to get
    on with it. (I know that leaves the btrfs test which also uses
    i_size later on, but that doesn't appear to be an issue since it is
    after we've called ->direct_IO.) Is it an issue that this may call
    an extra filemap_write_and_wait_range() in some cases where it
    didn't before? I suspect not, since the extra calls would be for
    cases where the pages were beyond the end of file anyway (according
    to i_size), so a no-op in most cases.

 c) Try to prevent the race between page invalidation in the dio write
    path and the buffered read somehow (how?).

 d) Any other solutions?

I'm leaning towards (b) being the best option at the moment, since it
seems simple to do and looks to have no obvious unwanted side effects,

Steve.