Hi,

On Mon, 2013-12-23 at 14:00 +1100, Dave Chinner wrote:
> On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> > On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:
> > [snip]
> > > Ok, buffered writes explain all those symptoms, and that means what
> > > you are seeing is a buffered write/direct IO read race condition.
> > > If the first page is missing data from the direct IO read, that
> > > implies that the inode size has been updated by the buffered write
> > > path, but a direct IO read of zeroes means the data in the page
> > > cache has not been flushed to disk. Given that the direct IO read
> > > path does a filemap_write_and_wait_range() call before issuing the
> > > direct IO, that implies the first page was not flushed to disk by
> > > the cache flush.
> > >
> > > I'm not sure why that may be the case, but that's where I'd
> > > start looking.....
> > >
> > Indeed - I've just spent the last couple of days looking at exactly
> > this :-) I'm shortly going to lose access to the machine I've been doing
> > tests on until the New Year, so progress may slow down for a little
> > while on my side.
> >
> > However there is nothing obvious in the trace. It looks like the write
> > I/O gets pushed out in two chunks, the earlier one containing the "first
> > page" missing block along with other blocks, and the second one is
> > pushed out by the filemap_write_and_wait_range added in the patch I
> > posted yesterday. Both have completed before we send out the read
> > (O_DIRECT) which results in a single I/O to disk - again exactly what
> > I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> > the first one returns short (correctly since we hit EOF) and the second
> > one returns unmapped since it is a request to map the remainder of the
> > file, which is beyond EOF.
>
> Oh, interesting that there are two bmap calls for the read. Have a
> look at do_direct_IO(), specifically the case where we hit the
> buffer_unmapped/do_holes code:
> [snip]
> So, if we hit a hole, we read the inode size to determine if we are
> beyond EOF. If we are beyond EOF, we're done. If not, the block gets
> zeroed, and we move on to the next block.
>
> Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
> holding the i_mutex across the read, which means it is probably
> racing with the buffered write (I'm not sure what locking gfs2 does
> here). Hence if the buffered write completes a page and updates the
> inode size between the point in time that the direct IO read mapped
> a hole and then checked the EOF, it will zero the page rather than
> returning a short read, and then move on to reading the next
> block. If the next block is still in the hole but is beyond EOF it
> will then abort, returning a short read of a zeroed page....
>
> Cheers,
>
> Dave.

Sorry for the delay - I have been working on this and I've now figured
out what's really going on here. I'll describe the GFS2 case first, but
it seems to be something that will affect other filesystems too, most
likely including local ones.

So we have one thread doing writes to a file, 1M at a time, using
O_DIRECT. With GFS2 that results in more or less a fallback to a
buffered write - with one important difference, however: the direct I/O
page invalidations still occur, just as if it were a direct I/O write.
Thread two is doing direct I/O reads. When those reads result in
calling ->direct_IO, all is well.
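For reference, the workload is essentially the sketch below. This is
only illustrative, not the actual test program: the mount point, file
name, chunk count and the simple progress flag are all made up for the
example. The point is just that the reader keeps probing at the
writer's current EOF, so its read position frequently lands exactly on
i_size while the writer is extending the file.

	/* dio-race.c: illustrative sketch only - build with "gcc -pthread dio-race.c" */
	#define _GNU_SOURCE			/* for O_DIRECT */
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define CHUNK    (1024 * 1024)			/* 1M per write */
	#define NCHUNKS  64				/* file grows to 64M */
	#define TESTFILE "/mnt/gfs2/dio-race-test"	/* hypothetical mount point */

	static volatile long chunks_written;		/* writer progress, seen by reader */

	static void *writer(void *arg)
	{
		void *buf;
		int fd = open(TESTFILE, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

		(void)arg;
		if (fd < 0 || posix_memalign(&buf, 4096, CHUNK))
			exit(1);
		memset(buf, 0xaa, CHUNK);		/* non-zero pattern */

		for (long i = 0; i < NCHUNKS; i++) {
			if (pwrite(fd, buf, CHUNK, i * CHUNK) != CHUNK)
				exit(1);
			chunks_written = i + 1;		/* extend EOF, then publish progress */
		}
		close(fd);
		free(buf);
		return NULL;
	}

	static void *reader(void *arg)
	{
		void *buf;
		int fd;
		long done;

		(void)arg;
		while (chunks_written == 0)		/* wait for the file to exist */
			usleep(1000);
		fd = open(TESTFILE, O_RDONLY | O_DIRECT);
		if (fd < 0 || posix_memalign(&buf, 4096, CHUNK))
			exit(1);

		while ((done = chunks_written) < NCHUNKS) {
			/* probe at the writer's current EOF, racing with the next write */
			ssize_t n = pread(fd, buf, CHUNK, done * CHUNK);

			/* any data returned should carry the 0xaa pattern, never zeroes */
			if (n > 0 && ((unsigned char *)buf)[0] == 0)
				fprintf(stderr, "zeroed data in chunk %ld\n", done);
		}
		close(fd);
		free(buf);
		return NULL;
	}

	int main(void)
	{
		pthread_t w, r;

		pthread_create(&w, NULL, writer, NULL);
		pthread_create(&r, NULL, reader, NULL);
		pthread_join(w, NULL);
		pthread_join(r, NULL);
		return 0;
	}

On GFS2 the O_DIRECT writes above go through the buffered fallback
internally, but the direct I/O page invalidation still happens, which
is what opens the window described below.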
GFS2 does all the correct locking to ensure that this is coherent with
the buffered writes. There is a problem, though - in
generic_file_aio_read() we have:

	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
	if (filp->f_flags & O_DIRECT) {
		loff_t size;
		struct address_space *mapping;
		struct inode *inode;

		mapping = filp->f_mapping;
		inode = mapping->host;
		if (!count)
			goto out; /* skip atime */
		size = i_size_read(inode);
		if (pos < size) {
			retval = filemap_write_and_wait_range(mapping, pos,
					pos + iov_length(iov, nr_segs) - 1);
			if (!retval) {
				retval = mapping->a_ops->direct_IO(READ, iocb,
							iov, pos, nr_segs);
			}

The important thing here is that it checks i_size with no locking!
That means it can check i_size, find it equal to the requested read
position (if the read gets in before the next block is written to the
file) and then fall back to a buffered read. That buffered read then
gets a page which is invalidated by the page invalidation that occurs
after the (direct I/O, but fallen back to buffered) write has
completed.

I added some trace_printk() calls to the read path and it is fairly
clear that this is the path being taken when the failure occurs. It
also explains why, despite the disk having been used and reused many
times, the page returned on failure always contains zeros, so it is
not a page that has been read from disk. It likewise explains why I've
only ever seen this for the first page of the read, while the rest of
the read is ok.

Changing the test "if (pos < size) {" to "if (1) {" results in zero
failures from the test program, since it then takes the ->direct_IO
path every time, and that path checks i_size under the glock, in a
race-free manner.

So far as I can tell, this can be just as much a problem for local
filesystems, since they may update i_size at any time too. I note that
XFS appears to wrap this bit of code in its iolock, which probably
explains why XFS doesn't suffer from this issue.

So what is the correct fix? We have several options, I think:

 a) Ignore it, on the basis that it is not important for normal
    workloads.

 b) Simply remove the "if (pos < size) {" test and leave the fs to get
    on with it. (I know that leaves the btrfs test which also uses
    i_size later on, but that doesn't appear to be an issue since it is
    after we've called ->direct_IO.) Is it an issue that this may call
    an extra filemap_write_and_wait_range() in some cases where it
    didn't before? I suspect not, since the extra calls would be for
    cases where the pages were beyond the end of file anyway (according
    to i_size), so a no-op in most cases.

 c) Try to prevent the race between page invalidation in the dio write
    path and the buffered read somehow (how?).

 d) Any other solutions?

I'm leaning towards (b) being the best option at the moment, since it
seems simple to do and looks to have no obvious unwanted side effects,

Steve.