Hi all, Sorry for the delay because of holiday and other issues. On Tue, Jan 14, 2014 at 08:19:01PM +0100, Jan Kara wrote: [...] > > Sorry for the delay - I have been working on this and I've now figured > > out whats really going on here. I'll describe the GFS2 case first, but > > it seems that it is something that will affect other fs too, including > > local fs most likely. > > > > So we have one thread doing writes to a file, 1M at a time, using > > O_DIRECT. Now with GFS2 that results in more or less a fallback to a > > buffered write - there is one important difference however, and that is > > that the direct i/o page invalidations still occur, just as if it were a > > direct i/o write. > > > > Thread two is doing direct i/o reads. When this results in calling > > ->direct_IO, all is well. GFS2 does all the correct locking to ensure > > that this is coherent with the buffered writes. There is a problem > > though - in generic_file_aio_read() we have: > > > > /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ > > if (filp->f_flags & O_DIRECT) { > > loff_t size; > > struct address_space *mapping; > > struct inode *inode; > > > > mapping = filp->f_mapping; > > inode = mapping->host; > > if (!count) > > goto out; /* skip atime */ > > size = i_size_read(inode); > > if (pos < size) { > > retval = filemap_write_and_wait_range(mapping, pos, > > pos + iov_length(iov, nr_segs) - 1); > > if (!retval) { > > retval = mapping->a_ops->direct_IO(READ, iocb, > > iov, pos, nr_segs); > > } > > > > The important thing here is that it is checking the i_size with no > > locking! This means that it can check i_size, find its the same as the > > request read position (if the read gets in first before the next block > > is written to the file) and then fall back to buffered read. > > > > That buffered read then gets a page which is invalidated by the page > > invalidation after the (DIO, but fallen back to buffered) write has > > occurred. I added some trace_printk() calls to the read path and it is > > fairly clear that is the path thats being taken when the failure occurs. > > Also, it explains why, despite the disk having been used and reused many > > times, the page that is returned from the failure is always 0, so its > > not a page thats been read from disk. It also explains why I've only > > ever seen this for the first page of the read, and the rest of the read > > is ok. > Yes, race like this is something Zheng and I have been talking about from > the beginning. Thanks for tracking down the details. > > > Changing the test "if (pos < size) {" to "if (1) {" results in zero > > failures from the test program, since it takes the ->direct_IO path each > > time which then checks the i_size under the glock, in a race free > > manner. > > > > So far as I can tell, this can be just as much a problem for local > > filesystems, since they may update i_size at any time too. I note that > > XFS appears to wrap this bit of code under its iolock, so that probably > > explains why XFS doesn't suffer from this issue. > > > > So what is the correct fix... we have several options I think: > > > > a) Ignore it on the basis that its not important for normal workloads > Zheng has a real world usecase which hits this race and it does look > rather normal. So don't like a) very much. Yes, as far as I know, at least there are two applications that has met this problem in our product system at Taobao. So that is why I dig into this bug and hope it can be fixed. > > > b) Simply remove the "if (pos < size) {" test and leave the fs to get > > on with it (I know that leaves the btrfs test which also uses i_size > > later on, but that doesn't appear to be an issue since its after we've > > called ->direct_IO). Is it an issue that this may call an extra > > filemap_write_and_wait_range() in some cases where it didn't before? I > > suspect not since the extra calls would be for cases where the pages > > were beyond the end of file anyway (according to i_size), so a no-op in > > most cases. I try this approach and I can confirm that it is ok for ext4. Dave, I have send out a patch for xfstests to add a new testcase according to this bug but I don't get any response [1]. Could you please review it? Thanks. 1. http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58428 Regards, - Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html