Hi, On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote: > On Thu, Dec 19, 2013 at 12:27:53PM +0000, Steven Whitehouse wrote: > > Hi, > > > > Sorry for the delay... this has turned out to be a somewhat more > > complicated investigation than I'd first expected. In fact there are > > still a few things I don't yet understand, however I thought I'd let you > > know how I'm getting on in the mean time. > > > > So I started looking at GFS2, since thats what I'm most familiar with. > > I've found a couple of bugs which I'm about to post patches for, > > although even with those patches GFS2 doesn't pass the test all the > > time, although it does get through the test some of the time, and it > > does last longer than ext4. > > > > Since I wondered whether I was just lucky running the test against XFS, > > I've run it again several times, and I still have not seen a single > > failure on XFS. > > IF this is a failure due to a buffered IO fallback from the direct > IO path, then XFS will never fail because it doesn't ever fall back > to buffered IO. i.e. XFS *always* does direct IO. > > Looking at the test code, the appending direct IO write is > effectively a single 1MB IO (i.e. one atomic i_size update after it > completes). Hence if the filesystem doesn't fall back to buffered IO > for appending writes, then the direct IO reads should never read > data between the old EOF and the new EOF until after the new EOF is > reflected in i_size. > Well GFS2 does fall back, not so sure about ext4. > > In order to gain a bit more information about the problem, I added a few > > more things to the printf, and in particular I note that under GFS2 I > > see ret (the amount of data read) at various different sizes. > > That implies multiple i_size updates during the write, which implies > buffered IO for the writes. > Yes, thats exactly what I'd expect to see. I've been tracing whats going on using a mixture of trace_printk and existing trace points. > > On ext4, > > ret is always the full 1M buffer size. So I assume that is the > > difference which your patch was intended to cure. > > > > However, I also printed out j, the offset where the first error occurs, > > and in both the ext4 and gfs2 cases, that offset is 0, and after exactly > > 4096 bytes, there is an 'a'. I'd have expected to see a number of pages > > of 'a' followed by zero pages, but instead, I'm seeing a single zero > > page followed by at least one 'a'. I've not extended my instrumentation > > to print out a list of which pages are zero and which 'a', but that is > > an unexpected result. > > i.e. the write is being done page by page rather than in chunks > limited by the size of a bio. Again, that implies that buffered > writes, not direct IO writes. > > > Some tracing shows that with the additional GFS2 patches, the data does > > get written to disk correctly, ahead of the read which is issued. Also > > since the test does allocating writes, GFS2 will fall back to buffered > > I/O for that, and only the read is direct I/O, so since we see similar > > results in the GFS2 and ext4 cases, this missing first page which is > > common to both looks like it might be related to the read side of > > things. > > Ok, buffered writes explain all those symptoms, and that means what > you are seeing is a buffered write/direct IO read race condition. > If the first page is missing data from the direct IO read, that > implies that the inode size has been updated by the buffered write > path, but a direct Io read of zeroes means the data in the page > cache has not been flushed to disk. Given that the direct IO read > path does a filemap_write_and_wait_range() call before issuing the > direct IO, that implies the first page was not flushed to disk by > the cache flush. > > I'm not sure why that may be the case, but that's where I'd > start looking..... > Indeed - I've just spent the last couple of days looking at exactly this :-) I'm shortly going to lose access to the machine I've been doing tests on until the New Year, so progress may slow down for a little while on my side. However there is nothing obvious in the trace. It looks like the write I/O gets pushed out in two chunks, the earlier one containing the "first page" missing block along with other blocks, and the second one is pushed out by the filemap_write_and_wait_range added in the patch I posted yesterday. Both have completed before we send out the read (O_DIRECT) which results in a single I/O to disk - again exactly what I'd expect to see. There are two calls to bmap for the O_DIRECT read, the first one returns short (correctly since we hit EOF) and the second one returns unmapped since it is a request to map the remainder of the file, which is beyond EOF. That all looks correct to me, and I can't see a difference between the working and broken cases. I did also try the obvious experiment of changing the filemap_write_and_wait_range into a plain filemap_write_and_wait and that made no difference. I also checked that after the error occurs, that the file does land up correctly written, with all bytes set to 'a'. So definitely a race somewhere, rather than a permanent condition. I think I may also have interested Lucas Czerner in taking a look from the ext4 perspective. > > I'm attaching my updated version of your test program, which I'd like to > > add to our test suite in due course, if you have no objections. > > IIRC, there's a patch to add it to xfstests. > > Cheers, > > Dave. Ok, in which case we can easily use it in xfstests. Thanks for the heads up, Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html