Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes

On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:

[snip]

> > Ok, buffered writes explain all those symptoms, and that means what
> > you are seeing is a buffered write/direct IO read race condition.
> > If the first page is missing data from the direct IO read, that
> > implies that the inode size has been updated by the buffered write
> > path, but a direct IO read of zeroes means the data in the page
> > cache has not been flushed to disk. Given that the direct IO read
> > path does a filemap_write_and_wait_range() call before issuing the
> > direct IO, that implies the first page was not flushed to disk by
> > the cache flush.
> > 
> > I'm not sure why that may be the case, but that's where I'd
> > start looking.....
> > 
> Indeed - I've just spent the last couple of days looking at exactly
> this :-) I'm shortly going to lose access to the machine I've been doing
> tests on until the New Year, so progress may slow down for a little
> while on my side.
> 
> However there is nothing obvious in the trace. It looks like the write
> I/O gets pushed out in two chunks, the earlier one containing the "first
> page" missing block along with other blocks, and the second one is
> pushed out by the filemap_write_and_wait_range added in the patch I
> posted yesterday. Both have completed before we send out the read
> (O_DIRECT) which results in a single I/O to disk - again exactly what
> I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> the first one returns short (correctly since we hit EOF) and the second
> one returns unmapped since it is a request to map the remainder of the
> file, which is beyond EOF.

Oh, interesting that there are two bmap calls for the read. Have a
look at do_direct_IO(), specifically the case where we hit the
buffer_unmapped/do_holes code:

do_holes:
			/* Handle holes */
			if (!buffer_mapped(map_bh)) {
				loff_t i_size_aligned;

				/* AKPM: eargh, -ENOTBLK is a hack */
				if (dio->rw & WRITE) {
					page_cache_release(page);
					return -ENOTBLK;
				}

				/*
				 * Be sure to account for a partial block as the
				 * last block in the file
				 */
				i_size_aligned = ALIGN(i_size_read(dio->inode),
							1 << blkbits);
				if (sdio->block_in_file >=
						i_size_aligned >> blkbits) {
					/* We hit eof */
					page_cache_release(page);
					goto out;
				}
				zero_user(page, block_in_page << blkbits,
						1 << blkbits);
				sdio->block_in_file++;
				block_in_page++;
				goto next_block;
			}

So, if we hit a hole, we read the inode size to determine whether we
are beyond EOF. If we are beyond EOF, we're done. If not, the block
gets zeroed and we move on to the next block.
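
To illustrate the rounding in that EOF check (userspace sketch only,
not kernel code - the 4k block size and 6000 byte file size below are
made up numbers):

#include <stdio.h>

/* same result as the kernel's ALIGN() for power-of-two alignments */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned int blkbits = 12;		/* 4096 byte blocks */
	long long i_size = 6000;		/* assumed file size */
	long long i_size_aligned = ALIGN(i_size, 1LL << blkbits);

	/* blocks 0 and 1 are inside the file, block 2 onwards is EOF */
	for (long long blk = 0; blk < 4; blk++)
		printf("block %lld: %s\n", blk,
		       blk >= (i_size_aligned >> blkbits) ?
		       "beyond EOF, stop" : "inside file, zero the hole");
	return 0;
}

i.e. i_size is rounded up to a whole block, so a partial last block
still counts as inside the file and gets zero-filled if it is a hole.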

Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
holding the i_mutex across the read, which means it is probably
racing with the buffered write (I'm not sure what locking gfs2 does
here). Hence if the buffered write completes a page and updates the
inode size between the time the direct IO read maps a hole and the
time it checks that mapping against EOF, the read will zero the
block rather than returning a short read, and then move on to the
next block. If the next block is still in the hole but is beyond
EOF, it will then abort, returning a short read of zeroed data....
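
FWIW, the access pattern that should be able to hit that window looks
roughly like the sketch below - untested, and the path, block size and
loop counts are made up rather than taken from the test case in this
thread: one thread doing buffered appends, another doing O_DIRECT
reads from offset zero and checking for zeroes inside the range the
read claims to have returned.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	4096			/* assumed fs block size */
#define NBLOCKS	64			/* blocks appended per run */

static const char *path = "/mnt/gfs2/racefile";	/* assumed mount */

/* Buffered appends: each write extends i_size by one block. */
static void *appender(void *arg)
{
	char buf[BLKSZ];
	int fd = open(path, O_WRONLY | O_APPEND);

	(void)arg;
	if (fd < 0)
		return NULL;
	memset(buf, 'a', sizeof(buf));		/* non-zero pattern */
	for (int i = 0; i < NBLOCKS; i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			perror("write");
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid;
	void *buf;
	int fd;

	/* O_DIRECT wants an aligned buffer, offset and length. */
	if (posix_memalign(&buf, BLKSZ, NBLOCKS * BLKSZ))
		return 1;

	fd = open(path, O_CREAT | O_TRUNC | O_RDWR | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	pthread_create(&tid, NULL, appender, NULL);

	/*
	 * Re-read from offset 0 while the file grows and look for
	 * zero bytes inside the range the read says it returned -
	 * that is the symptom described in this thread.
	 */
	for (int pass = 0; pass < 100000; pass++) {
		ssize_t n = pread(fd, buf, NBLOCKS * BLKSZ, 0);

		for (ssize_t off = 0; off < n; off++) {
			if (((char *)buf)[off] == 0) {
				fprintf(stderr, "zeroes at %zd of %zd\n",
					off, n);
				goto out;
			}
		}
	}
out:
	pthread_join(tid, NULL);
	close(fd);
	free(buf);
	return 0;
}

Build with -pthread and point it at a GFS2 mount to match the setup
being discussed here.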

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



