Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 19, 2013 at 12:27:53PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> Sorry for the delay... this has turned out to be a somewhat more
> complicated investigation than I'd first expected. In fact there are
> still a few things I don't yet understand, however I thought I'd let you
> know how I'm getting on in the mean time.
> 
> So I started looking at GFS2, since thats what I'm most familiar with.
> I've found a couple of bugs which I'm about to post patches for,
> although even with those patches GFS2 doesn't pass the test all the
> time, although it does get through the test some of the time, and it
> does last longer than ext4.
> 
> Since I wondered whether I was just lucky running the test against XFS,
> I've run it again several times, and I still have not seen a single
> failure on XFS.

IF this is a failure due to a buffered IO fallback from the direct
IO path, then XFS will never fail because it doesn't ever fall back
to buffered IO. i.e. XFS *always* does direct IO.

Looking at the test code, the appending direct IO write is
effectively a single 1MB IO (i.e. one atomic i_size update after it
completes). Hence if the filesystem doesn't fall back to buffered IO
for appending writes, then the direct IO reads should never read
data between the old EOF and the new EOF until after the new EOF is
reflected in i_size.

> In order to gain a bit more information about the problem, I added a few
> more things to the printf, and in particular I note that under GFS2 I
> see ret (the amount of data read) at various different sizes.

That implies multiple i_size updates during the write, which implies
buffered IO for the writes.

> On ext4,
> ret is always the full 1M buffer size. So I assume that is the
> difference which your patch was intended to cure.
>
> However, I also printed out j, the offset where the first error occurs,
> and in both the ext4 and gfs2 cases, that offset is 0, and after exactly
> 4096 bytes, there is an 'a'. I'd have expected to see a number of pages
> of 'a' followed by zero pages, but instead, I'm seeing a single zero
> page followed by at least one 'a'. I've not extended my instrumentation
> to print out a list of which pages are zero and which 'a', but that is
> an unexpected result.

i.e. the write is being done page by page rather than in chunks
limited by the size of a bio. Again, that implies that buffered
writes, not direct IO writes.

> Some tracing shows that with the additional GFS2 patches, the data does
> get written to disk correctly, ahead of the read which is issued. Also
> since the test does allocating writes, GFS2 will fall back to buffered
> I/O for that, and only the read is direct I/O, so since we see similar
> results in the GFS2 and ext4 cases, this missing first page which is
> common to both looks like it might be related to the read side of
> things.

Ok, buffered writes explain all those symptoms, and that means what
you are seeing is a buffered write/direct IO read race condition.
If the first page is missing data from the direct IO read, that
implies that the inode size has been updated by the buffered write
path, but a direct Io read of zeroes means the data in the page
cache has not been flushed to disk. Given that the direct IO read
path does a filemap_write_and_wait_range() call before issuing the
direct IO, that implies the first page was not flushed to disk by
the cache flush.

I'm not sure why that may be the case, but that's where I'd
start looking.....

> I'm attaching my updated version of your test program, which I'd like to
> add to our test suite in due course, if you have no objections.

IIRC, there's a patch to add it to xfstests.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux