Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes

Hi,

On Tue, 2013-12-17 at 15:01 +0100, Jan Kara wrote:
> Hi,
> 
> On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > Hi,
> > > > 
> > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > > 
> > > > > Currently we check i_size in the buffered read path after we know
> > > > > the page is uptodate.  But it could return some zero-filled pages to
> > > > > userspace when we do some append dio writes.  We could use the
> > > > > following code snippet to reproduce this problem on an ext2/3/4
> > > > > filesystem.
> > > > > 
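
The snippet itself hasn't been quoted above, but a reproducer along
these lines shows the problem - one thread appends 0xaa-filled blocks
with O_DIRECT while another reads them back with O_DIRECT and complains
if it ever sees zeroes. This is only a sketch, not Zheng's original
program:

/*
 * Sketch of a reproducer along the lines described above (not the
 * original snippet from the patch posting).
 * Build with: gcc -pthread -o dio-race dio-race.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BS	4096		/* block-aligned buffer for O_DIRECT */
#define NBLK	1024

static const char *path;

static void *writer(void *unused)
{
	int fd = open(path, O_WRONLY | O_APPEND | O_DIRECT);
	void *buf;
	int i;

	if (fd < 0 || posix_memalign(&buf, BS, BS))
		return NULL;
	memset(buf, 0xaa, BS);		/* valid data never contains zeroes */
	for (i = 0; i < NBLK; i++)
		if (write(fd, buf, BS) != BS)
			perror("write");
	close(fd);
	return NULL;
}

static void *reader(void *unused)
{
	int fd = open(path, O_RDONLY | O_DIRECT);
	void *buf;
	ssize_t n;
	int i;

	if (fd < 0 || posix_memalign(&buf, BS, BS))
		return NULL;
	for (i = 0; i < NBLK; i++) {
		do {	/* wait for the writer to extend the file */
			n = pread(fd, buf, BS, (off_t)i * BS);
		} while (n == 0);
		if (n != BS)
			break;
		if (memchr(buf, 0, BS))
			printf("zero-filled data in block %d\n", i);
	}
	close(fd);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t w, r;

	path = argc > 1 ? argv[1] : "testfile";
	close(open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644));
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}
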
> > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > the underlying fs has not had a chance to get its own locks and update
> > > > the inode size.
> > >   Hum, this isn't really about the page being uptodate or not, is it?
> > > It is more about the fact that gfs2 updates i_size only from
> > > gfs2_readpage() when obtaining the inode lock. Or do you know of another
> > > filesystem with the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > ocfs2_file_aio_read() before calling generic_file_aio_read(); can't gfs2
> > > do something similar?
> > > 
> > Well we could do that, but I don't think it makes sense really. The
> > point of having the page cache is to put it "in front" of the expensive
> > operations where possible (such as getting cluster locks). I don't think
> > it makes sense to grab and drop a cluster lock on every call to
> > generic_file_aio_read() when, for cached pages, we would not even need
> > to call ->readpage.
>   Well, but in a cluster filesystem I fail to see how you want to
> synchronize reads from the pagecache on one node with a truncate / write on
> another node. Either the modification has to be broadcast to all nodes
> having the file cached, or the read has to check whether the data cached in
> the pagecache is still valid. How does gfs2 solve this?
> 

The same way that every other page cache based filesystem does it. If
there is a cached page that is uptodate, then it can be copied to the
user, and the page lock is used to ensure that its state doesn't change
under us. If there is no cached page, then the VFS has to ask the
filesystem to provide one, and that is when the cluster locks (glocks)
are taken. Those locks are then cached until demand from another node,
or memory pressure, results in them being dropped.
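
In rough pseudocode the shape of that is as follows (heavily
simplified, not the actual mainline read path; copy_page_to_user() is
just a stand-in name for the copy loop):

	/*
	 * Heavily simplified sketch of the read path described above.
	 * copy_page_to_user() is a stand-in, not a real kernel function,
	 * and page allocation/locking details are omitted.
	 */
	page = find_get_page(mapping, index);
	if (page && PageUptodate(page)) {
		/* Fast path: cached page, no glock activity at all */
		copy_page_to_user(page, offset, nr, iov);
	} else {
		/*
		 * Slow path: the VFS asks the filesystem to fill the page,
		 * and gfs2_readpage() takes the glock (shared) to do so.
		 */
		error = mapping->a_ops->readpage(filp, page);
	}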

GFS2's internal locking via the glock layer will invalidate all cached
pages before it drops the glock to a mode other than shared or
exclusive. It will also write back all dirty data when dropping an
exclusive lock. For dio we use the "deferred" mode which is basically a
shared mode that is incompatible with the normal shared mode - it allows
shared caching of metadata, but no caching of data. GFS2 reverts to
buffered I/O in order to deal with cases where we need to allocate
blocks, including file extension.
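
The deferred-mode locking around direct I/O looks roughly like this
(paraphrased from fs/gfs2/aops.c rather than quoted exactly, with the
actual I/O submission omitted):

	struct gfs2_holder gh;
	int error;

	/*
	 * Deferred state: metadata can stay cached cluster-wide, but any
	 * node holding the glock in the normal shared state has to drop
	 * it, invalidating its cached data pages for this inode first.
	 */
	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh);
	error = gfs2_glock_nq(&gh);
	if (error)
		goto out_uninit;
	/* ... submit the direct I/O via __blockdev_direct_IO() here ... */
	gfs2_glock_dq(&gh);
out_uninit:
	gfs2_holder_uninit(&gh);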

> > The problem appears to be that somehow we are getting a page returned
> > when it should not be. So I suspect that the correct place to catch that
> > is in ->readpage(s)
>   There are two things which aren't ideal:
> 1) DIO writes and buffered reads can race, resulting in reads returning
> zero-filled pages. Fixing this corner case would slow down the common case,
> so we decided to put this race in the category of "don't do it when it
> hurts".

Well, if I alter the test program to use buffered reads, then it takes
much longer to hit the race. With the original O_DIRECT reads, it is
not a great surprise that not all the write I/O has actually hit disk
at the time of the read, since as you say there is no synchronization
between the reads and writes.

So that makes sense to me.

> 2) DIO read falls back to buffered read on failure - this is there to
> simplify the life of filesystems which don't support DIO at all or do not
> handle some special cases like DIO from holes, non-block-aligned DIO, etc.
> 
> The combination of these two leads to the problem that a DIO read can
> return a zero-filled page when racing with a DIO write. With enough luck
> this can even lead to exposure of uninitialized data:
>   DIO WRITE           BUFFERED READ
> - allocates blocks
>                       - sees blocks mapped so it reads data
> - submits data write
> - waits for IO completion
> - extends i_size
>                       - sees large i_size so hands over data to userspace
> 
>   Now without proper serialization of reads & writes this is hard to avoid
> (we plan range locks for that but that's still blocked by some mm changes).
> But checking i_size before reading data at least makes the race *much*
> less serious (you have to have two threads modifying i_size in parallel
> with the read running, and since these threads are synchronized on
> i_mutex, some races become impossible; the ones returning 0s are still
> possible but harder to trigger).
> 
Well, I agree with pretty much all of that except the part about adding
in the extra i_size test. It may make the race harder to hit, but it
doesn't eliminate it. Instead it means that we'll have to add a bunch of
extra code to the fast path just to ensure that i_size is up to date.

The reason the i_size check that is already in do_generic_file_read()
is OK is that there is a locked page at the time of the i_size_read()
call, as per the comment:

                /*
                 * i_size must be checked after we know the page is Uptodate.
                 *
                 * Checking i_size after the check allows us to calculate
                 * the correct value for "nr", which means the zero-filled
                 * part of the page is not copied back to userspace (unless
                 * another truncate extends the file - this is desired though).
                 */
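
For context, the code around that comment looks roughly like this
(paraphrased from do_generic_file_read() in mm/filemap.c, so treat it
as illustrative rather than an exact quote):

		isize = i_size_read(inode);
		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
		if (unlikely(!isize || index > end_index)) {
			page_cache_release(page);
			goto out;
		}

		/* nr is the maximum number of bytes to copy from this page */
		nr = PAGE_CACHE_SIZE;
		if (index == end_index) {
			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
			if (nr <= offset) {
				page_cache_release(page);
				goto out;
			}
		}
		nr = nr - offset;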


> Now we could check i_size in ->readpage / ->readpages but conceptually that
> seems like the wrong thing to do since readpage only cares about buffers
> being mapped. OTOH generic_file_aio_read() et al already handle i_size
> checks. And GFS2 seems to be the only fs that has problems with stale
> i_size (OCFS2 and NFS both revalidate it in their ->aio_read callbacks). So
> I'd be happier if we could handle this inside GFS2...
> 
I would argue that GFS2 is doing this the right way in this case though
- it makes no sense to put the i_size check into the fast path, when it
is only required in the case that there is no cached page (i.e. the slow
path). In addition, putting it in the slow path (where the fs can take
its own locks) also fixes the race, rather than just making it more
difficult to hit.

So far as I can tell, the VFS code was originally intended to work in
exactly this way; however, it is not that obvious that this is the case,
so maybe better documentation is required here.

If we did add some GFS2 code to update i_size ahead of
generic_file_aio_read(), then we could only do that by grabbing and
dropping the glock. That is a relatively expensive operation, and it
still doesn't fix the race, since we must drop the glock before calling
generic_file_aio_read() - we must not hold glocks over the copy to
user stage.
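
To make the cost concrete, such a revalidation would have to look
something like the sketch below. This is hypothetical code, not
anything in the tree, although the glock helpers themselves are the
standard gfs2 ones:

/*
 * Hypothetical sketch only - not in the gfs2 tree. Take and drop a
 * shared glock purely to pull in an up to date i_size before the
 * generic read path.
 */
static ssize_t gfs2_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
				  unsigned long nr_segs, loff_t pos)
{
	struct gfs2_inode *ip = GFS2_I(file_inode(iocb->ki_filp));
	struct gfs2_holder gh;
	int error;

	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	if (error)
		return error;
	/*
	 * i_size is now current, but the glock has to be dropped again
	 * before copying to userspace, so the race is only narrowed,
	 * not closed.
	 */
	gfs2_glock_dq_uninit(&gh);

	return generic_file_aio_read(iocb, iov, nr_segs, pos);
}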

Incidentally, for write calls, we do exactly that to update the i_size,
but only in the case that O_APPEND has been used, since we don't need to
do it until we get to write_begin/end otherwise.

I'm going to do a bit more investigation though, as I've not yet managed
to convince myself of exactly where the problem lies.

> > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > avoid the race that you've identified here, if I've understood it
> > > > correctly,
> > >   I don't understand how that would work. Can you elaborate a bit? Here the
> > > problem is i_size gets extended by DIO write during buffered read (which is
> > > a fallback from DIO read) so we return zero-filled page instead of either
> > > data filled page or short read. I don't see where launder_page() would come
> > > into the picture...
> > > 
> > > 								Honza
> > 
> > Ah, sorry. I was thinking the other way around wrt
> > read/write :( However, I still don't think that generic_file_aio_read()
> > is the right place to fix this. I note that XFS seems to pass the test
> > and it also uses mpage_readpage and mpage_readpages as well as
> > generic_file_aio_read() so maybe that is a clue as to where the answer
> > lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
>   It is a good question why XFS doesn't have the problem. Looking into
> the code, XFS doesn't use i_mutex but rather its private rw-sem for
> serialization (XFS_IOLOCK). Even reads hold the semaphore in shared mode,
> and as the test program is written, DIO reads will hold it in exclusive
> mode for a while as well. I guess that provides enough serialization that
> the test program doesn't hit the race (although the race still seems
> possible).
> 
Yes, that sounds quite likely.

> > The other question that we've not really answered is why it is useful to
> > mix buffered and direct I/O to the same file at the same time anyway.
> > While I agree that it ought to work, I'm not convinced that it's
> > enormously useful, which is maybe why it has not been spotted before,
>   As I wrote above, mixing buffered and direct IO isn't really supported
> (although we should fix the data exposure race). But DIO read vs DIO write
> sounds sensible enough to support. And when DIO read falls back to buffered
> read, we have problems...
> 
> 								Honza

Yes, agreed. I was really asking just to see if there was a use case
where this mixed use does make sense, as I couldn't immediately see why
someone would want to do that in the first place,

Steve.


