Hi,

On Tue 17-12-13 15:32:38, Steven Whitehouse wrote:
> On Tue, 2013-12-17 at 15:01 +0100, Jan Kara wrote:
> > On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > >
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > > >
> > > > > > Currently we check i_size in the buffered read path after we
> > > > > > know the page is uptodate. But it could return some zero-filled
> > > > > > pages to userspace when we do some append dio writes. We could
> > > > > > use the following code snippet to reproduce this problem in an
> > > > > > ext2/3/4 filesystem.
> > > > > >
> > > > > If the page is not uptodate, then neither (potentially) is
> > > > > i_size, since the underlying fs has not had a chance to get its
> > > > > own locks and update the inode size.
> > > >   Hum, this isn't really about the page being uptodate or not, is
> > > > it? It is more about the fact that gfs2 updates i_size only from
> > > > the gfs2_readpage() function when obtaining the inode lock. Or do
> > > > you know about another filesystem having the same problem as gfs2?
> > > > E.g. ocfs2 updates i_size from ocfs2_file_aio_read() before
> > > > calling generic_file_aio_read(); cannot gfs2 do something similar?
> > > >
> > > Well we could do that, but I don't think it makes sense really. The
> > > point of having the page cache is to put it "in front" of the
> > > expensive operations where possible (such as getting cluster locks).
> > > I don't think it makes sense to grab and drop a cluster lock on
> > > every call to generic_file_aio_read() when, for cached pages, we
> > > would not even need to call ->readpage.
> >   Well, but in a cluster filesystem I fail to see how you want to
> > synchronize a read from the pagecache on one node with a truncate /
> > write on another node. Either the modification has to be broadcast to
> > all nodes having the file cached, or the read has to check whether
> > the data cached in the pagecache is still valid. How is gfs2 solving
> > this?
> >
> The same way that every other page cache based filesystem does it. If
> there is a cached page that is uptodate, then it can be copied to the
> user, and the page lock is used to ensure that the state doesn't
> change under us.
  If a page is uptodate, we don't ever take the page lock (see
do_generic_file_read())...
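For reference, the uptodate-page fast path there goes roughly like this
(my paraphrase from memory, with readahead and error handling omitted,
so don't take it as the literal mainline source):

	struct page *page;
	loff_t isize;
	pgoff_t end_index;
	unsigned long nr;

	page = find_get_page(mapping, index);
	if (page && PageUptodate(page)) {
		/*
		 * Fast path: no lock_page() anywhere. We check i_size
		 * only once we know the page is uptodate, and clamp
		 * "nr" so that the zero-filled tail of the last page
		 * is never copied to userspace.
		 */
		isize = i_size_read(inode);
		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
		if (unlikely(!isize || index > end_index))
			goto out;		/* read beyond EOF */

		nr = PAGE_CACHE_SIZE;
		if (index == end_index)		/* last page of file */
			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;

		/* the copy may fault, hence no page lock is held here */
		ret = file_read_actor(desc, page, offset, nr - offset);
	}
	/* slow path: lock the page, call ->readpage(), recheck, ... */

Note that the fast path never calls into the filesystem at all, so a
stale i_size never gets refreshed on it.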
> If there is no cached page, then the vfs has to ask the filesystem to
> provide one, and then the cluster locks (glocks) are taken. Those
> locks are then cached until another node, or memory pressure, results
> in them being dropped.
>
> GFS2's internal locking via the glock layer will invalidate all cached
> pages before it drops the glock to a mode other than shared or
> exclusive. It will also write back all dirty data when dropping an
> exclusive lock. For dio we use the "deferred" mode, which is basically
> a shared mode that is incompatible with the normal shared mode - it
> allows shared caching of metadata, but no caching of data. GFS2
> reverts to buffered I/O in order to deal with cases where we need to
> allocate blocks, including file extension.
  OK, thanks for the explanation. Now I see how you avoid the races I
was thinking about.

> > > The problem appears to be that somehow we are getting a page
> > > returned when it should not be. So I suspect that the correct place
> > > to catch that is in ->readpage(s)
> >   There are two things which aren't ideal:
> > 1) DIO writes and buffered reads can race, resulting in reads
> > returning zero-filled pages. Fixing this corner case would slow down
> > the common case, so we decided to put this race in the category of
> > "don't do it when it hurts".
> Well if I alter the test program to use buffered reads, then it takes
> much longer to hit the race. With the original O_DIRECT reads, it is
> not a great surprise that not all the write i/o has actually hit disk
> at the time of the read, since as you say there is no synchronization
> between the reads and writes.
>
> So that makes sense to me.
>
> > 2) DIO read falls back to buffered read on failure - this is there
> > to simplify the life of filesystems which don't support DIO at all
> > or do not handle some special cases like DIO from holes,
> > non-block-aligned DIO, etc.
> >
> > The combination of these two leads to the problem that a DIO read
> > can return a zero-filled page when racing with a DIO write. With
> > enough luck this can even lead to exposure of uninitialized data:
> >
> > DIO WRITE                     BUFFERED READ
> > - allocates blocks
> >                               - sees blocks mapped so it reads data
> > - submits data write
> > - waits for IO completion
> > - extends i_size
> >                               - sees large i_size so hands over data
> >                                 to userspace
> >
> > Now without proper serialization of reads & writes this is hard to
> > avoid (we plan range locks for that, but that's still blocked by
> > some mm changes). But checking i_size before reading data at least
> > makes the race *much* less serious (you have to have two threads
> > modifying i_size in parallel with the running read, and since these
> > threads are synchronized on i_mutex, some races become impossible;
> > the ones returning 0s are still possible but harder to trigger).
> >
> Well I agree with pretty much all of that except the part about adding
> in the extra i_size test. It may make the race harder to hit, but it
> doesn't eliminate it. Instead it means that we'll have to add a bunch
> of extra code into the fast path just to ensure that it is uptodate.
>
> The reason that the i_size check that is already in
> do_generic_file_read() is ok is that there is a locked page at the
> time of the i_size_read call, as per the comment:
>
> /*
>  * i_size must be checked after we know the page is Uptodate.
>  *
>  * Checking i_size after the check allows us to calculate
>  * the correct value for "nr", which means the zero-filled
>  * part of the page is not copied back to userspace (unless
>  * another truncate extends the file - this is desired though).
>  */
  Note that the page is *not* locked at the time of the i_size check.
And I'm aware of the comment - however, that seems to talk about a case
where truncate() makes the file smaller. If we checked i_size only
before having the page uptodate, we could see the i_size value from
before the truncate and thus return to userspace data that has already
been zeroed out by block_truncate_page() or its equivalent.

Actually, the race still seems to be there, as we could read i_size
before the truncate begins and, if we took a long enough nap just after
that, file_read_actor() could copy data already zeroed out by
block_truncate_page(). That just doesn't seem likely to happen in
practice.

So you are right that we might be better off holding the page lock over
the i_size check and the copy to userspace. However, that isn't really
possible, as it could deadlock - consider an evil program that does:

	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	read(fd, buf, 4096);

In that case file_read_actor() will hit a page fault when copying data
to userspace, and to satisfy the page fault we need the very same page
lock that we would hold over the i_size check. This is the reason why
the write path goes through the hoops of ->write_begin, ->write_end,
and prefaulting pages.
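For completeness, a self-contained version of that evil program might
look like the following (a hypothetical test of mine - the file name is
arbitrary and error checking is omitted for brevity):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf;
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);

		/* make sure there is a page to map and to read */
		ftruncate(fd, 4096);

		buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

		/*
		 * The copy in file_read_actor() faults on 'buf', and the
		 * fault needs the lock of the very page read() is copying
		 * from - so if read held the page lock over the copy,
		 * this would deadlock.
		 */
		read(fd, buf, 4096);

		munmap(buf, 4096);
		close(fd);
		return 0;
	}

Mixing read() with a MAP_SHARED mapping of the same range like this is
unusual but perfectly legal, so the kernel has to cope with it.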
> > Now we could check i_size in ->readpage / ->readpages, but
> > conceptually that seems like a wrong thing to do, since readpage
> > only cares about buffers being mapped. OTOH generic_file_aio_read()
> > et al. already handle the i_size checks. And GFS2 seems to be the
> > only fs that has problems with a stale i_size (OCFS2 and NFS both
> > revalidate it in their .aio_read callbacks). So I'd be happier if we
> > could handle this inside GFS2...
> >
> I would argue that GFS2 is doing this the right way in this case
> though - it makes no sense to put the i_size check into the fast path,
> when it is only required in the case that there is no cached page (i.e.
> the slow path). In addition, putting it in the slow path (where the fs
> can take its own locks) also fixes the race, rather than just making
> it more difficult to hit.
  I disagree here: the i_size check is necessary regardless of whether
the page is uptodate or not. For local filesystems truncate can happen
completely independently of the read...

> So far as I can tell the VFS code was originally intended to work in
> exactly this way, however it is not that obvious that this is the
> case; maybe better documentation is required here.
  Originally, the VFS never thought about clustered filesystems, so
i_size was always uptodate. We only had to take care about truncate
changing i_size while a read was running. And to mitigate problems with
that, we needed to do the check after the page is uptodate.

> If we did add in some GFS2 code to update i_size ahead of
> generic_file_aio_read() then we can only do that by grabbing and
> dropping the glock. That is a relatively expensive operation, and
> still doesn't fix the race, as we must drop the glock before calling
> generic_file_aio_read() since we must not hold glocks over the copy to
> user stage.
  Is it that expensive even if you have the glock already cached?
Because if you have a cached page, you have to have a cached glock as
well, if I understand the GFS2 design right.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR