Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



  Hi,

On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > 
> > > > Currently we check i_size in buffered read path after we know the page
> > > > is update.  But it could return some zero-filled pages to the userspace
> > > > when we do some append dio writes.  We could use the following code
> > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > 
> > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > the underlying fs has not had a chance to get its own locks and update
> > > the inode size.
> >   Hum, this isn't really about page being uptodate or not, is it? It is
> > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > function when obtaining inode lock. Or do you know about other filesystem
> > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > do something similar?
> > 
> Well we could do that, but I don't think it makes sense really. The
> point of having the page cache is to put it "in front" of the expensive
> operations where possible (such as getting cluster locks). I don't think
> it makes sense to grab and drop a cluster lock on every call to
> generic_file_aio_read() when, for cached pages, we would not even need
> to call ->readpage.
  Well, but in a cluster filesystem I fail to see how do you want to
synchronize read from pagecache on one node with truncate / write on another
node. Either the modification has to broadcast to all nodes having the file
cached or read has to check whether the data cached in pagecache is still
valid. How is gfs2 solving this?

> The problem appears to be that somehow we are getting a page returned
> when it should not be. So I suspect that the correct place to catch that
> is in ->readpage(s)
  There are two things which aren't ideal:
1) DIO writes and buffered reads can race resulting in reads returning
zero-filled pages. Fixing this corner case would slow down the common case
so we decided to put this race in the cathegory of "don't do it when it
hurts".
2) DIO read falls back to buffered read on failure - this is there to
simplify life of filesystems which don't support DIO at all or do not
handle some special cases like DIO from holes, not block-aligned DIO etc.

Combination of these two leads to the problem that DIO read can return
zero-filled page when racing with DIO write. With enough luck this can
even lead to exposure of uninitialized data:
  DIO WRITE           BUFFERED READ
- allocates blocks
                      - sees blocks mapped so it reads data
- submits data write
- waits for IO completion
- extends i_size
                      - sees large i_size so hands over data to userspace

  Now without proper serialization of reads & writes this is hard to avoid
(we plan range locks for that but that's still blocked by some mm changes).
But checking i_size before reading data at least makes the race *much*
less serious (you have to have two threads modifying i_size in
parallel to read running and since these threads are synchronized on
i_mutex, some races become impossible, the ones returning 0s are still
possible but harder to trigger).

Now we could check i_size in ->readpage / ->readpages but conceptually that
seems as a wrong thing to do since readpage only cares about buffers being
mapped. OTOH generic_file_aio_read() et al already handles i_size checks.
And GFS2 seems to be the only fs that has problems with stale i_size (OCFS2
and NFS both revalidate it in their \.aio_read callbacks). So I'd be
happier if we could handle this inside GFS2...

> > > I suspect that the correct fix would be to implement ->launder_page to
> > > avoid the race that you've identified here, if I've understood it
> > > correctly,
> >   I don't understand how that would work. Can you elaborate a bit? Here the
> > problem is i_size gets extended by DIO write during buffered read (which is
> > a fallback from DIO read) so we return zero-filled page instead of either
> > data filled page or short read. I don't see where launder_page() would come
> > into the picture...
> > 
> > 								Honza
> 
> Ah, sorry. I was thinking the other way around wrt to
> read/write :( However, I still don't think that generic_file_aio_read()
> is the right place to fix this. I note that XFS seems to pass the test
> and it also uses mpage_readpage and mpage_readpages as well as
> generic_file_aio_read() so maybe that is a clue as to where the answer
> lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
  It is a good question why XFS doesn't have the problem. Looking into
the code, XFS doesn't use i_mutex but rather it's private rw-sem for
serialization (XFS_IOLOCK). Even reads hold the semaphore in shared mode
and as the test program is written, DIO reads will hold it in exclusive
mode for a while as well. A guess that provides enough serialization that
the test program doesn't hit the race (although the race still seems
possible).

> The other question that we've not really answered is why it is useful to
> mix buffered and direct i/o to the same file at the same time anyway.
> While I agree that it ought to work, I'm not convinced that its
> enormously useful, which is maybe why it has not been spotted before,
  As I wrote above, mixing buffered and direct IO isn't really supported
(although we should fix the data exposure race). But DIO read vs DIO write
sounds sensible enough to support. And when DIO read falls back to buffered
read, we have problems...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux