Hi,

On Tue 17-12-13 15:32:38, Steven Whitehouse wrote:
> On Tue, 2013-12-17 at 15:01 +0100, Jan Kara wrote:
> > On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > >
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > > >
> > > > > > Currently we check i_size in the buffered read path after we
> > > > > > know the page is uptodate. But it could return some zero-filled
> > > > > > pages to userspace when we do some append dio writes. We could
> > > > > > use the following code snippet to reproduce this problem in an
> > > > > > ext2/3/4 filesystem.
> > > > > >
> > > > > If the page is not uptodate, then neither (potentially) is
> > > > > i_size, since the underlying fs has not had a chance to get its
> > > > > own locks and update the inode size.
> > > >   Hum, this isn't really about the page being uptodate or not, is
> > > > it? It is more about the fact that gfs2 updates i_size only from
> > > > the gfs2_readpage() function when obtaining the inode lock. Or do
> > > > you know about another filesystem having the same problem as gfs2?
> > > > E.g. ocfs2 updates i_size from ocfs2_file_aio_read() before
> > > > calling generic_file_aio_read(); cannot gfs2 do something similar?
> > > >
> > > Well we could do that, but I don't think it makes sense really. The
> > > point of having the page cache is to put it "in front" of the
> > > expensive operations where possible (such as getting cluster locks).
> > > I don't think it makes sense to grab and drop a cluster lock on
> > > every call to generic_file_aio_read() when, for cached pages, we
> > > would not even need to call ->readpage.
> >   Well, but in a cluster filesystem I fail to see how you want to
> > synchronize a read from the pagecache on one node with a truncate /
> > write on another node. Either the modification has to be broadcast to
> > all nodes having the file cached, or the read has to check whether
> > the data cached in the pagecache is still valid. How is gfs2 solving
> > this?
> >
> The same way that every other page cache based filesystem does it. If
> there is a cached page that is uptodate, then it can be copied to the
> user, and the page lock is used to ensure that the state doesn't
> change under us.
  If a page is uptodate, we don't ever take the page lock (see
do_generic_file_read())...
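For reference, the uptodate-page fast path there goes roughly like this
(my paraphrase from memory, with readahead and error handling omitted,
so don't take it as the literal mainline source):

	struct page *page;
	loff_t isize;
	pgoff_t end_index;
	unsigned long nr;

	page = find_get_page(mapping, index);
	if (page && PageUptodate(page)) {
		/*
		 * Fast path: no lock_page() anywhere. We check i_size
		 * only once we know the page is uptodate, and clamp
		 * "nr" so that the zero-filled tail of the last page
		 * is never copied to userspace.
		 */
		isize = i_size_read(inode);
		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
		if (unlikely(!isize || index > end_index))
			goto out;		/* read beyond EOF */

		nr = PAGE_CACHE_SIZE;
		if (index == end_index)		/* last page of file */
			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;

		/* the copy may fault, hence no page lock is held here */
		ret = file_read_actor(desc, page, offset, nr - offset);
	}
	/* slow path: lock the page, call ->readpage(), recheck, ... */

Note that the fast path never calls into the filesystem at all, so a
stale i_size never gets refreshed on it.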
> If there is no cached page, then the vfs has to ask the filesystem to
> provide one, and then the cluster locks (glocks) are taken. Those
> locks are then cached until another node, or memory pressure, results
> in them being dropped.
>
> GFS2's internal locking via the glock layer will invalidate all cached
> pages before it drops the glock to a mode other than shared or
> exclusive. It will also write back all dirty data when dropping an
> exclusive lock. For dio we use the "deferred" mode, which is basically
> a shared mode that is incompatible with the normal shared mode - it
> allows shared caching of metadata, but no caching of data. GFS2
> reverts to buffered I/O in order to deal with cases where we need to
> allocate blocks, including file extension.
  OK, thanks for the explanation. Now I see how you avoid the races I
was thinking about.

> > > The problem appears to be that somehow we are getting a page
> > > returned when it should not be. So I suspect that the correct place
> > > to catch that is in ->readpage(s)
> >   There are two things which aren't ideal:
> > 1) DIO writes and buffered reads can race, resulting in reads
> > returning zero-filled pages. Fixing this corner case would slow down
> > the common case, so we decided to put this race in the category of
> > "don't do it when it hurts".
> Well if I alter the test program to use buffered reads, then it takes
> much longer to hit the race. With the original O_DIRECT reads, it is
> not a great surprise that not all the write i/o has actually hit disk
> at the time of the read, since as you say there is no synchronization
> between the reads and writes.
>
> So that makes sense to me.
>
> > 2) DIO read falls back to buffered read on failure - this is there
> > to simplify the life of filesystems which don't support DIO at all
> > or do not handle some special cases like DIO from holes,
> > non-block-aligned DIO, etc.
> >
> > The combination of these two leads to the problem that a DIO read
> > can return a zero-filled page when racing with a DIO write. With
> > enough luck this can even lead to exposure of uninitialized data:
> >
> > DIO WRITE                     BUFFERED READ
> > - allocates blocks
> >                               - sees blocks mapped so it reads data
> > - submits data write
> > - waits for IO completion
> > - extends i_size
> >                               - sees large i_size so hands over data
> >                                 to userspace
> >
> > Now without proper serialization of reads & writes this is hard to
> > avoid (we plan range locks for that, but that's still blocked by
> > some mm changes). But checking i_size before reading data at least
> > makes the race *much* less serious (you have to have two threads
> > modifying i_size in parallel with the running read, and since these
> > threads are synchronized on i_mutex, some races become impossible;
> > the ones returning 0s are still possible but harder to trigger).
> >
> Well I agree with pretty much all of that except the part about adding
> in the extra i_size test. It may make the race harder to hit, but it
> doesn't eliminate it. Instead it means that we'll have to add a bunch
> of extra code into the fast path just to ensure that it is uptodate.
>
> The reason that the i_size check that is already in
> do_generic_file_read() is ok is that there is a locked page at the
> time of the i_size_read call, as per the comment:
>
> /*
>  * i_size must be checked after we know the page is Uptodate.
>  *
>  * Checking i_size after the check allows us to calculate
>  * the correct value for "nr", which means the zero-filled
>  * part of the page is not copied back to userspace (unless
>  * another truncate extends the file - this is desired though).
>  */
  Note that the page is *not* locked at the time of the i_size check.
And I'm aware of the comment - however, that seems to talk about a case
where truncate() makes the file smaller. If we checked i_size only
before having the page uptodate, we could see the i_size value from
before the truncate and thus return to userspace data that has already
been zeroed out by block_truncate_page() or its equivalent.

Actually, the race still seems to be there, as we could read i_size
before the truncate begins and, if we took a long enough nap just after
that, file_read_actor() could copy data already zeroed out by
block_truncate_page(). That just doesn't seem likely to happen in
practice.

So you are right that we might be better off holding the page lock over
the i_size check and the copy to userspace. However, that isn't really
possible, as it could deadlock - consider an evil program that does:

	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	read(fd, buf, 4096);

In that case file_read_actor() will hit a page fault when copying data
to userspace, and to satisfy the page fault we need the very same page
lock that we would hold over the i_size check. This is the reason why
the write path goes through the hoops of ->write_begin, ->write_end,
and prefaulting pages.
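For completeness, a self-contained version of that evil program might
look like the following (a hypothetical test of mine - the file name is
arbitrary and error checking is omitted for brevity):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf;
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);

		/* make sure there is a page to map and to read */
		ftruncate(fd, 4096);

		buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

		/*
		 * The copy in file_read_actor() faults on 'buf', and the
		 * fault needs the lock of the very page read() is copying
		 * from - so if read held the page lock over the copy,
		 * this would deadlock.
		 */
		read(fd, buf, 4096);

		munmap(buf, 4096);
		close(fd);
		return 0;
	}

Mixing read() with a MAP_SHARED mapping of the same range like this is
unusual but perfectly legal, so the kernel has to cope with it.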
> > Now we could check i_size in ->readpage / ->readpages, but
> > conceptually that seems like a wrong thing to do, since readpage
> > only cares about buffers being mapped. OTOH generic_file_aio_read()
> > et al. already handle the i_size checks. And GFS2 seems to be the
> > only fs that has problems with a stale i_size (OCFS2 and NFS both
> > revalidate it in their .aio_read callbacks). So I'd be happier if we
> > could handle this inside GFS2...
> >
> I would argue that GFS2 is doing this the right way in this case
> though - it makes no sense to put the i_size check into the fast path,
> when it is only required in the case that there is no cached page (i.e.
> the slow path). In addition, putting it in the slow path (where the fs
> can take its own locks) also fixes the race, rather than just making
> it more difficult to hit.
  I disagree here: the i_size check is necessary regardless of whether
the page is uptodate or not. For local filesystems truncate can happen
completely independently of the read...

> So far as I can tell the VFS code was originally intended to work in
> exactly this way, however it is not that obvious that this is the
> case; maybe better documentation is required here.
  Originally, the VFS never thought about clustered filesystems, so
i_size was always uptodate. We only had to take care about truncate
changing i_size while a read was running. And to mitigate problems with
that, we needed to do the check after the page is uptodate.

> If we did add in some GFS2 code to update i_size ahead of
> generic_file_aio_read() then we can only do that by grabbing and
> dropping the glock. That is a relatively expensive operation, and
> still doesn't fix the race, as we must drop the glock before calling
> generic_file_aio_read() since we must not hold glocks over the copy to
> user stage.
  Is it that expensive even if you have the glock already cached?
Because if you have a cached page, you have to have a cached glock as
well, if I understand the GFS2 design right.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR