On Tue, Dec 17, 2013 at 12:17:44PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Tue, 2013-12-17 at 19:16 +0800, Zheng Liu wrote:
> > Hi Steven,
> > 
> > On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > > 
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@xxxxxxxxxx>
> > > > > > 
> > > > > > Currently we check i_size in the buffered read path after we know the
> > > > > > page is uptodate, but this can return zero-filled pages to userspace
> > > > > > when we do append dio writes.  We could use the following code snippet
> > > > > > to reproduce this problem on an ext2/3/4 filesystem.
> > > > > > 
> > > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > > the underlying fs has not had a chance to get its own locks and update
> > > > > the inode size.
> > > > 
> > > > Hum, this isn't really about the page being uptodate or not, is it?  It is
> > > > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > > > when obtaining the inode lock.  Or do you know of other filesystems having
> > > > the same problem as gfs2?  E.g. ocfs2 updates i_size from
> > > > ocfs2_file_aio_read() before calling generic_file_aio_read(); can't gfs2
> > > > do something similar?
> > > > 
> > > Well, we could do that, but I don't think it makes sense really.  The
> > > point of having the page cache is to put it "in front" of the expensive
> > > operations where possible (such as getting cluster locks).  I don't think
> > > it makes sense to grab and drop a cluster lock on every call to
> > > generic_file_aio_read() when, for cached pages, we would not even need
> > > to call ->readpage.
> > > 
> > > The problem appears to be that somehow we are getting a page returned
> > > when it should not be, so I suspect that the correct place to catch that
> > > is in ->readpage(s).
> > > 
> > > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > > avoid the race that you've identified here, if I've understood it
> > > > > correctly,
> > > > 
> > > > I don't understand how that would work.  Can you elaborate a bit?  Here
> > > > the problem is that i_size gets extended by a DIO write during a buffered
> > > > read (which is a fallback from a DIO read), so we return a zero-filled
> > > > page instead of either a data-filled page or a short read.  I don't see
> > > > where launder_page() would come into the picture...
> > > > 
> > > > 								Honza
> > > 
> > > Ah, sorry.  I was thinking the other way around wrt read/write :(  However,
> > > I still don't think that generic_file_aio_read() is the right place to fix
> > > this.  I note that XFS seems to pass the test, and it also uses
> > > mpage_readpage and mpage_readpages as well as generic_file_aio_read(), so
> > > maybe that is a clue as to where the answer lies.  GFS2 seems to fail the
> > > test in the same way as ext3 at the moment.
> > 
> > Yes, xfs can pass the test under direct IO.  I suspect that the reason is
> > that xfs uses its ilock/iolock.  But that doesn't mean that all file
> > systems need to fix this issue by themselves, because the root cause is in
> > the vfs layer.
> > 
> I'm not convinced that this is the case though.
> The VFS is asking the fs to provide a page via ->readpage(s), and therefore
> the fs should not be providing a page (presumably marked uptodate in this
> case) where one does not already exist; and if it does exist, then the
> content of the page should be correct.

As far as I understand, ->readpage(s) should return zero-filled pages if the
page doesn't have a block mapping, because its content should be '0'.

> I don't see how this test can be done at a point before the fs has had a
> chance to get its own locks and thus ensure that the inode size is actually
> correct.
> 
> However, I'd like to understand what is going on at the ->readpage level,
> and thus why we get this page generated incorrectly, and what actual i/o
> gets generated (or not) in the problematic case.
> 
> > > The other question that we've not really answered is why it is useful to
> > > mix buffered and direct i/o to the same file at the same time anyway.
> > > While I agree that it ought to work, I'm not convinced that it's
> > > enormously useful, which is maybe why it has not been spotted before,
> > 
> > Yes, mixing buffered reads and direct writes does seem fairly useless.  But
> > in our production system the issue shows up under pure direct read/write.
> > The application has a reader and a writer.  The reader should see either
> > nothing or the content that has been written by the writer, but instead it
> > can get a zero-filled page.  Hence the application has to guard against
> > this with a mutex or some other flag, which makes no sense to the userspace
> > programmer.
> > 
> I'm not sure I fully understand what you are saying here... if you see
> the same issue using only O_DIRECT reads and writes, how are you getting
> that?

Sorry, maybe I didn't make the test program clear.  The test program in the
commit log does direct reads and writes.  I found this problem because an
application developer asked me why his application doesn't work well under
direct read/write (a minimal sketch of that reader/writer pattern is appended
below).

> > TBH, we found this issue two years ago but didn't take care of it.  So
> > currently the application just works around it as I said above, but it
> > issues fsync(2) to flush the dirty pages in order to make sure the data
> > has been flushed to disk.  Obviously the performance of the application
> > is impacted by this.  Now we know that we don't really need to flush the
> > dirty pages, but I think we'd better fix the underlying problem.
> > 
> > Regards,
> >                                                 - Zheng
> 
> Calling fsync is a separate issue though, since that is required anyway
> if you want to be sure that the data has actually reached the disk.

Yes, calling fsync is another issue.  I just want to say that this issue has
impacted the application, at least in our production system.

Thanks,
                                                - Zheng
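
As mentioned above, here is a minimal sketch of the reader/writer pattern,
not the actual test program from the commit log: one process appends an
aligned block with O_DIRECT while another reads the same range with O_DIRECT.
The file name, the 4096-byte alignment, and the retry loop are only
illustrative assumptions.  On an affected kernel the reader can occasionally
get back a non-empty but zero-filled buffer instead of either the written
data or a short read.

/* Minimal sketch, not the reproducer from the commit log. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLKSZ 4096	/* assumed block/sector alignment for O_DIRECT */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "dio-race-test";
	char *buf;
	ssize_t ret;
	int fd, i;

	if (posix_memalign((void **)&buf, BLKSZ, BLKSZ))
		return 1;

	/* Start with an empty file so the write below is an append. */
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) { perror("create"); return 1; }
	close(fd);

	if (fork() == 0) {
		/* Writer: append one aligned block of 'a' with O_DIRECT. */
		fd = open(path, O_WRONLY | O_APPEND | O_DIRECT);
		if (fd < 0) { perror("writer open"); _exit(1); }
		memset(buf, 'a', BLKSZ);
		if (write(fd, buf, BLKSZ) != BLKSZ)
			perror("writer write");
		close(fd);
		_exit(0);
	}

	/*
	 * Reader: O_DIRECT read of the block the writer is appending.
	 * Expected results are a short read (0 bytes, nothing visible
	 * yet) or a buffer full of 'a'; a non-empty but zero-filled
	 * result is the problem discussed above.
	 */
	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("reader open"); return 1; }

	for (i = 0; i < 1000; i++) {
		memset(buf, 0, BLKSZ);
		ret = pread(fd, buf, BLKSZ, 0);
		if (ret > 0) {
			if (buf[0] == '\0')
				fprintf(stderr, "got %zd zero-filled bytes\n", ret);
			break;
		}
	}

	close(fd);
	wait(NULL);
	free(buf);
	return 0;
}

Build it with something like "gcc -o dio-race dio-race.c" and run it on the
filesystem under test; since the race window is narrow, it may need to be
run repeatedly before the zero-filled read shows up.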