On Fri, 24 Jan 2014 10:40:06 -0700
Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:

> On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > On Fri, 24 Jan 2014 10:11:11 -0700
> > Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
> >
> > >
> > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > >
> > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > >> Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > > >>
> > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > >>>>
> > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > >>>> that intentional?
> > > >>>>
> > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > >>>> this approach seems to give better cache coherency.
> > > >>>
> > > >>> This follows the model implemented and documented in
> > > >>> generic_file_direct_write().
> > > >>>
> > > >>
> > > >> Ok, thanks. That makes sense, and the problem described in those
> > > >> comments is almost exactly the one I've seen in practice.
> > > >>
> > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > >> flag is handled, but that really has nothing to do with this patchset.
> > > >>
> > > >> You can add my Tested-by to the set if you like...
> > > >>
> > > >
> > > > (re-sending with Trond's address fixed)
> > > >
> > > > I may have spoken too soon...
> > > >
> > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > and helps narrow the race window some, but the way that
> > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > >
> > > > The following patch does seem to fix it however. It's a combination of
> > > > a test patch that Trond gave me a while back and another change to
> > > > serialize the nfs_invalidate_mapping ops.
> > > >
> > > > I think it's a reasonable approach to deal with the problem, but we
> > > > likely have some other areas that will need similar treatment since
> > > > they also check NFS_INO_INVALID_DATA:
> > > >
> > > >     nfs_write_pageuptodate
> > > >     nfs_readdir_search_for_cookie
> > > >     nfs_update_inode
> > > >
> > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > opinion on the basic approach, or whether you have an idea of how
> > > > better handle the races here:
> > >
> > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > >
> >
> >
> > nfs_write_pageuptodate does this:
> >
> > ---------------8<-----------------
> >         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> >                 return false;
> > out:
> >         return PageUptodate(page) != 0;
> > ---------------8<-----------------
> >
> > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > only later would the page be invalidated. So, there's a race window in
> > there where the bit could be cleared but the page flag is still set,
> > even though it's on its way out the cache. So, I think we'd need to do
> > some similar sort of locking in there to make sure that doesn't happen.
>
> We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> end up deadlocking with invalidate_inode_pages2().
>
> If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> the optimisation in that case, but I'd like to understand what the race
> would be: don't forget that the page is marked as PageUptodate(), which
> means that either invalidate_inode_pages2() has not yet reached this
> page, or that a read of the page succeeded after the invalidation was
> made.
>

Right. The first situation seems wrong to me. We've marked the file as
INVALID and then cleared the bit to start the process of invalidating
the actual pages. It seems like nfs_write_pageuptodate ought not return
true even if PageUptodate() is still set at that point.

We could check NFS_INO_INVALIDATING, but we might miss that optimization
in a lot of cases just because something happens to be in
nfs_revalidate_mapping. Maybe that means that this bitlock isn't
sufficient and we need some other mechanism. I'm not sure what that
should be though.

> > nfs_update_inode just does this:
> >
> >         if (invalid & NFS_INO_INVALID_DATA)
> >                 nfs_fscache_invalidate(inode);
> >
> > ...again, since we clear the bit first with this patch, I think we have
> > a potential race window there too. We might not see it set in a
> > situation where we would have before. That case is a bit more
> > problematic since we can't sleep to wait on the bitlock there.
>
> Umm... That test in nfs_update_inode() is there because we might just
> have _set_ the NFS_INO_INVALID_DATA bit.
>

Correct. But do we need to force a fscache invalidation at that point,
or can it wait until we're going to invalidate the mapping too?

> >
> > It might be best to just get rid of that call altogether and move it
> > into nfs_invalidate_mapping. It seems to me that we ought to just
> > handle fscache the same way we do the pagecache when it comes to
> > invalidation.
> >
> > As far as the readdir code goes, I haven't looked as closely at that
> > yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> > settle the other two cases, I'll give that closer scrutiny.
> >

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
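
For anyone following along, the window being argued about can be modelled
in user space. What follows is only a rough sketch under stated
assumptions, not kernel code and not the patch itself: struct model_inode,
the INO_* bits and every function name below are hypothetical stand-ins,
a single flag stands in for PageUptodate() on one page, the
NFS_INO_REVAL_PAGECACHE test is left out, and a second invalidator simply
backs off instead of sleeping on the bitlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the cache_validity bits named in the thread. */
#define INO_INVALID_DATA  (1u << 0)
#define INO_INVALIDATING  (1u << 1)

struct model_inode {
    atomic_uint cache_validity;  /* models NFS_I(inode)->cache_validity */
    atomic_bool page_uptodate;   /* models PageUptodate() on a single page */
};

/* Models some path (e.g. an attribute update) flagging the data as stale. */
static void mark_invalid(struct model_inode *ino)
{
    atomic_fetch_or(&ino->cache_validity, INO_INVALID_DATA);
}

/*
 * First half of invalidation: in one atomic step, clear INVALID_DATA and
 * take the INVALIDATING bit. Returns false when there is nothing to do or
 * another invalidation is already in flight (assumption: back off rather
 * than wait).
 */
static bool begin_invalidate(struct model_inode *ino)
{
    unsigned int old = atomic_load(&ino->cache_validity);
    unsigned int next;

    do {
        if (!(old & INO_INVALID_DATA) || (old & INO_INVALIDATING))
            return false;
        next = (old & ~INO_INVALID_DATA) | INO_INVALIDATING;
    } while (!atomic_compare_exchange_weak(&ino->cache_validity, &old, next));

    return true;
}

/* Second half: drop the page, then release the INVALIDATING bit. */
static void end_invalidate(struct model_inode *ino)
{
    assert(atomic_load(&ino->cache_validity) & INO_INVALIDATING);
    atomic_store(&ino->page_uptodate, false);
    atomic_fetch_and(&ino->cache_validity, ~INO_INVALIDATING);
}

/* The check as quoted in the thread, looking only at INVALID_DATA. */
static bool write_pageuptodate_plain(struct model_inode *ino)
{
    if (atomic_load(&ino->cache_validity) & INO_INVALID_DATA)
        return false;
    return atomic_load(&ino->page_uptodate);
}

/* Variant with the additional INVALIDATING test suggested in the thread. */
static bool write_pageuptodate_checked(struct model_inode *ino)
{
    if (atomic_load(&ino->cache_validity) &
        (INO_INVALID_DATA | INO_INVALIDATING))
        return false;
    return atomic_load(&ino->page_uptodate);
}
```

Calling mark_invalid() and then begin_invalidate() without yet calling
end_invalidate() puts the model in the disputed state: the plain check
reports the page as up to date because INVALID_DATA is already clear,
while the checked variant refuses. The cost noted above is also visible
here, since the checked variant turns the optimisation off for the whole
time the INVALIDATING bit is held, whether or not this page is affected.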