Re: [PATCH 7/7] nfs: page cache invalidation for dio

On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:11:11 -0700
> Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
> 
> > 
> > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > 
> > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > >> Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > >> 
> > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > >>>> 
> > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > >>>> mark the mapping for invalidation again when the write completes. Was
> > >>>> that intentional?
> > >>>> 
> > >>>> It seems a little excessive and might hurt performance in some cases.
> > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > >>>> this approach seems to give better cache coherency.
> > >>> 
> > >>> This follows the model implemented and documented in
> > >>> generic_file_direct_write().
> > >>> 
> > >> 
> > >> Ok, thanks. That makes sense, and the problem described in those
> > >> comments is almost exactly the one I've seen in practice.
> > >> 
> > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > >> flag is handled, but that really has nothing to do with this patchset.
> > >> 
> > >> You can add my Tested-by to the set if you like...
> > >> 
> > > 
> > > (re-sending with Trond's address fixed)
> > > 
> > > I may have spoken too soon...
> > > 
> > > This patchset didn't fix the problem once I cranked up the concurrency
> > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > and helps narrow the race window some, but the way that
> > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > 
> > > The following patch does seem to fix it however. It's a combination of
> > > a test patch that Trond gave me a while back and another change to
> > > serialize the nfs_invalidate_mapping ops.
> > > 
> > > I think it's a reasonable approach to deal with the problem, but we
> > > likely have some other areas that will need similar treatment since
> > > they also check NFS_INO_INVALID_DATA: 
> > > 
> > >    nfs_write_pageuptodate
> > >    nfs_readdir_search_for_cookie
> > >    nfs_update_inode
> > > 
> > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > opinion on the basic approach, or whether you have an idea of how
> > > better handle the races here:
> > 
> > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > 
> 
> 
> nfs_write_pageuptodate does this:
> 
> ---------------8<-----------------
>         if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
>                 return false;
> out:
>         return PageUptodate(page) != 0;
> ---------------8<-----------------
> 
> With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> only later would the page be invalidated. So, there's a race window in
> there where the bit could be cleared but the page flag is still set,
> even though it's on its way out of the cache. So, I think we'd need to do
> some similar sort of locking in there to make sure that doesn't happen.

We _cannot_ lock against nfs_revalidate_mapping() here, because we could
end up deadlocking with invalidate_inode_pages2().

If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
the optimisation in that case, but I'd like to understand what the race
would be: don't forget that the page is marked as PageUptodate(), which
means that either invalidate_inode_pages2() has not yet reached this
page, or that a read of the page succeeded after the invalidation was
made.
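
Something like the following, perhaps (an untested sketch just to
illustrate the idea -- it assumes NFS_INO_INVALIDATING is a bit in
nfsi->flags that stays set while nfs_invalidate_mapping() runs, and it
omits the delegation short-circuit in the real function):

---------------8<-----------------
/* Sketch only: bail out of the "page is up to date" optimisation
 * whenever an invalidation is in flight. */
static bool nfs_write_pageuptodate(struct page *page, struct inode *inode)
{
	struct nfs_inode *nfsi = NFS_I(inode);

	if (nfsi->cache_validity &
	    (NFS_INO_INVALID_DATA | NFS_INO_REVAL_PAGECACHE))
		return false;
	/* Pairs with a barrier on the invalidation side so that we
	 * observe NFS_INO_INVALIDATING if it has been set. */
	smp_rmb();
	if (test_bit(NFS_INO_INVALIDATING, &nfsi->flags))
		return false;
	return PageUptodate(page) != 0;
}
---------------8<-----------------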

> nfs_update_inode just does this:
> 
>         if (invalid & NFS_INO_INVALID_DATA)
>                 nfs_fscache_invalidate(inode);
> 
> ...again, since we clear the bit first with this patch, I think we have
> a potential race window there too. We might not see it set in a
> situation where we would have before. That case is a bit more
> problematic since we can't sleep to wait on the bitlock there.

Umm... That test in nfs_update_inode() is there because we might just
have _set_ the NFS_INO_INVALID_DATA bit.
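
To illustrate the point, a heavily simplified sketch (not the real
nfs_update_inode() body; the size comparison is just one example of how
the local 'invalid' mask gets populated):

---------------8<-----------------
/* 'invalid' is a local mask that this call builds up itself, so the
 * fscache check below looks at flags this very call may just have
 * set, not at an older cache_validity state that could race with
 * nfs_revalidate_mapping(). */
static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
{
	struct nfs_inode *nfsi = NFS_I(inode);
	unsigned long invalid = 0;

	if (i_size_read(inode) != (loff_t)fattr->size)	/* example trigger */
		invalid |= NFS_INO_INVALID_DATA;
	/* ... more attribute comparisons ... */

	if (invalid & NFS_INO_INVALID_DATA)
		nfs_fscache_invalidate(inode);

	nfsi->cache_validity |= invalid;
	return 0;
}
---------------8<-----------------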

> 
> It might be best to just get rid of that call altogether and move it
> into nfs_invalidate_mapping. It seems to me that we ought to just
> handle fscache the same way we do the pagecache when it comes to
> invalidation.
> 
> As far as the readdir code goes, I haven't looked as closely at that
> yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we
> settle the other two cases, I'll give that closer scrutiny.
> 
> Thanks,


-- 
Trond Myklebust
Linux NFS client maintainer




