On Fri, 24 Jan 2014 10:11:11 -0700 Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote: > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > > On Wed, 22 Jan 2014 07:04:09 -0500 > > Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > > >> On Wed, 22 Jan 2014 00:24:14 -0800 > >> Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote: > >> > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote: > >>>> In any case, this helps but it's a little odd. With this patch, you add > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've > >>>> also left in the call to nfs_zap_mapping in the completion codepath. > >>>> > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then > >>>> mark the mapping for invalidation again when the write completes. Was > >>>> that intentional? > >>>> > >>>> It seems a little excessive and might hurt performance in some cases. > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and > >>>> this approach seems to give better cache coherency. > >>> > >>> Thile follows the model implemented and documented in > >>> generic_file_direct_write(). > >>> > >> > >> Ok, thanks. That makes sense, and the problem described in those > >> comments is almost exactly the one I've seen in practice. > >> > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA > >> flag is handled, but that really has nothing to do with this patchset. > >> > >> You can add my Tested-by to the set if you like... > >> > > > > (re-sending with Trond's address fixed) > > > > I may have spoken too soon... > > > > This patchset didn't fix the problem once I cranked up the concurrency > > from 100 child tasks to 1000. I think that HCH's patchset makes sense > > and helps narrow the race window some, but the way that > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy. > > > > The following patch does seem to fix it however. It's a combination of > > a test patch that Trond gave me a while back and another change to > > serialize the nfs_invalidate_mapping ops. > > > > I think it's a reasonable approach to deal with the problem, but we > > likely have some other areas that will need similar treatment since > > they also check NFS_INO_INVALID_DATA: > > > > nfs_write_pageuptodate > > nfs_readdir_search_for_cookie > > nfs_update_inode > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an > > opinion on the basic approach, or whether you have an idea of how > > better handle the races here: > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate. > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there. > nfs_write_pageuptodate does this: ---------------8<----------------- if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE)) return false; out: return PageUptodate(page) != 0; ---------------8<----------------- With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and only later would the page be invalidated. So, there's a race window in there where the bit could be cleared but the page flag is still set, even though it's on its way out the cache. So, I think we'd need to do some similar sort of locking in there to make sure that doesn't happen. nfs_update_inode just does this: if (invalid & NFS_INO_INVALID_DATA) nfs_fscache_invalidate(inode); ...again, since we clear the bit first with this patch, I think we have a potential race window there too. We might not see it set in a situation where we would have before. That case is a bit more problematic since we can't sleep to wait on the bitlock there. It might be best to just get rid of that call altogether and move it into nfs_invalidate_mapping. It seems to me that we ought to just handle fscache the same way we do the pagecache when it comes to invalidation. As far as the readdir code goes, I haven't looked as closely at that yet. I just noticed that it checked for NFS_INO_INVALID_DATA. Once we settle the other two cases, I'll give that closer scrutiny. Thanks, -- Jeff Layton <jlayton@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html