On Tue, 2013-09-10 at 14:42 +1000, NeilBrown wrote:
> On Thu, 5 Sep 2013 03:17:54 +0000 "Myklebust, Trond"
> <Trond.Myklebust@xxxxxxxxxx> wrote:
>
> > On Thu, 2013-09-05 at 09:29 +1000, NeilBrown wrote:
> > > On Wed, 4 Sep 2013 14:33:35 +0000 "Myklebust, Trond"
> > > <Trond.Myklebust@xxxxxxxxxx> wrote:
> > >
> > > > On Wed, 2013-09-04 at 11:36 +1000, NeilBrown wrote:
> > > > > Hi,
> > > > > we have a customer who has noticed a significant regression in NFS
> > > > > performance between SLES10 (2.6.16+) and SLES11 (3.0+).
> > > > >
> > > > > The use case involves two different processes on the same client
> > > > > mmapping a file and making changes.
> > > > >
> > > > > In 2.6.16, all these changes would be cached on the client until a
> > > > > flush or close or other normal process flushed the changes out -
> > > > > much the same as if just one process was accessing the file.
> > > > >
> > > > > In 3.0 the file is flushed out every time a different process
> > > > > performs an update. So if they interleave accesses, you get a
> > > > > flush-to-the-server on every access.
> > > > >
> > > > > This is caused by nfs_flush_incompatible. It deems access through
> > > > > different file descriptors to be incompatible. However I don't
> > > > > think this is justified. Access using different credentials can
> > > > > reasonably be seen as incompatible(*), but if two processes with
> > > > > identical credentials open the same file then there should be no
> > > > > need to differentiate between them.
> > > > >
> > > > > The patch below changes the test in nfs_flush_incompatible to just
> > > > > test the credentials (and the dentry, though that might be
> > > > > superfluous - it seems safer).
> > > > >
> > > > > (*) In this specific customer's case the processes do currently
> > > > > have slightly different credentials. In one, the primary gid is
> > > > > included in the list of auxiliary gids. In the other the primary
> > > > > gid is excluded.
> > > > > This is enough to make the credentials appear different, so this
> > > > > patch doesn't help in that case. They are looking at being more
> > > > > consistent in their setgroups usage, but it would help them if we
> > > > > only actually compared the credentials that were important. In
> > > > > this case there is no important difference.
> > > > > Would it make sense to treat a broader credential as compatible
> > > > > with a narrower credential? So a write from a process with a given
> > > > > uid, gid, grouplist could ultimately be sent to the server with the
> > > > > same uid/gid, but a subset of the group list?
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Thanks,
> > > > > NeilBrown
> > > > >
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index 7816801..b0d4e47 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -860,7 +860,12 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
> > > > >  		if (req == NULL)
> > > > >  			return 0;
> > > > >  		l_ctx = req->wb_lock_context;
> > > > > -		do_flush = req->wb_page != page || req->wb_context != ctx;
> > > > > +		do_flush = req->wb_page != page;
> > > > > +		if (req->wb_context != ctx) {
> > > > > +			do_flush |=
> > > > > +				req->wb_context->dentry != ctx->dentry ||
> > > > > +				req->wb_context->cred != ctx->cred;
> > > > > +		}
> > > > >  		if (l_ctx) {
> > > > >  			do_flush |= l_ctx->lockowner.l_owner != current->files
> > > > >  				|| l_ctx->lockowner.l_pid != current->tgid;
> > > >
> > > > Hi Neil,
> > > >
> > > > Won't this cause any write errors to be propagated to the wrong
> > > > process?
> > >
> > > I'm not sure what this means.
> > >
> > > For a local filesystem, the only error that isn't synchronous is EIO,
> > > and an application can only be sure of getting that if it uses
> > > 'fsync'. An application can get EIO from fsync even if it didn't
> > > write to the page which is causing the problem.
> > > All apps should be able to see the EIO if they bother to fsync.
> >
> > You are describing a consequence of the complete and utter brokenness
> > of the block layer. The block layer doesn't track who wrote what into
> > the page cache, and so it just passes any write errors to whichever
> > process first calls filemap_check_errors() for that file.
>
> I would have said "pragmatic design" rather than "complete ...
> brokenness".
>
> If a write has failed then the file is corrupt and every writer deserves
> to know.
> If multiple processes try to fsync the same file, pages that fail for
> one process will likely fail for the next process, so each should get
> told of the failure.
>
> (actually... I wonder if I'm missing something. If you have a Dirty page
> which results in an Error every time it is written, is there any way for
> it to be removed from the page cache, or will it remain there causing
> every 'fsync' on the file to try to write the same page and get an
> error?)

It will usually be removed. There is no point in keeping a page that is
getting EACCES due to server-side permission checks, for instance. What
we'll do in that case is throw out the page, and then return EACCES to the
next write() by that process and/or the next fsync().

> > In NFS, we use the nfs_open_context to track exactly who wrote what,
> > and use that to notify the specific process that wrote the data that
> > failed.
>
> While this could be seen as an improvement, the extra guarantees are
> only available on NFS, so apps are not likely to depend on them, so the
> extra value doesn't actually help anyone - does it?
>
> And it has a clear cost: two processes on the one machine can suffer
> significantly degraded performance.
> Is the theoretical improvement in semantics worth that cost?

These are the semantics that we've enforced for the past 15 years.
We even used to have a dedicated filp->f_error just for NFS until we were
able to hide it in the filesystem-specific open context. Changing those
semantics now is not something I'm prepared to do without a very good
reason. You've admitted that your own workload would _not_ benefit from
the kind of change that you're advocating, so as far as I'm concerned
that pretty much makes the question moot for now.

> > > For NFS, EDQUOT and ENOSPC are also asynchronous, and they can also
> > > come in through close(). I think every app should still see every
> > > error whether it dirtied the page which triggered the error or not.
> >
> > With this patch, we can end up writing the data using the wrong
> > nfs_open_context, in which case the errors will get reported to the
> > process that actually owns that nfs_open_context.
>
> I see, yes. I would want the error to be returned to whoever instigated
> the 'flush', not whoever issues the 'write'.

That only makes sense if all processes that call fsync() see the same
errors, which is not the case (see above).

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com