On Tue, 2013-09-10 at 14:42 +1000, NeilBrown wrote:
> On Thu, 5 Sep 2013 03:17:54 +0000 "Myklebust, Trond"
> <Trond.Myklebust@xxxxxxxxxx> wrote:
>
> > On Thu, 2013-09-05 at 09:29 +1000, NeilBrown wrote:
> > > On Wed, 4 Sep 2013 14:33:35 +0000 "Myklebust, Trond"
> > > <Trond.Myklebust@xxxxxxxxxx> wrote:
> > >
> > > > On Wed, 2013-09-04 at 11:36 +1000, NeilBrown wrote:
> > > > > Hi,
> > > > > we have a customer who has noticed a significant regression in NFS
> > > > > performance between SLES10 (2.6.16+) and SLES11 (3.0+).
> > > > >
> > > > > The use case involves two different processes on the same client
> > > > > mmapping a file and making changes.
> > > > >
> > > > > In 2.6.16, all these changes would be cached on the client until a
> > > > > flush or close or other normal process flushed the changes out -
> > > > > much the same as if just one process was accessing the file.
> > > > >
> > > > > In 3.0 the file is flushed out every time a different process
> > > > > performs an update. So if they interleave accesses, you get a
> > > > > flush-to-the-server on every access.
> > > > >
> > > > > This is caused by nfs_flush_incompatible. It deems access through
> > > > > different file descriptors to be incompatible. However I don't
> > > > > think this is justified. Access using different credentials can
> > > > > reasonably be seen as incompatible(*), but if two processes with
> > > > > identical credentials open the same file then there should be no
> > > > > need to differentiate between them.
> > > > >
> > > > > The patch below changes the test in nfs_flush_incompatible to just
> > > > > test the credentials (and the dentry, though that might be
> > > > > superfluous - it seems safer).
> > > > >
> > > > > (*) In this specific customer's case the processes do currently
> > > > > have slightly different credentials. In one, the primary gid is
> > > > > included in the list of auxiliary gids. In the other the primary
> > > > > gid is excluded.
> > > > > This is enough to make the credentials appear different, so this
> > > > > patch doesn't help in that case. They are looking at being more
> > > > > consistent in their setgroups usage, but it would help them if we
> > > > > only actually compared the credentials that were important. In
> > > > > this case there is no important difference.
> > > > > Would it make sense to treat a broader credential as compatible
> > > > > with a narrower credential? So a write from a process with a given
> > > > > uid, gid, grouplist could ultimately be sent to the server with the
> > > > > same uid/gid, but a subset of the group list?
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Thanks,
> > > > > NeilBrown
> > > > >
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index 7816801..b0d4e47 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -860,7 +860,12 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
> > > > >  		if (req == NULL)
> > > > >  			return 0;
> > > > >  		l_ctx = req->wb_lock_context;
> > > > > -		do_flush = req->wb_page != page || req->wb_context != ctx;
> > > > > +		do_flush = req->wb_page != page;
> > > > > +		if (req->wb_context != ctx) {
> > > > > +			do_flush |=
> > > > > +				req->wb_context->dentry != ctx->dentry ||
> > > > > +				req->wb_context->cred != ctx->cred;
> > > > > +		}
> > > > >  		if (l_ctx) {
> > > > >  			do_flush |= l_ctx->lockowner.l_owner != current->files
> > > > >  				|| l_ctx->lockowner.l_pid != current->tgid;
> > > >
> > > > Hi Neil,
> > > >
> > > > Won't this cause any write errors to be propagated to the wrong
> > > > process?
> > >
> > > I'm not sure what this means.
> > >
> > > For a local filesystem, the only error that isn't synchronous is EIO,
> > > and an application can only be sure of getting that if it uses
> > > 'fsync'. An application can get EIO from fsync even if it didn't
> > > write to the page which is causing the problem.
> > > All apps should be able to see the EIO if they bother to fsync.
> >
> > You are describing a consequence of the complete and utter brokenness
> > of the block layer. The block layer doesn't track who wrote what into
> > the page cache, and so it just passes any write errors to whichever
> > process first calls filemap_check_errors() for that file.
>
> I would have said "pragmatic design" rather than "complete ...
> brokenness".
>
> If a write has failed then the file is corrupt and every writer deserves
> to know.
> If multiple processes try to fsync the same file, pages that fail for
> one process will likely fail for the next process, so each should get
> told of the failure.
>
> (actually... I wonder if I'm missing something. If you have a Dirty page
> which results in an Error every time it is written, is there any way for
> it to be removed from the page cache, or will it remain there causing
> every 'fsync' on the file to try to write the same page and get an
> error?)

It will usually be removed. There is no point in keeping a page that is
getting EACCES due to server-side permission checks, for instance. What
we'll do in that case is throw out the page, and then return EACCES to the
next write() by that process and/or the next fsync().

> > In NFS, we use the nfs_open_context to track exactly who wrote what,
> > and use that to notify the specific process that wrote the data that
> > failed.
>
> While this could be seen as an improvement, the extra guarantees are
> only available on NFS, so apps are not likely to depend on them, so the
> extra value doesn't actually help anyone - does it?
>
> And it has a clear cost: two processes on the one machine can suffer
> significantly degraded performance.
> Is the theoretical improvement in semantics worth that cost?

These are the semantics that we've enforced for the past 15 years.
We even used to have a dedicated filp->f_error just for NFS until we were
able to hide it in the filesystem-specific open context. Changing those
semantics now is not something I'm prepared to do without a very good
reason. You've admitted that your own workload would _not_ benefit from
the kind of change that you're advocating, so as far as I'm concerned
that pretty much makes the question moot for now.

> > > For NFS, EDQUOT and ENOSPC are also asynchronous, and they can also
> > > come in through close(). I think every app should still see every
> > > error whether it dirtied the page which triggered the error or not.
> >
> > With this patch, we can end up writing the data using the wrong
> > nfs_open_context, in which case the errors will get reported to the
> > process that actually owns that nfs_open_context.
>
> I see, yes. I would want the error to be returned to whoever instigated
> the 'flush', not whoever issues the 'write'.

That only makes sense if all processes that call fsync() see the same
errors, which is not the case (see above).

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com