Re: nfs-backed mmap file results in 1000s of WRITEs per second

On Sat, 2013-09-07 at 10:51 -0400, Jeff Layton wrote:
> On Fri, 6 Sep 2013 11:48:45 -0500
> Quentin Barnes <qbarnes@xxxxxxxxx> wrote:
> 
> > Jeff, can you try out my test program in the base note on your
> > RHEL5.9 or later RHEL5.x kernels?
> > 
> > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > kernel (latest released RHEL5.9) does not show the problem for me.
> > Based on what you and Trond have said in this thread though, I'm
> > really curious why it doesn't have the problem.
> > 
> > On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > On Thu, 5 Sep 2013 17:34:20 -0500
> > > Quentin Barnes <qbarnes@xxxxxxxxx> wrote:
> > >
> > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > >> > > > > > Neil Brown posted a patch a couple of days ago for this!
> > >> > > > > >
> > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > >> > > > >
> > >> > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > >> > > > > exhibited the same 1000s of WRITEs/sec problem.
> > >> > > > >
> > >> > > > > Any other ideas?
> > >> > > >
> > >> > > > Yes. Please try the attached patch.
> > >> > >
> > >> > > Great!  That did the trick!
> > >> > >
> > >> > > Do you feel this patch is worthy of being pushed upstream in its
> > >> > > current state, or was it just to verify a theory?
> > >> > >
> > >> > >
> > >> > > In comparing the nfs_flush_incompatible() implementations between
> > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm for
> > >> > > deciding whether or not to flush the page seem more or less
> > >> > > logically equivalent to me.  Also, when and where
> > >> > > nfs_flush_incompatible() is invoked seems the same.  Would you give
> > >> > > me a very brief pointer as to why this problem didn't also manifest
> > >> > > back in the 2.6.18 days?
> > >> >
> > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> > >> > days, and so the risk was that your mmapped writes could end up being
> > >> > sent with the wrong credentials.
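
(For reference, the check under discussion boils down to something like
the sketch below: a pared-down rendering of the v3.11-era fs/nfs code
with locking and error handling elided, not a verbatim copy.)

/*
 * Pared-down sketch of the v3.11-era NFS write fault path; the real
 * code lives in fs/nfs/file.c (nfs_vm_page_mkwrite) and
 * fs/nfs/write.c (nfs_flush_incompatible).
 */

/* Called by the VM when a write fault hits a write-protected page. */
static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct file *filp = vma->vm_file;

	lock_page(page);
	/*
	 * Before redirtying the page, push out any pending write
	 * request on it that was made under an incompatible context.
	 */
	if (nfs_flush_incompatible(filp, page) == 0 &&
	    nfs_updatepage(filp, page, 0, nfs_page_length(page)) == 0)
		return VM_FAULT_LOCKED;

	unlock_page(page);
	return VM_FAULT_SIGBUS;
}

/* The "incompatibility" test, boiled down: */
int nfs_flush_incompatible(struct file *filp, struct page *page)
{
	struct nfs_open_context *ctx = nfs_file_open_context(filp);
	struct nfs_page *req = nfs_page_find_request(page);
	int do_flush = 0;

	if (req != NULL) {
		struct nfs_lock_context *l_ctx = req->wb_lock_context;

		/* Flush if the pending request used other credentials... */
		do_flush = req->wb_page != page || req->wb_context != ctx;
		/*
		 * ...or came from a different lock owner.  After fork()
		 * the parent and child share an open context but differ
		 * here, so alternating stores from the two processes
		 * force a WRITE on every bounce.
		 */
		if (l_ctx != NULL)
			do_flush |= l_ctx->lockowner.l_owner != current->files
				 || l_ctx->lockowner.l_pid != current->tgid;
		nfs_release_request(req);
	}
	if (!do_flush)
		return 0;
	return nfs_wb_page(page_file_mapping(page)->host, page);
}
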
> > >>
> > >> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> > >> the original 2.6.18, so that makes sense.  However, Red Hat
> > >> backported that function starting with their RHEL5.9(*) kernels,
> > >> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> > >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> > >> glancing through it, it all looks pretty close.
> > >>
> > >>
> > >> (*) That was the source I was using when comparing against the
> > >> 3.11 source while studying your patch, since it was the last
> > >> kernel known to me without the problem.
> > >>
> > >
> > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > > private at the moment (I'll see about opening it up). I brought this up
> > > upstream about a year ago with this strawman patch:
> > >
> > >     http://article.gmane.org/gmane.linux.nfs/51240
> > >
> > > ...at the time Trond said he was working on a set of patches to track
> > > the open/lock stateid on a per-req basis. Did that approach not pan
> > > out?
> > >
> > > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> > >
> 
> Yes, I see the same behavior you do. With a recent kernel I see a ton
> of WRITE requests go out, with RHEL5 hardly any.
> 
> I guess I'm a little confused as to the reverse question. Why are we
> seeing this data get flushed out so quickly in recent kernels from just
> changes to the mmapped pages?
> 
> My understanding has always been that when a page is cleaned, we set
> the WP bit on it, and then when it goes dirty we clear it and also
> call page_mkwrite (not necessarily in that order).
> 
> So here we have two processes that mmap the same page, and then are
> furiously writing to it. The kernel shouldn't really care or be aware
> of that thrashing until that page gets flushed out for some reason
> (msync() call or VM pressure).
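
(The scenario, reduced to a minimal sketch: parent and child mmap the
same NFS-backed file MAP_SHARED and hammer the first page.  This is a
hypothetical stand-in for the test program from the base note, which
isn't quoted here.)

/* Hypothetical reproducer: run with a path on an NFS mount. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2)
		exit(1);

	int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		exit(1);

	volatile char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		exit(1);

	fork();		/* parent and child now share the mapped page */
	for (;;)
		p[0]++;	/* write-faults whenever the page has been
			   write-protected again, invoking ->page_mkwrite() */
}
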

fork() is not supposed to share page tables between the parent and child
processes. Shouldn't that also imply that the page write protect bits are
not shared?

IOW: a write protect page fault in the parent process that triggers
page_mkwrite() should not prevent a similar write protect page fault in
the child process (and a subsequent call to page_mkwrite()).

...or is my understanding of the page fault semantics wrong?

> IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more
> recent kernels don't behave that way.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com