Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing

I've been looking in more detail at the network traces that started all 
this, and doing some additional testing with the 2.6.29 kernel in an 
NFS-only build...

In brief:
1) RHEL 5 generates more than 3x the network write traffic of RHEL 4 when 
linking Samba's smbd.
2) In RHEL 5, those unnecessary writes are slowed down by the "FILE_SYNC" 
optimization put in place for small writes.
3) That optimization seems to have been removed from the kernel somewhere 
between 2.6.18 and 2.6.29.
4) Unfortunately, the "unnecessary write before read" behavior is still 
present in 2.6.29.

In detail:
In RHEL 5, I see a lot of reads from offset {whatever} *immediately* 
preceded by a write to *the same offset*. This is obviously a bad thing; 
the trick now is finding out where it is coming from. The 
write-before-read behavior is happening on the smbd file itself (not 
surprising, since that's the only file we're writing in this test...). This 
happens with every 2.6.18 and later kernel I've tested to date.
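
In case it helps anyone reproduce this, here is a minimal sketch of the 
access pattern I *think* is involved -- the file name and offsets are 
made up, and the idea that a partial-page write followed by a read of 
the same page triggers the flush is my assumption, not something I have 
confirmed in the client code:

    /* repro.c -- hypothetical reproducer sketch.  Build: cc -o repro repro.c
     * Run it against a file on the NFS mount while capturing traffic;
     * on the affected kernels the capture may show a WRITE immediately
     * before the READ. */
    #define _XOPEN_SOURCE 500
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[16] = "partial update";
        char page[4096];
        int fd = open("testfile", O_RDWR);  /* file on the NFS mount */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Dirty a few bytes in the middle of the first page... */
        if (pwrite(fd, buf, sizeof(buf), 100) < 0)
            perror("pwrite");
        /* ...then immediately read back the whole page.  The client may
         * have to flush the dirty bytes before it can issue the READ. */
        if (pread(fd, page, sizeof(page), 0) < 0)
            perror("pread");
        close(fd);
        return 0;
    }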

In RHEL 5, most of the writes are FILE_SYNC writes, which appear to take 
something on the order of 10ms to come back. When using a 2.6.29 kernel, 
the TOTAL time for the write+commit RPC set (write rpc, write reply, 
commit rpc, commit reply) to come back is something like 2ms. I guess the 
NFS servers aren't handling FILE_SYNC writes very well. In 2.6.29, ALL the 
write calls appear to be unstable writes; in RHEL 5, most are FILE_SYNC 
writes. (Network traces available upon request.)
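
For anyone following along, the stability levels are part of the NFSv3 
protocol itself: every WRITE request carries one of three stable_how 
values, defined in RFC 1813 (expressed here as C):

    /* NFSv3 write stability levels, per RFC 1813, section 3.3.7. */
    enum stable_how {
        UNSTABLE  = 0,  /* server may reply before the data is on disk;
                         * the client must send a COMMIT later          */
        DATA_SYNC = 1,  /* file data is on stable storage before the
                         * reply, but metadata may not be               */
        FILE_SYNC = 2,  /* both data and metadata are on stable storage
                         * before the reply                             */
    };

If that is what's going on, the ~10ms FILE_SYNC latency would simply be 
the server syncing to disk before replying, while the unstable 
write+commit pair lets the server defer and batch the sync.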

Neither is quite as fast as RHEL 4, because the link under RHEL 4 only 
puts about 150 WRITE RPCs on the wire. RHEL 5 generates more than 500 
when building on NFS, and 2.6.29 puts about 340 WRITE RPCs, plus a 
similar number of COMMITs, on the wire. 

The bottom line:
* If someone can help me find where 2.6 stopped setting small writes to 
FILE_SYNC, I'd appreciate it. It would save me time walking through >50 
commitdiffs in gitweb... (one idea on narrowing that down below)
* Is this the correct place to start discussing the annoying 
write-before-almost-every-read behavior that 2.6.18 picked up and that 
2.6.29 still has? 
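
On the first point: assuming the stable-write decision lives in the 
client write path, running something like 
"git log v2.6.18..v2.6.29 -- fs/nfs/write.c" against Linus's tree should 
cut the candidate commits down considerably, and a git bisect between 
those two tags could pin down the exact change. (The path is a guess on 
my part.)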

=================================================================
Brian Cowan
Advisory Software Engineer
ClearCase Customer Advocacy Group (CAG)
Rational Software
IBM Software Group
81 Hartwell Ave
Lexington, MA
 
Phone: 1.781.372.3580
Web: http://www.ibm.com/software/rational/support/
 

Please be sure to update your PMR using ESR at 
http://www-306.ibm.com/software/support/probsub.html or cc all 
correspondence to sw_support@xxxxxxxxxx to be sure your PMR is updated in 
case I am not available.



From: Trond Myklebust <trond.myklebust@xxxxxxxxxx>
To: Carlos Carvalho <carlos@xxxxxxxxxxxxxx>
Cc: linux-nfs@xxxxxxxxxxxxxxx
Date: 06/03/2009 01:10 PM
Subject: Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing
Sent by: linux-nfs-owner@xxxxxxxxxxxxxxx



On Wed, 2009-06-03 at 13:22 -0300, Carlos Carvalho wrote:
> Trond Myklebust (trond.myklebust@xxxxxxxxxx) wrote on 2 June 2009 13:27:
>  >Write gathering relies on waiting an arbitrary length of time in order
>  >to see if someone is going to send another write. The protocol offers no
>  >guidance as to how long that wait should be, and so (at least on the
>  >Linux server) we've coded in a hard wait of 10ms if and only if we see
>  >that something else has the file open for writing.
>  >One problem with the Linux implementation is that the "something else"
>  >could be another nfs server thread that happens to be in nfsd_write(),
>  >however it could also be another open NFSv4 stateid, or a NLM lock, or a
>  >local process that has the file open for writing.
>  >Another problem is that the nfs server keeps a record of the last file
>  >that was accessed, and also waits if it sees you are writing again to
>  >that same file. Of course it has no idea if this is truly a parallel
>  >write, or if it just happens that you are writing again to the same file
>  >using O_SYNC...
> 
> I think the decision to write or wait doesn't belong to the nfs
> server; it should just send the writes immediately. It's up to the
> fs/block/device layers to do the gathering. I understand that the
> client should try to do the gathering before sending the request to
> the wire

This isn't something that we've just pulled out of a hat. It dates back
to pre-NFSv3 times, when every write had to be synchronously committed
to disk before the RPC call could return.

See, for instance,

http://books.google.com/books?id=y9GgPhjyOUwC&pg=PA243&lpg=PA243&dq=What+is+nfs+write+gathering&source=bl&ots=M8s0XS2SLd&sig=ctmxQrpII2_Ti4czgpGZrF9mmds&hl=en&ei=Xa0mSrLMC8iptgfSsqHsBg&sa=X&oi=book_result&ct=result&resnum=3

The point is that while it is a good idea for NFSv2, we have much better
methods of dealing with multiple writes in NFSv3 and v4...

Trond
