Re: high latency NFS

J. Bruce Fields wrote:
> You might get more responses from the linux-nfs list (cc'd).
>
> --b.
>
> On Thu, Jul 24, 2008 at 01:11:31PM -0400, Michael Shuey wrote:
>   
>>
>> iozone is reading/writing a file twice the size of memory on the client with 
>> a 32k block size.  I've tried raising this as high as 16 MB, but I still 
>> see around 6 MB/sec reads.
>>     
That won't make a skerrick of difference with rsize/wsize=32K.
>> I'm using a 2.6.9 derivative (yes, I'm a RHEL4 fan).  Testing with a stock 
>> 2.6, client and server, is the next order of business.
>>
>> NFS mount is tcp, version 3.  rsize/wsize are 32k.
Try wsize=rsize=1M.
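I.e. something along these lines (hostname, export path and mount
point are placeholders, and an older client or server will silently
clamp the values back to its own maximum):

  mount -t nfs -o vers=3,tcp,rsize=1048576,wsize=1048576 \
      server:/export /mnt/nfs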
>>   Both client and server 
>> have had tcp_rmem, tcp_wmem, wmem_max, rmem_max, wmem_default, and 
>> rmem_default tuned - tuning values are 12500000 for defaults (and minimum 
>> window sizes), 25000000 for the maximums.  Inefficient, yes, but I'm not 
>> concerned with memory efficiency at the moment.
>>     
You're aware that the server screws these up again, at least for
writes?  There was a long sequence of threads on linux-nfs about this
recently, starting with

http://marc.info/?l=linux-nfs&m=121312415114958&w=2

which is Dean Hildebrand posting a patch to make the knfsd behaviour
tunable.  ToT still looks broken.  I've been using the attached patch (I
believe a similar one was posted later in the thread by Olga
Kornievskaia)  for low-latency high-bandwidth 10ge performance work,
where it doesn't help but doesn't hurt either.  It should help for your
high-latency high-bandwidth case.  Keep your tunings though; one of
them will be affecting the TCP window scale negotiated at connect time.
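For reference, those tunings amount to something like the following in
/etc/sysctl.conf terms (same values you quoted; the receive-side
maximum is the one that feeds the window scale):

  net.core.rmem_default = 12500000
  net.core.wmem_default = 12500000
  net.core.rmem_max = 25000000
  net.core.wmem_max = 25000000
  net.ipv4.tcp_rmem = 12500000 12500000 25000000
  net.ipv4.tcp_wmem = 12500000 12500000 25000000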
>> Both client and server kernels have been modified to provide 
>> larger-than-normal RPC slot tables.  I allow a max of 1024, but I've found 
>> that actually enabling more than 490 entries in /proc causes mount to 
>> complain it can't allocate memory and die.  That was somewhat surprising, 
>> given I had 122 GB of free memory at the time...
>>     
That number is used to size a physically contiguous kmalloc()ed array of
slots.  With a large wsize you don't need such large slot table sizes or
large numbers of nfsds to fill the pipe.

And yes, the default number of nfsds is utterly inadequate.
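For illustration only (the numbers below are placeholders, not
recommendations): on an unmodified kernel the slot table tunable tops
out around 128 and is read when the transport is created (i.e. at
mount time), and the server thread count can be raised through the
nfsd filesystem:

  # client, before mounting: RPC slot table entries
  echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
  # server: run far more threads than the usual handful
  echo 64 > /proc/fs/nfsd/threads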
>> I've also applied a couple patches to allow the NFS readahead to be a 
>> tunable number of RPC slots. 
There's a patch in SLES to do that, which I'd very much like to see
in kernel.org (Neil?).  The default NFS readahead multiplier value is
pessimal and guarantees worst-case alignment of READ rpcs during
streaming reads, so we tune it from 15 to 16.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.

Index: linux-2.6.16/net/sunrpc/svcsock.c
===================================================================
--- linux-2.6.16.orig/net/sunrpc/svcsock.c	2008-06-16 15:39:01.774672997 +1000
+++ linux-2.6.16/net/sunrpc/svcsock.c	2008-06-16 15:45:06.203421620 +1000
@@ -1157,13 +1159,13 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		 * particular pool, which provides an upper bound
 		 * on the number of threads which will access the socket.
 		 *
-		 * rcvbuf just needs to be able to hold a few requests.
-		 * Normally they will be removed from the queue 
-		 * as soon a a complete request arrives.
+		 * rcvbuf needs the same room as sndbuf, to allow
+		 * workloads comprising mostly WRITE calls to flow
+		 * at a reasonable fraction of line speed.
 		 */
 		svc_sock_setbufsize(svsk->sk_sock,
 				    (serv->sv_nrthreads+3) * serv->sv_bufsz,
-				    3 * serv->sv_bufsz);
+				    (serv->sv_nrthreads+3) * serv->sv_bufsz);
 
 	svc_sock_clear_data_ready(svsk);
 
