Re: [nfsv4]nfs client bug

Benny Halevy <bhalevy@xxxxxxxxxxx> · Thu, 30 Jun 2011 18:42:02 +0300

On 2011-06-30 18:35, Trond Myklebust wrote:
> On Thu, 2011-06-30 at 18:13 +0300, Benny Halevy wrote: 
>> On 2011-06-30 17:24, Trond Myklebust wrote:
>>> On Thu, 2011-06-30 at 09:36 -0400, Andy Adamson wrote: 
>>>> On Jun 29, 2011, at 10:32 PM, quanli gui wrote:
>>>>
>>>>> When I run iperf from one client to the 4 DSes, the network
>>>>> throughput is 890MB/s. That shows the network is indeed 10GE non-blocking.
>>>>>
>>>>> a. About block size: I use bs=1M with dd.
>>>>> b. We do use TCP (doesn't NFSv4 use TCP by default?).
>>>>> c. What are jumbo frames? How do I set the MTU automatically?
>>>>>
>>>>> Brian, do you have some more tips?
>>>>
>>>> 1) Set the MTU on the 10G interface of both the client and the server. Sometimes 9000 is too high; my setup uses 8000.
>>>> To set the MTU on interface eth0:
>>>>
>>>> % ifconfig eth0 mtu 9000
>>>>
>>>> iperf will report the MTU of the full path between client and server - use it to verify the MTU of the connection.
>>>>
>>>> 2) Increase the # of rpc_slots on the client.
>>>> % echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
>>>>
>>>> 3) Increase the # of server threads
>>>>
>>>> % echo 128 > /proc/fs/nfsd/threads
>>>> % service nfs restart
>>>>
>>>> 4) Ensure the TCP buffers on both the client and the server are large enough for the TCP window.
>>>> Calculate the required buffer size by pinging the server from the client with an MTU-sized packet and multiplying the round-trip time by the interface capacity:
>>>>
>>>> % ping -s 9000 server  - say 108 ms average
>>>>
>>>> 10 Gbits/sec = 1,250,000,000 bytes/sec
>>>> 1,250,000,000 bytes/sec * 0.108 sec = 135,000,000 bytes
>>>>
>>>> Use this number to set the following: 
>>>> sysctl -w net.core.rmem_max=135000000
>>>> sysctl -w net.core.wmem_max=135000000
>>>> sysctl -w net.ipv4.tcp_rmem="<first number unchanged> <second unchanged> 135000000"
>>>> sysctl -w net.ipv4.tcp_wmem="<first number unchanged> <second unchanged> 135000000"
>>>>
>>>> 5) mount with rsize=131072,wsize=131072
>>>
>>> 6) Note that NFS always guarantees that the file is _on_disk_ after
>>> close(), so if you are using 'dd' to test, then you should be using the
>>> 'conv=fsync' flag (e.g. 'dd if=/dev/zero of=test count=20k conv=fsync')
>>> in order to obtain a fair comparison between the NFS and local disk
>>> performance. Otherwise, you are comparing NFS and local _pagecache_
>>> performance.
>>
>> FWIW, modern versions of GNU dd (not sure exactly which version changed that)
>> calculate and report throughput after close()ing the output file.
> 
> ...but not after syncing it unless you explicitly request that.
> 
> On most (all?) local filesystems, close() does not imply fsync().

Right.  My point is that for benchmarking NFS, conv=fsync won't show
any noticeable difference. We're in complete agreement that it's required
for benchmarking local file system performance.
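
For anyone who wants to check this on their own setup, here is a rough
sketch that runs the same dd write with and without conv=fsync and prints
dd's summary line for each. It assumes GNU dd (for bs=1M and conv=fsync);
TARGET, COUNT and run_dd are names made up for illustration, and the default
TARGET is just a temporary file, so point it at the NFS mount or the local
disk you want to compare.

```shell
#!/bin/sh
# Sketch: compare a dd write with and without conv=fsync.
# TARGET and COUNT are illustrative; override them from the environment.
TARGET=${TARGET:-$(mktemp)}   # file to write; set this to a path on the mount under test
COUNT=${COUNT:-16}            # number of 1 MiB blocks to write

run_dd() {
    # $1: optional extra dd operand, e.g. conv=fsync (left unquoted so an
    # empty value adds no argument). dd prints its throughput summary on
    # stderr, so merge it into stdout and keep only the last line.
    dd if=/dev/zero of="$TARGET" bs=1M count="$COUNT" $1 2>&1 | tail -n 1
}

echo "without fsync: $(run_dd '')"
echo "with fsync:    $(run_dd conv=fsync)"
rm -f "$TARGET"
```

On a local filesystem the first number mostly reflects the page cache; on
NFS the two should come out close if the above holds.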

Benny

> 
> Trond
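
The buffer-size arithmetic in step 4 of the quoted tips can be sketched as a
small script. This is only an illustration under stated assumptions: the
10 Gbit/s link rate and 108 ms round-trip time are the example figures from
the thread, the names LINK_BITS_PER_SEC, RTT_MS and bdp_bytes are invented
here, and the script only prints the sysctl commands rather than applying
them. It also assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Bandwidth-delay product: bytes in flight = link rate (bytes/s) * RTT (s).
# Both inputs are example values from the thread; override via environment.
LINK_BITS_PER_SEC=${LINK_BITS_PER_SEC:-10000000000}   # 10 Gbit/s
RTT_MS=${RTT_MS:-108}                                  # average ping RTT in ms

bdp_bytes() {
    # $1 = link rate in bits/s, $2 = RTT in milliseconds (integers only)
    echo $(( $1 / 8 * $2 / 1000 ))
}

BUF=$(bdp_bytes "$LINK_BITS_PER_SEC" "$RTT_MS")
echo "required buffer: $BUF bytes"

# Print (do not apply) the matching sysctl commands.
echo "sysctl -w net.core.rmem_max=$BUF"
echo "sysctl -w net.core.wmem_max=$BUF"
```

With the default inputs this reproduces the 135,000,000-byte figure above;
measure the real round-trip time on your own path before using the result.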
