NFS stalls when writing - linux 3.6.x

Florian Pritz <bluewind@xxxxxxx> · Sat, 03 Nov 2012 20:29:11 +0100

Hi,

Long text ahead.

Since I have no idea what to look for, I have tried to summarise all the
information that seems relevant. If you need anything more, please tell me.

I've been trying to debug this for days now and might have mixed
something up although I double checked as much as possible while writing
this mail.

# Overview

I've been experiencing stalls when trying to write big-ish files on my
nfs mount for some time (a few months) now. Rsync is also somewhat slow,
transferring only about 1 file per second even if the files are only a
few kilobytes in size. Sometimes it also stalls for a few seconds
between files. I hardly ever run rsync over nfs, so I can't tell whether
this is normal.

Sadly I don't know when this started happening.

Server and client are both running Arch Linux with linux 3.6.5 and
nfs-utils 1.2.6.

The server is running on a striped raid10 array with 4 disks using the
deadline scheduler and connected via Gbit ethernet. The CPU is an Intel
i3-530 and it has 2GB RAM. The raid10 is part of an LVM which contains
the actual XFS file system exported by nfsd.

At first I assumed a problem with the file system, but I switched from
ext3 to XFS and still experience the issue. Transferring large amounts
(>80GB) of data over samba + cifs didn't cause any problems, so I'm
ruling out network and disks.

# Description

dd if=/dev/zero of=test bs=1M count=8000 (writing a 1GB file is also
enough, sometimes)

Watch the network traffic (with "vnstat -l" or conky) and wait until it
drops from 110MB/s to 0-5MB/s (you might need to run dd multiple times,
wait a few minutes/hours or reboot the server)
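
If vnstat or conky is not at hand, the same drop can be watched by sampling the kernel's interface byte counters. The following is only a minimal sketch: the interface name `eth0`, the 5 MB/s threshold and all function names are assumptions chosen for illustration, not part of the original setup.

```python
import time

def read_counters(text, iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev-style text."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Field 0 is received bytes, field 8 is transmitted bytes.
            return int(fields[0]), int(fields[8])
    raise KeyError(iface)

def rate_mb_per_s(before, after, seconds):
    """Convert a byte-counter delta into MB/s (1 MB = 10**6 bytes)."""
    return (after - before) / seconds / 1e6

def watch(iface="eth0", interval=1.0, stall_below=5.0, samples=60):
    """Print rx/tx rates once per interval and flag samples below stall_below."""
    with open("/proc/net/dev") as f:
        prev = read_counters(f.read(), iface)
    for _ in range(samples):
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            cur = read_counters(f.read(), iface)
        rx = rate_mb_per_s(prev[0], cur[0], interval)
        tx = rate_mb_per_s(prev[1], cur[1], interval)
        note = "  <-- possible stall" if max(rx, tx) < stall_below else ""
        print(f"rx {rx:7.1f} MB/s  tx {tx:7.1f} MB/s{note}")
        prev = cur
```

Calling `watch("eth0")` on the client while dd is running prints one line per second. An idle link is flagged too, so the marker only means something while a write is in progress.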

top on the server now shows lots of nfsd threads in D state. iostat only
shows the 0-5MB/s of network traffic going to the disk.
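
The D-state observation from top can also be captured in a script, which makes it easier to log over time. This is a rough sketch that reads `/proc/<pid>/stat` on the server; the function names are made up for illustration, and it assumes the nfsd kernel threads show up with the command name `nfsd`.

```python
import glob

def parse_stat(line):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def blocked_nfsd(stat_lines):
    """Return the pids of nfsd threads in uninterruptible sleep (state D)."""
    return [pid for pid, comm, state in map(parse_stat, stat_lines)
            if comm == "nfsd" and state == "D"]

def scan_proc():
    """Collect the stat line of every process and filter for blocked nfsd."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            continue  # the process exited between glob and open
    return blocked_nfsd(lines)
```

Running `print(scan_proc())` on the server during a stall lists the blocked nfsd pids. On kernels that expose it, reading `/proc/<pid>/stack` as root for one of those pids may show where the thread is waiting.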

A local dd job on the server manages to write 160MB/s while nfsd
continues to hang. Reading from the nfs share while nfsd is hanging is
possible, but has a delay of up to ~20-30 seconds.

After some time the client displays "nfs: server levant not responding,
still trying" in dmesg followed by a "nfs: server levant OK" 0 or more
seconds later (yes, zero). Both messages sometimes appear more than once
at the same time.

Apart from those messages dmesg is clean on either system even after
waiting for a few minutes.

# Environment

## Mount options (from /proc/mounts)

rw,nosuid,nodev,noexec,relatime,vers=4.0,rsize=65536,wsize=65536,
namlen=255,hard,proto=tcp,port=0,timeo=14,retrans=2,sec=sys,
clientaddr=192.168.4.247,local_lock=none,addr=192.168.4.103,user
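
To make the option string above easier to read, it can be split into key/value pairs. The sketch below is illustrative only (the helper name is hypothetical); the one fact it relies on is that nfs(5) documents `timeo` in tenths of a second, so `timeo=14` is a 1.4 second initial timeout, which may be relevant to how quickly the "not responding" message shows up.

```python
def parse_mount_options(opts):
    """Split a comma-separated mount option string into a dict.

    Options without '=' (such as 'rw' or 'hard') map to True.
    """
    result = {}
    for item in opts.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

# Option string as quoted from /proc/mounts in this mail.
OPTS = (
    "rw,nosuid,nodev,noexec,relatime,vers=4.0,rsize=65536,wsize=65536,"
    "namlen=255,hard,proto=tcp,port=0,timeo=14,retrans=2,sec=sys,"
    "clientaddr=192.168.4.247,local_lock=none,addr=192.168.4.103,user"
)

parsed = parse_mount_options(OPTS)
timeo_seconds = int(parsed["timeo"]) / 10   # timeo is given in deciseconds
retrans = int(parsed["retrans"])
wsize_kib = int(parsed["wsize"]) // 1024

print(f"timeo={timeo_seconds}s retrans={retrans} wsize={wsize_kib}KiB "
      f"proto={parsed['proto']} hard={parsed.get('hard', False)}")
```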

## exportfs -v

/mnt/data/nfs
192.168.4.1/24(rw,wdelay,crossmnt,root_squash,all_squash,no_subtree_check,anonuid=999,anongid=999)

## Program versions

Those are all the same on both client and server.

acl 2.2.51-2
libgssglue 0.4-1
libevent 2.0.20-1
librpcsecgss 0.19-7
nfs-utils 1.2.6-2
util-linux 2.22.1-2

# Other notes

I tried reproducing the issue with a virtual machine and it somehow
worked, but I'm not really sure if I actually hit the same issue because
the vm sometimes locks up too.

The VM was set up in qemu with one virtio disk which was directly
partitioned without the use of mdadm or lvm.

Thank you for reading.

-- 
Florian Pritz
