Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

Oleg Drokin <green@xxxxxxxxxxxxxx> · Tue, 7 Jun 2016 13:30:15 -0400

On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
>> Hello!
>> 
>>    I've been trying to better understand this problem I was having where sometimes
>>    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
>> 
>>    I finally traced it to a leaked filedescriptor that was allocated from
>>    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
>> 
>>    Also together with it we see leaked credentials allocated along the same path from
>>    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
>>    are presumably used by the credentials.
>> 
>>    Unfortunately I was not able to make total sense out of the state handling in nfsd,
>>    but it's clear that one of the file descriptors inside struct nfs4_file is
>>    lost. I added a patch like this (always a good idea, so surprised it was not
>>    there already):
>> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>  {
>>         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>>  
>> +       WARN_ON(fp->fi_fds[0]);
>> +       WARN_ON(fp->fi_fds[1]);
>> +       WARN_ON(fp->fi_fds[2]);
>>         kmem_cache_free(file_slab, fp);
>>  }
>> 
>>    And when the problem is hit, I am also triggering (Always this one which is fd[1])
>> [ 3588.143002] ------------[ cut here ]------------
>> [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
>> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
>> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
>> [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
>> [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
>> [ 3588.153834] Call Trace:
>> [ 3588.153839]  [] dump_stack+0x86/0xc1
>> [ 3588.153841]  [] __warn+0xcb/0xf0
>> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
>> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
>> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
>> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
>> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
>> [ 3588.153866]  [] kthread+0x101/0x120
>> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
>> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
>> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
>> 
>> 
>>   release_all_access() seems to be doing correct job of all that cleaning, so
>>   there must be some other path that I do not quite see.
>> 
>>   Hopefully you are more familiar with the code and can see the problem right away ;)
> 
> Hmm...well I'm not seeing it right offhand, and haven't been able to
> reproduce the problem so far after a couple of attempts by hand. What
> sort of workload are you running before you see that warning pop?

Well, the workload is a crazed twisty conflicting operations set in a very limited
namespace (called racer.sh,
http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )

It does all sorts of crazy unimaginable stuff that is hard to predict, and typically I
run it on several mountpoints all pointing to the same spot
(now 3-ways: one instance on direct mount point in ext4, one on nfs mount mounted as
mount -o nfs and one on mount -o nfs4).
It must be hitting some rare race or a race + operations combination that leads to
the leak.

To reproduce I run it like this until it stops (and it usually stops in
2 or 3 iterations, so it takes some time):
$ cat ~/stress.sh 
#!/bin/bash

set -x

cd /home/green/git/lustre-release/lustre/tests/racer
dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
mkfs.ext4 /tmp/loop
service rpcbind start
mount none /var/lib/nfs -t tmpfs
touch /var/lib/nfs/etab
service nfs-mountd start

while :; do

mount /tmp/loop /mnt/lustre -o loop || exit 1
mkdir /mnt/lustre/racer
service nfs-server start || exit 2
mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
DURATION=3600 sh racer.sh /mnt/nfs/racer &
DURATION=3600 sh racer.sh /mnt/nfs2/racer &
DURATION=3600 sh racer.sh /mnt/lustre/racer &
wait %1 %2 %3
umount /mnt/nfs || exit 5
umount /mnt/nfs2 || exit 6
service nfs-server stop || exit 7
umount /mnt/lustre || exit 8
e2fsck -n -f /tmp/loop || exit 9
done

Please ignore all the Lustre references, the test is generic, it's just I am running
off nfs-readonly-root setup, so want to reuse existing mountpoint locations,
there's no Lustre in picture here).
With all the other problems from before fixed, now it typically stops either for
this leak discussed here, or for this other problem in nfs4 I described here:
http://marc.info/?l=linux-nfs&m=146518324823160&w=2--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html