Fwd: nfs v4.2 leaking file descriptors

Bruno Santos <bacmsantos@xxxxxxxxx> · Mon, 15 Apr 2019 18:00:56 +0100

Hi all,

We have a Debian stretch HPC cluster (kernel: #1 SMP Debian 4.9.130-2
(2018-10-27)). One of the machines mounts a couple of drives from a
Dell Compellent system and shares them across a 10GB network to 4
different machines.

We had the NFS server crash a few weeks ago because the file-max
limit had been reached. At the time we increased the maximum number of
file handles and have been monitoring it since. We have noticed that
the number of allocated file handles on that machine keeps increasing,
though, and despite our best efforts we have been unable to identify
the cause. Anything I can find related to this is from a well-known bug
in 2011, with nothing afterwards. We are assuming this is caused by a
leak of file handles on the NFS side, but we are not sure.

Does anyone have a way of figuring out what is causing this? Output
from file-nr, lsof, etc. is below.

Thank you very much for any help you can provide.

Best regards,
Bruno Santos

:~# while :; do echo "$(date): $(cat /proc/sys/fs/file-nr)"; sleep 30; done
Mon 15 Apr 17:23:11 BST 2019: 2466176   0       4927726
Mon 15 Apr 17:23:41 BST 2019: 2466176   0       4927726
Mon 15 Apr 17:24:11 BST 2019: 2466336   0       4927726
Mon 15 Apr 17:24:41 BST 2019: 2466240   0       4927726
Mon 15 Apr 17:25:11 BST 2019: 2466560   0       4927726
Mon 15 Apr 17:25:41 BST 2019: 2466336   0       4927726
Mon 15 Apr 17:26:11 BST 2019: 2466400   0       4927726
Mon 15 Apr 17:26:41 BST 2019: 2466432   0       4927726
Mon 15 Apr 17:27:11 BST 2019: 2466688   0       4927726
Mon 15 Apr 17:27:41 BST 2019: 2466624   0       4927726
Mon 15 Apr 17:28:11 BST 2019: 2466784   0       4927726
Mon 15 Apr 17:28:41 BST 2019: 2466688   0       4927726
Mon 15 Apr 17:29:11 BST 2019: 2466816   0       4927726
Mon 15 Apr 17:29:42 BST 2019: 2466752   0       4927726
Mon 15 Apr 17:30:12 BST 2019: 2467072   0       4927726
Mon 15 Apr 17:30:42 BST 2019: 2466880   0       4927726
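
For reading the numbers above: the three fields of /proc/sys/fs/file-nr are the number of allocated file handles, the number of allocated but unused handles, and the system-wide maximum (file-max). A minimal sketch of a helper that labels them and shows how close the system is to the limit follows; the function name file_nr_summary is only illustrative and not part of any tool.

```shell
# Summarise a file-nr style line: "allocated unused max".
# With no argument, the live /proc/sys/fs/file-nr is read instead.
# file_nr_summary is a hypothetical helper name used for illustration.
file_nr_summary() {
    line=${1:-$(cat /proc/sys/fs/file-nr)}
    # Rely on word splitting to separate the three whitespace-delimited fields.
    set -- $line
    allocated=$1
    unused=$2
    max=$3
    # Integer percentage of the file-max limit currently allocated.
    echo "allocated=$allocated unused=$unused max=$max used=$(( allocated * 100 / max ))%"
}

# First sample from the log above.
file_nr_summary "2466176 0 4927726"
```

Run against the first sample, this reports roughly half of the limit in use, so at the growth seen in the log the raised limit only postpones the problem rather than removing it.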

~# lsof|wc -l
3428
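
The gap between the lsof count and the file-nr count is the interesting part. lsof only reports files attached to processes, while file-nr counts every file handle the kernel has allocated. One possible reading, and it is only an assumption at this point, is that the extra handles are held inside the kernel (for example by the in-kernel NFS server) and so never appear in any process's descriptor table. A rough sketch for cross-checking the per-process total without lsof is below; count_proc_fds is a hypothetical helper name, and it needs root to see every process.

```shell
# Count descriptors visible in per-process fd tables under a proc root.
# count_proc_fds is a hypothetical helper name; the root defaults to /proc.
count_proc_fds() {
    root=${1:-/proc}
    total=0
    for dir in "$root"/[0-9]*/fd; do
        # Skip processes that exited or that we cannot read.
        [ -d "$dir" ] || continue
        n=$(ls -A "$dir" 2>/dev/null | wc -l)
        total=$(( total + n ))
    done
    echo "$total"
}
```

Calling count_proc_fds with no argument as root and comparing the result with the first field of /proc/sys/fs/file-nr shows how many allocated handles are not accounted for by any process.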