On Mon, Apr 15, 2019 at 06:00:56PM +0100, Bruno Santos wrote:
> We have a Debian stretch HPC cluster (#1 SMP Debian 4.9.130-2
> (2018-10-27)). One of the machines mounts a couple of drives from a
> Dell Compellent system and shares them across a 10GB network to 4
> different machines.
>
> We had the NFS server crashing a few weeks ago because the file-max
> limit had been reached. At the time we increased the number of file
> handles it can handle and have been monitoring since. We have noticed
> that the number of entries on that machine keeps increasing, though,
> and despite our best efforts we have been unable to identify the cause.
> Anything I can find related to this is from a well-known bug in 2011
> and nothing afterwards. We are assuming this is caused by a leak of
> file handles on the NFS side, but are not sure.
>
> Does anyone have any way of figuring out what is causing this? Output
> from file-nr, lsof, etc. is below.

Off the top of my head, the only idea I have is to try watching

	grep nfsd4 /proc/slabinfo

and see if any of those objects are also leaking.

--b.

>
> Thank you very much for any help you can provide.
>
> Best regards,
> Bruno Santos
>
> :~# while :;do echo "$(date): $(cat /proc/sys/fs/file-nr)";sleep 30;done
> Mon 15 Apr 17:23:11 BST 2019: 2466176 0 4927726
> Mon 15 Apr 17:23:41 BST 2019: 2466176 0 4927726
> Mon 15 Apr 17:24:11 BST 2019: 2466336 0 4927726
> Mon 15 Apr 17:24:41 BST 2019: 2466240 0 4927726
> Mon 15 Apr 17:25:11 BST 2019: 2466560 0 4927726
> Mon 15 Apr 17:25:41 BST 2019: 2466336 0 4927726
> Mon 15 Apr 17:26:11 BST 2019: 2466400 0 4927726
> Mon 15 Apr 17:26:41 BST 2019: 2466432 0 4927726
> Mon 15 Apr 17:27:11 BST 2019: 2466688 0 4927726
> Mon 15 Apr 17:27:41 BST 2019: 2466624 0 4927726
> Mon 15 Apr 17:28:11 BST 2019: 2466784 0 4927726
> Mon 15 Apr 17:28:41 BST 2019: 2466688 0 4927726
> Mon 15 Apr 17:29:11 BST 2019: 2466816 0 4927726
> Mon 15 Apr 17:29:42 BST 2019: 2466752 0 4927726
> Mon 15 Apr 17:30:12 BST 2019: 2467072 0 4927726
> Mon 15 Apr 17:30:42 BST 2019: 2466880 0 4927726
>
> ~# lsof|wc -l
> 3428
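
The slabinfo check suggested in the reply could be sketched roughly as
below. This is only a minimal sketch: it assumes /proc/slabinfo uses the
2.x layout (cache name, active_objs, num_objs as the first three
columns), and the function names nfsd4_slabs and watch_nfsd4 are
hypothetical. It prints the file-nr counters next to the nfsd4 slab
counts in the same 30-second style as the loop above, so the two can be
compared over time.

```shell
#!/bin/sh
# Hypothetical helper: read slabinfo-format text on stdin and print
# "name active_objs num_objs" for every cache whose name starts with nfsd4.
# Assumes the 2.x column layout (an assumption, check the header line).
nfsd4_slabs() {
    awk '$1 ~ /^nfsd4/ { print $1, $2, $3 }'
}

# Hypothetical helper: every INTERVAL seconds (default 30), print the
# file-nr counters followed by the current nfsd4 slab counts.
# Reading /proc/slabinfo typically requires root.
watch_nfsd4() {
    interval=${1:-30}
    while :; do
        echo "$(date): file-nr: $(cat /proc/sys/fs/file-nr)"
        nfsd4_slabs < /proc/slabinfo
        sleep "$interval"
    done
}
```

Running `watch_nfsd4 30` as root on the server would then show whether
any nfsd4 cache grows in step with the first file-nr column.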