Thank you for the suggestion. I didn't actually know anything about slabinfo; very useful. And yes, I can see the link there, with nfsd4_stateids and nfsd4_files increasing over time.
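For reference, a minimal watch loop along the lines of the file-nr one quoted below should show that growth over time (a sketch only; it assumes the usual /proc/slabinfo layout, where the second column is the active object count):

# while :; do echo "$(date): $(grep nfsd4 /proc/slabinfo | awk '{print $1, $2}' | xargs)"; sleep 30; done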
I can also see the following messages appearing in dmesg very frequently on the server side:

[5589745.074567] NFSD: client 10.1.10.190 testing state ID with incorrect client ID
[5589745.076382] NFSD: client 10.1.10.83 testing state ID with incorrect client ID
[5589745.077867] NFSD: client 10.1.10.190 testing state ID with incorrect client ID
[5589745.080368] NFSD: client 10.1.10.83 testing state ID with incorrect client ID
[5589745.080998] NFSD: client 10.1.10.190 testing state ID with incorrect client ID
[5589745.085418] NFSD: client 10.1.10.83 testing state ID with incorrect client ID

Also, running:

#tshark -tad -i bond0 -Y "(nfs.status != 0)"
25574 2019-04-17 12:09:03.580443721 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 25549) LOOKUP Status: NFS4ERR_NOENT
261264 2019-04-17 12:09:39.692976173 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 259893) CLOSE Status: NFS4ERR_OLD_STATEID
261273 2019-04-17 12:09:39.705359506 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 260065) CLOSE Status: NFS4ERR_OLD_STATEID
261649 2019-04-17 12:09:39.718857437 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 260517) CLOSE Status: NFS4ERR_OLD_STATEID
290633 2019-04-17 12:09:41.520550924 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 290588) CLOSE Status: NFS4ERR_OLD_STATEID
292829 2019-04-17 12:09:41.572183869 10.1.10.191 → 10.1.10.170 NFS 166 [TCP ACKed unseen segment] V4 Reply (Call In 291959) LOOKUP Status: NFS4ERR_NOENT
297536 2019-04-17 12:09:41.728307777 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 297266) LOOKUP Status: NFS4ERR_NOENT
326074 2019-04-17 12:09:43.872781379 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 326073) CLOSE Status: NFS4ERR_OLD_STATEID
326079 2019-04-17 12:09:43.873222013 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 326078) CLOSE Status: NFS4ERR_OLD_STATEID
326103 2019-04-17 12:09:43.873925732 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 326102) CLOSE Status: NFS4ERR_OLD_STATEID
328402 2019-04-17 12:09:44.002984136 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 328400) CLOSE Status: NFS4ERR_OLD_STATEID

root@ni-hpc-02:~# tshark -tad -i bond0 -Y "(nfs.status != 0)"
Running as user "root" and group "root". This could be dangerous.
Capturing on 'bond0'
53046 2019-04-17 12:10:44.818936562 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 53043) CLOSE Status: NFS4ERR_OLD_STATEID
53048 2019-04-17 12:10:44.818967848 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 53044) CLOSE Status: NFS4ERR_OLD_STATEID
53049 2019-04-17 12:10:44.818990681 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 53045) CLOSE Status: NFS4ERR_OLD_STATEID
53057 2019-04-17 12:10:44.819085752 10.1.10.191 → 10.1.10.170 NFS 410 V4 Reply (Call In 53052) CLOSE Status: NFS4ERR_OLD_STATEID ; V4 Reply (Call In 53052) GETATTR
53068 2019-04-17 12:10:44.819315507 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 53065) CLOSE Status: NFS4ERR_OLD_STATEID
53076 2019-04-17 12:10:44.819542785 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 53074) CLOSE Status: NFS4ERR_OLD_STATEID
53672 2019-04-17 12:10:44.908883083 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 53082) CLOSE Status: NFS4ERR_OLD_STATEID
88712 2019-04-17 12:10:46.682247016 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 88711) CLOSE Status: NFS4ERR_OLD_STATEID
88713 2019-04-17 12:10:46.682262932 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 88711) CLOSE Status: NFS4ERR_OLD_STATEID
88725 2019-04-17 12:10:46.683030796 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 88723) CLOSE Status: NFS4ERR_OLD_STATEID
88730 2019-04-17 12:10:46.683209944 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 88729) CLOSE Status: NFS4ERR_OLD_STATEID
93002 2019-04-17 12:10:46.879404035 10.1.10.191 → 10.1.10.170 NFS 898 V4 Reply (Call In 92976) GETATTR ; V4 Reply (Call In 92990) GETATTR ; V4 Reply (Call In 92997) GETATTR ; V4 Reply (Call In 92999) CLOSE Status: NFS4ERR_OLD_STATEID
93052 2019-04-17 12:10:46.880706310 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 93051) CLOSE Status: NFS4ERR_OLD_STATEID
93056 2019-04-17 12:10:46.880830245 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 93054) CLOSE Status: NFS4ERR_OLD_STATEID
126695 2019-04-17 12:10:48.760521096 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 126567) CLOSE Status: NFS4ERR_OLD_STATEID
126700 2019-04-17 12:10:48.760586900 10.1.10.191 → 10.1.10.170 NFS 410 V4 Reply (Call In 126568) GETATTR ; V4 Reply (Call In 126570) CLOSE Status: NFS4ERR_OLD_STATEID
126704 2019-04-17 12:10:48.760669346 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 126578) CLOSE Status: NFS4ERR_OLD_STATEID
126708 2019-04-17 12:10:48.760786205 10.1.10.191 → 10.1.10.170 NFS 410 V4 Reply (Call In 126586) GETATTR ; V4 Reply (Call In 126587) CLOSE Status: NFS4ERR_OLD_STATEID
226205 2019-04-17 12:10:56.798335876 10.1.10.191 → 10.1.10.170 NFS 266 V4 Reply (Call In 226170) OPEN Status: NFS4ERR_DELAY ; V4 Reply (Call In 226199) TEST_STATEID
226206 2019-04-17 12:10:56.798412890 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 226200) OPEN Status: NFS4ERR_DELAY
253665 2019-04-17 12:10:58.048772590 10.1.10.191 → 10.1.10.170 NFS 1470 V4 Reply (Call In 253658) READ ; V4 Reply (Call In 253659) CLOSE ; V4 Reply (Call In 253661) CLOSE Status: NFS4ERR_OLD_STATEID
253671 2019-04-17 12:10:58.048952402 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 253667) CLOSE Status: NFS4ERR_OLD_STATEID
368579 2019-04-17 12:11:04.582691779 10.1.10.191 → 10.1.10.170 NFS 166 V4 Reply (Call In 368574) CLOSE Status: NFS4ERR_BAD_STATEID
^C66792 packets dropped
23 packets captured
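To see which errors dominate over a longer window, the same capture could be tallied with something along these lines (just a sketch; it only counts the NFS4ERR_* strings in the one-line summaries above, and the totals print once the capture is stopped with Ctrl-C):

# tshark -tad -i bond0 -Y "(nfs.status != 0)" 2>/dev/null | grep -o 'NFS4ERR_[A-Z_]*' | sort | uniq -c | sort -rn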
There seems to be nothing on any of those clients' side of things.

Bruno

On Tue, 16 Apr 2019 at 21:46, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>
> On Mon, Apr 15, 2019 at 06:00:56PM +0100, Bruno Santos wrote:
> > We have a Debian stretch HPC cluster (#1 SMP Debian 4.9.130-2 (2018-10-27)).
> > One of the machines mounts a couple of drives from a Dell Compellent system
> > and shares it across a 10GB network to 4 different machines.
> >
> > We had the nfs server crashing a few weeks ago because the file-max limit
> > had been reached. At the time we increased the number of file handles it
> > can handle and have been monitoring since. We have noticed that the number
> > of entries on that machine keeps increasing though, and despite our best
> > efforts we have been unable to identify the cause. Anything I can find
> > related to this is from a well-known bug in 2011 and nothing afterwards.
> > We are assuming this is caused by a leak of file handles on the nfs side,
> > but we are not sure.
> >
> > Does anyone have any way of figuring out what is causing this? Output from
> > file-nr, lsof, etc. is below.
>
> Off the top of my head, the only idea I have is to try watching
>
>     grep nfsd4 /proc/slabinfo
>
> and see if any of those objects are also leaking.
>
> --b.
>
> >
> > Thank you very much for any help you can provide.
> >
> > Best regards,
> > Bruno Santos
> >
> > :~# while :;do echo "$(date): $(cat /proc/sys/fs/file-nr)";sleep 30;done
> > Mon 15 Apr 17:23:11 BST 2019: 2466176 0 4927726
> > Mon 15 Apr 17:23:41 BST 2019: 2466176 0 4927726
> > Mon 15 Apr 17:24:11 BST 2019: 2466336 0 4927726
> > Mon 15 Apr 17:24:41 BST 2019: 2466240 0 4927726
> > Mon 15 Apr 17:25:11 BST 2019: 2466560 0 4927726
> > Mon 15 Apr 17:25:41 BST 2019: 2466336 0 4927726
> > Mon 15 Apr 17:26:11 BST 2019: 2466400 0 4927726
> > Mon 15 Apr 17:26:41 BST 2019: 2466432 0 4927726
> > Mon 15 Apr 17:27:11 BST 2019: 2466688 0 4927726
> > Mon 15 Apr 17:27:41 BST 2019: 2466624 0 4927726
> > Mon 15 Apr 17:28:11 BST 2019: 2466784 0 4927726
> > Mon 15 Apr 17:28:41 BST 2019: 2466688 0 4927726
> > Mon 15 Apr 17:29:11 BST 2019: 2466816 0 4927726
> > Mon 15 Apr 17:29:42 BST 2019: 2466752 0 4927726
> > Mon 15 Apr 17:30:12 BST 2019: 2467072 0 4927726
> > Mon 15 Apr 17:30:42 BST 2019: 2466880 0 4927726
> >
> > ~# lsof|wc -l
> > 3428