> -----Original Message-----
> From: Chuck Lever III <chuck.lever@xxxxxxxxxx>
>
> > On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <romero@xxxxxxxx> wrote:
> >
> > In a large campus environment, usage of the relevant memory pool will eventually get so
> > high that a server-side reboot will be needed.
>
> The above is sticking with me a bit.
>
> Rebooting the server should force clients to re-establish state.
>
> Are they not re-establishing open file state for users whose
> ticket has expired? I would think each client would re-establish
> state for those open files anyway, and the server would be in the
> same overcommitted state it was in before it rebooted.

When the number of opens gets close to the limit that would cause a disruptive
NFSv4 service interruption (currently the limit is 128K open files), I do the
reboot. (More precisely, I transfer the affected NFS serving resource from one
NAS cluster node to the other; in my experience this amounts to a 99.9%
"non-disruptive reboot" of that NFS serving resource.)

Before the resource transfer there will be ~126K open files (from the NAS
perspective). A fraction of a second after the transfer there will be close to
zero files open. Within a few seconds there will be ~2000, and within a few
minutes ~2100. Over the rest of the day I only see a slow rise in the average
number of opens, to maybe 2200.

(My take is that ~2100 files were "active opens" both before and after the
resource transfer; the rest of the 126K opens were zombies that the clients
were no longer using.)

Over the next 4-6 months the number of opens, from the NAS perspective, slowly
creeps back up to the limit.

> We might not have an accurate root cause analysis yet, or I could
> be missing something.
>
> --
> Chuck Lever
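
In case it helps anyone try to reproduce or observe this on a Linux knfsd
server (our NAS is a proprietary appliance, so I can only see its aggregate
counter), below is a rough sketch of how per-client open-state counts could be
tallied from /proc/fs/nfsd/clients/ (available on kernels 5.3+). The record
format of the per-client "states" file, including the "type: open" field
matched below, is an assumption based on recent kernels; adjust as needed.

#!/usr/bin/env python3
# Rough sketch: tally NFSv4 state records per client on a Linux knfsd
# server via /proc/fs/nfsd/clients/ (kernel 5.3+). Run as root.
# NOTE: the exact "states" record format may differ between kernel
# versions; the "type: open" match is an assumption, not guaranteed.

import glob
import os

def count_states(states_path):
    """Count total state records, and those that look like opens."""
    total = opens = 0
    try:
        with open(states_path) as f:
            for line in f:
                # Each record appears to start with "- 0x...: { ... }"
                if line.lstrip().startswith("-"):
                    total += 1
                    if "type: open" in line:
                        opens += 1
    except OSError:
        pass  # a client may expire while we are scanning
    return total, opens

def main():
    grand_total = grand_opens = 0
    for clientdir in sorted(glob.glob("/proc/fs/nfsd/clients/*/")):
        total, opens = count_states(os.path.join(clientdir, "states"))
        grand_total += total
        grand_opens += opens
        print(f"{clientdir}: {total} state records ({opens} opens)")
    print(f"TOTAL: {grand_total} state records ({grand_opens} opens)")

if __name__ == "__main__":
    main()

Run periodically (e.g. from cron) and logged, something like this would show
which clients are the ones slowly accumulating the never-released opens, which
might help with the root-cause question above.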