Re: Zombie / Orphan open files

On Tue, 2023-01-31 at 17:14 -0500, Olga Kornievskaia wrote:
> On Tue, Jan 31, 2023 at 2:55 PM Andrew J. Romero <romero@xxxxxxxx> wrote:
> > 
> > 
> > 
> > > What you are describing sounds like a bug in a system (be it client or
> > > server): there is state that the client thought it closed but that the
> > > server is still keeping.
> > 
> > Hi Olga
> > 
> > Based on my simple test script experiment,
> > here's a summary of what I believe is happening:
> > 
> > 1. An interactive user starts a process that opens a file or multiple files
> > 
> > 2. A disruption that prevents
> >    NFS-client <-> NFS-server communication
> >    occurs while the file is open.  This could be because
> >    the file was open a long time or because it was opened
> >    too close to the time of the disruption.
> > 
> > (I believe the most common "disruption" is
> > credential expiration.)
> > 
> > 3. The user's process terminates before the disruption
> >    is cleared (or, stated another way, the disruption is not
> >    cleared until after the user process terminates).
> > 
> >    At the time the user process terminates, the process
> >    cannot tell the server to close the server-side file state.
> > 
> >    After the process terminates, nothing will ever tell the server
> >    to close the files.  The now-zombie open files will continue to
> >    consume server-side resources.
> > 
> >   In environments with many users, the problem is significant.
> > 
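The three steps above can be modeled with a toy sketch (plain Python, not NFS code; the class and file names are made up for illustration) showing how the server-side open state outlives the client process:

```python
# Toy model (NOT real NFS code) of how server-side open state can leak
# when a client process exits while communication is disrupted.

class ToyServer:
    """Tracks open-state entries the way an NFS server tracks OPENs."""
    def __init__(self):
        self.open_state = {}   # stateid -> filename
        self.next_id = 0

    def nfs_open(self, filename):
        self.next_id += 1
        self.open_state[self.next_id] = filename
        return self.next_id

    def nfs_close(self, stateid):
        del self.open_state[stateid]


class ToyClient:
    """A client whose CLOSE is lost while communication is disrupted."""
    def __init__(self, server):
        self.server = server
        self.disrupted = False   # e.g. expired credentials

    def open(self, filename):
        return self.server.nfs_open(filename)

    def process_exit(self, stateid):
        # On exit the process tries to send CLOSE; if the disruption is
        # still in effect, the CLOSE never reaches the server and, in
        # this model, nothing ever retries it afterwards.
        if not self.disrupted:
            self.server.nfs_close(stateid)


server = ToyServer()
client = ToyClient(server)

sid = client.open("/nfs/home/user/data.txt")  # step 1: process opens a file
client.disrupted = True                       # step 2: disruption occurs
client.process_exit(sid)                      # step 3: process exits first

# The server still holds the open state: a "zombie" open file.
print(len(server.open_state))   # -> 1
```

The model is deliberately minimal; the open question in this thread is whether a real client has a later opportunity (machine credentials, state recovery) to send the CLOSE that this toy client never retries.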
> > My reasons for posting are not to have your team help
> > troubleshoot my specific issue (that would be quite rude).
> > 
> > They are:
> > 
> > - To determine if my NAS vendor might accidentally not be
> >   doing something they should be.
> >   (I now don't really think this is the case.)
> 
> It's hard to say who's at fault here without having some more info
> like tracepoints or network traces.
> 
> > - To determine if this is a known behavior common to all NFS
> >   implementations (Linux, etc.) and, if so, to have your team decide
> >   whether this is a problem that should be addressed in the spec
> >   and the implementations.
> 
> What you describe (the client and server having different views of
> state) is not a known common behaviour.
> 
> I have tried it on my Kerberos setup:
> I got a 5-minute ticket.
> As a user, I opened a file in a process that then went to sleep.
> My user credentials expired (after 5 mins). I verified that by
> doing an "ls" on the mounted filesystem, which resulted in a
> permission denied error.
> Then I killed the application that had the open file. This resulted
> in an NFS CLOSE being sent to the server using the machine's gss
> context (which is the default behaviour of the Linux client regardless
> of whether or not the user's credentials are valid).
> 
> Basically, as far as I can tell, a Linux client can handle cleaning up
> state when the user's credentials have expired.

That's pretty much what I expected from looking at the code. I think
this is done via the call to nfs4_state_protect. That calls:

        if (test_bit(sp4_mode, &clp->cl_sp4_flags)) {
                msg->rpc_cred = rpc_machine_cred();
                ...
        }

Could it be that cl_sp4_flags doesn't have NFS_SP4_MACH_CRED_CLEANUP set
on his clients? AFAICT, that comes from the server. It also looks like
cl_sp4_flags may not get set on a NFSv4.0 mount.
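To make that suspicion concrete, here is a toy sketch (plain Python, not kernel code; the flag value and function are made up for illustration) of the credential choice I'd expect a CLOSE to face:

```python
# Toy model (NOT kernel code) of the credential selection suspected above:
# if the server granted NFS_SP4_MACH_CRED_CLEANUP (possible on v4.1+),
# a CLOSE can fall back to the machine credential; without it (e.g. on a
# v4.0 mount), the CLOSE is stuck with the user's expired credential.

NFS_SP4_MACH_CRED_CLEANUP = 1 << 0   # illustrative flag value only

def pick_cred_for_close(cl_sp4_flags, user_cred_valid):
    """Return which credential a CLOSE would be sent with, or None
    if no usable credential is available."""
    if cl_sp4_flags & NFS_SP4_MACH_CRED_CLEANUP:
        return "machine"             # analogous to rpc_machine_cred()
    if user_cred_valid:
        return "user"
    return None                      # CLOSE cannot be sent: state leaks

# v4.1 mount where the server granted MACH_CRED_CLEANUP:
print(pick_cred_for_close(NFS_SP4_MACH_CRED_CLEANUP, user_cred_valid=False))

# v4.0 mount (no SP4 flags) with expired user credentials:
print(pick_cred_for_close(0, user_cred_valid=False))
```

If this model is right, Olga's successful cleanup would hinge on the server granting the cleanup flag, which a v4.0 mount cannot do.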

Olga, can you test that with a v4.0 mount?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>



