Re: Potential leak in file put

Shyam Prasad N <nspmangalore@xxxxxxxxx> · Fri, 21 Jul 2023 16:43:05 +0530

Hi Tom,

Thanks for these points.

On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@xxxxxxxxxx> wrote:
>
> On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
> > Hi all,
> >
> > cifs.ko seems to be leaking handles occasionally when put under some
> > stress testing.
> > I was scanning the code for potential places where we could be
> > leaking, and one particular instance caught my attention.
> >
> > In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
> > remove it from the lists and then send the close request over the
> > wire...
> >          if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
> >                  struct TCP_Server_Info *server = tcon->ses->server;
> >                  unsigned int xid;
> >
> >                  xid = get_xid();
> >                  if (server->ops->close_getattr)
> >                          server->ops->close_getattr(xid, tcon, cifs_file);
> >                  else if (server->ops->close)
> >                          server->ops->close(xid, tcon, &cifs_file->fid);
> >                  _free_xid(xid);
> >          }
> >
> > But as you can see above, we do not have return value from the above handlers.
>
> Yeah, that's a problem. The most obvious is if the network becomes
> partitioned and the close(s) fail. Are you injecting that kind of
> error?

So this was revealed with a stress testing setup where the mount was
done against a server that gave out only 512 credits per connection.
The connection was pretty much starved for credits, which threw up
out-of-credits errors occasionally.
I've confirmed that when it happens for close, there are handle leaks.

>
> Still, the logic is going to grow some serious hair if we start
> retrying every error. What will bound the retries, and what kind
> of error(s) might be considered fatal? Tying up credits, message
> id's, handles, etc can be super problematic.
>
> Also, have you considered using some sort of laundromat? Or is it
> critical that these closes happen before proceeding?
>

Steve and I discussed this yesterday. One option that came out was...
We could cleanup the handle locally and then keep retrying the server
close as a delayed work a fixed number of times.
If a specified limit is exceeded, reconnect the session so that we start afresh.
I guess this is what you meant by laundromat?

> Tom.
>
> > What would happen if the above close_getattr or close calls fail?
> > Particularly, what would happen if the failure happens even before the
> > request is put in flight?
> > In this stress testing we have the server giving out lesser credits.
> > So with the testing, the credit counters on the active connections are
> > expected to be low in general.
> > I'm assuming that the above condition will leak handles.
> >
> > I was thinking about making a change to let the above handlers return
> > rc rather than void. And then to check the status.
> > If failure, create a delayed work for closing the remote handle. But
> > I'm not convinced that this is the right approach.
> > I'd like to know more comments on this.
> >

-- 
Regards,
Shyam