Re: Potential leak in file put

Shyam Prasad N <nspmangalore@xxxxxxxxx> · Wed, 26 Jul 2023 23:19:23 +0530

On Fri, Jul 21, 2023 at 7:21 PM Tom Talpey <tom@xxxxxxxxxx> wrote:
>
> On 7/21/2023 7:13 AM, Shyam Prasad N wrote:
> > Hi Tom,
> >
> > Thanks for these points.
> >
> > On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@xxxxxxxxxx> wrote:
> >>
> >> On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
> >>> Hi all,
> >>>
> >>> cifs.ko seems to be leaking handles occasionally when put under some
> >>> stress testing.
> >>> I was scanning the code for potential places where we could be
> >>> leaking, and one particular instance caught my attention.
> >>>
> >>> In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
> >>> remove it from the lists and then send the close request over the
> >>> wire...
> >>>           if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
> >>>                   struct TCP_Server_Info *server = tcon->ses->server;
> >>>                   unsigned int xid;
> >>>
> >>>                   xid = get_xid();
> >>>                   if (server->ops->close_getattr)
> >>>                           server->ops->close_getattr(xid, tcon, cifs_file);
> >>>                   else if (server->ops->close)
> >>>                           server->ops->close(xid, tcon, &cifs_file->fid);
> >>>                   _free_xid(xid);
> >>>           }
> >>>
> >>> But as you can see above, we do not have return value from the above handlers.
> >>
> >> Yeah, that's a problem. The most obvious is if the network becomes
> >> partitioned and the close(s) fail. Are you injecting that kind of
> >> error?
> >
> > So this was revealed with a stress testing setup where the mount was
> > done against a server that gave out only 512 credits per connection.
> > The connection was pretty much starved for credits, which threw up
> > out-of-credits errors occasionally.
> > I've confirmed that when it happens for close, there are handle leaks.
>
> Interesting. 512 credits doesn't seem like starvation, but it will
> definitely stress a high workload. Good!
>
> >> Still, the logic is going to grow some serious hair if we start
> >> retrying every error. What will bound the retries, and what kind
> >> of error(s) might be considered fatal? Tying up credits, message
> >> id's, handles, etc can be super problematic.
> >>
> >> Also, have you considered using some sort of laundromat? Or is it
> >> critical that these closes happen before proceeding?
> >>
> >
> > Steve and I discussed this yesterday. One option that came out was...
> > We could cleanup the handle locally and then keep retrying the server
> > close as a delayed work a fixed number of times.
> > If a specified limit is exceeded, reconnect the session so that we start afresh.
>
> Sounds reasonable, but things might get a little tricky if the
> server-side handle has a lease or some other state still attached.
> A subsequent client create on the same file might encounter an
> unexpected conflict? It's critical to think that through.

You make a good point. I did think about this.
There are several cases here.
1. If the subsequent open is from another client B: This would force
the server to send a lease break to this client A. But since the
client has closed all local data for the handle (only the server close
is pending), the lease break would go unrecognized. That is not a good
thing to do. So we'll need to keep zombie handles lying around (and in
the list) till the file is actually closed on the server. We just need
to make sure we mark these handles as zombies so future opens to the
file do not reuse this handle.
2. If the subsequent open is from the same client A: If the client
cannot find the handle, it can try an open on the server. However,
that open could fail depending on the open mode by earlier handle. So
I think we could use the above idea of zombie handles here too, where
the new opens could be failed with a retriable error (EAGAIN?).

Do you see a problem with this approach?

>
> > I guess this is what you meant by laundromat?
>
> So by laundromat I meant background processing to handle the
> close. It would have some sort of queued work list, and it
> would process the work items and wash-dry-fold the results.
>
> The main motivation would be to release the thread that fell
> into the refcnt == 0 so the calling application may continue.
> Stealing the thread for a full server roundtrip might be
> worth avoiding, in all cases.
>
> OTOH, if the tricky stuff above is risky or wrong, then forget
> the laundromat and don't return until the close is done. But
> then, think about ^C!
>
> Tom.
>
> >>> What would happen if the above close_getattr or close calls fail?
> >>> Particularly, what would happen if the failure happens even before the
> >>> request is put in flight?
> >>> In this stress testing we have the server giving out lesser credits.
> >>> So with the testing, the credit counters on the active connections are
> >>> expected to be low in general.
> >>> I'm assuming that the above condition will leak handles.
> >>>
> >>> I was thinking about making a change to let the above handlers return
> >>> rc rather than void. And then to check the status.
> >>> If failure, create a delayed work for closing the remote handle. But
> >>> I'm not convinced that this is the right approach.
> >>> I'd like to know more comments on this.
> >>>
> >
> >
> >

-- 
Regards,
Shyam