Hi Tom, Thanks for these points. On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@xxxxxxxxxx> wrote: > > On 7/19/2023 8:10 AM, Shyam Prasad N wrote: > > Hi all, > > > > cifs.ko seems to be leaking handles occasionally when put under some > > stress testing. > > I was scanning the code for potential places where we could be > > leaking, and one particular instance caught my attention. > > > > In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we > > remove it from the lists and then send the close request over the > > wire... > > if (!tcon->need_reconnect && !cifs_file->invalidHandle) { > > struct TCP_Server_Info *server = tcon->ses->server; > > unsigned int xid; > > > > xid = get_xid(); > > if (server->ops->close_getattr) > > server->ops->close_getattr(xid, tcon, cifs_file); > > else if (server->ops->close) > > server->ops->close(xid, tcon, &cifs_file->fid); > > _free_xid(xid); > > } > > > > But as you can see above, we do not have return value from the above handlers. > > Yeah, that's a problem. The most obvious is if the network becomes > partitioned and the close(s) fail. Are you injecting that kind of > error? So this was revealed with a stress testing setup where the mount was done against a server that gave out only 512 credits per connection. The connection was pretty much starved for credits, which threw up out-of-credits errors occasionally. I've confirmed that when it happens for close, there are handle leaks. > > Still, the logic is going to grow some serious hair if we start > retrying every error. What will bound the retries, and what kind > of error(s) might be considered fatal? Tying up credits, message > id's, handles, etc can be super problematic. > > Also, have you considered using some sort of laundromat? Or is it > critical that these closes happen before proceeding? > Steve and I discussed this yesterday. One option that came out was... We could cleanup the handle locally and then keep retrying the server close as a delayed work a fixed number of times. If a specified limit is exceeded, reconnect the session so that we start afresh. I guess this is what you meant by laundromat? > Tom. > > > What would happen if the above close_getattr or close calls fail? > > Particularly, what would happen if the failure happens even before the > > request is put in flight? > > In this stress testing we have the server giving out lesser credits. > > So with the testing, the credit counters on the active connections are > > expected to be low in general. > > I'm assuming that the above condition will leak handles. > > > > I was thinking about making a change to let the above handlers return > > rc rather than void. And then to check the status. > > If failure, create a delayed work for closing the remote handle. But > > I'm not convinced that this is the right approach. > > I'd like to know more comments on this. > > -- Regards, Shyam