On 7/21/2023 7:13 AM, Shyam Prasad N wrote:
Hi Tom,
Thanks for these points.
On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@xxxxxxxxxx> wrote:
On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
Hi all,
cifs.ko seems to be leaking handles occasionally when put under some
stress testing.
I was scanning the code for potential places where we could be
leaking, and one particular instance caught my attention.
In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
remove it from the lists and then send the close request over the
wire...
if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
struct TCP_Server_Info *server = tcon->ses->server;
unsigned int xid;
xid = get_xid();
if (server->ops->close_getattr)
server->ops->close_getattr(xid, tcon, cifs_file);
else if (server->ops->close)
server->ops->close(xid, tcon, &cifs_file->fid);
_free_xid(xid);
}
But as you can see above, we do not have return value from the above handlers.
Yeah, that's a problem. The most obvious is if the network becomes
partitioned and the close(s) fail. Are you injecting that kind of
error?
So this was revealed with a stress testing setup where the mount was
done against a server that gave out only 512 credits per connection.
The connection was pretty much starved for credits, which threw up
out-of-credits errors occasionally.
I've confirmed that when it happens for close, there are handle leaks.
Interesting. 512 credits doesn't seem like starvation, but it will
definitely stress a high workload. Good!
Still, the logic is going to grow some serious hair if we start
retrying every error. What will bound the retries, and what kind
of error(s) might be considered fatal? Tying up credits, message
id's, handles, etc can be super problematic.
Also, have you considered using some sort of laundromat? Or is it
critical that these closes happen before proceeding?
Steve and I discussed this yesterday. One option that came out was...
We could cleanup the handle locally and then keep retrying the server
close as a delayed work a fixed number of times.
If a specified limit is exceeded, reconnect the session so that we start afresh.
Sounds reasonable, but things might get a little tricky if the
server-side handle has a lease or some other state still attached.
A subsequent client create on the same file might encounter an
unexpected conflict? It's critical to think that through.
I guess this is what you meant by laundromat?
So by laundromat I meant background processing to handle the
close. It would have some sort of queued work list, and it
would process the work items and wash-dry-fold the results.
The main motivation would be to release the thread that fell
into the refcnt == 0 so the calling application may continue.
Stealing the thread for a full server roundtrip might be
worth avoiding, in all cases.
OTOH, if the tricky stuff above is risky or wrong, then forget
the laundromat and don't return until the close is done. But
then, think about ^C!
Tom.
What would happen if the above close_getattr or close calls fail?
Particularly, what would happen if the failure happens even before the
request is put in flight?
In this stress testing we have the server giving out lesser credits.
So with the testing, the credit counters on the active connections are
expected to be low in general.
I'm assuming that the above condition will leak handles.
I was thinking about making a change to let the above handlers return
rc rather than void. And then to check the status.
If failure, create a delayed work for closing the remote handle. But
I'm not convinced that this is the right approach.
I'd like to know more comments on this.