Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE

Xiubo Li <xiubli@xxxxxxxxxx> · Wed, 9 Feb 2022 09:35:44 +0800

On 2/8/22 11:16 PM, Gregory Farnum wrote:
On Mon, Feb 7, 2022 at 11:56 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:

On 2/7/22 11:12 PM, Jeff Layton wrote:
On Mon, 2022-02-07 at 13:03 +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

If MDS return ESTALE, that means the MDS has already iterated all
the possible active MDSes including the auth MDS or the inode is
under purging. No need to retry in auth MDS and will just return
ESTALE directly.

When you say "purging" here, do you mean that it's effectively being
cleaned up after being unlinked? Or is it just being purged from the
MDS's cache?
There is one case when the client just removes the file or the file is
force overrode via renaming, the related dentry in MDS will be put to
stray dir and the Inode will be marked as under purging.
I'm confused by this statement. Renaming inode B over inode A should
still just be a normal unlink for A, and not trigger an immediate
purge. Is there something in the MDS that force-starts that and breaks
clients who still have the inode held open?

Yeah, you are right. I went through the mds related code again and if 
there has any caps is held by clients will it be deferred queuing to the 
purge queue.

I will went the mds code today try to figure out whether there has bugs 
when replaying the journals and reconnecting the clients when the MDS 
starting.

Thanks.

-Greg

So in case when
the client try to call getattr, for example, after that, or retries the
pending getattr request, which can cause the MDS returns ESTALE errno in
theory.

Locally I still haven't reproduced it yet.

Or it will cause definite loop for retrying it.

URL: https://tracker.ceph.com/issues/53504
Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---
   fs/ceph/mds_client.c | 29 -----------------------------
   1 file changed, 29 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 93e5e3c4ba64..c918d2ac8272 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)

      result = le32_to_cpu(head->result);

-    /*
-     * Handle an ESTALE
-     * if we're not talking to the authority, send to them
-     * if the authority has changed while we weren't looking,
-     * send to new authority
-     * Otherwise we just have to return an ESTALE
-     */
-    if (result == -ESTALE) {
-            dout("got ESTALE on request %llu\n", req->r_tid);
-            req->r_resend_mds = -1;
-            if (req->r_direct_mode != USE_AUTH_MDS) {
-                    dout("not using auth, setting for that now\n");
-                    req->r_direct_mode = USE_AUTH_MDS;
-                    __do_request(mdsc, req);
-                    mutex_unlock(&mdsc->mutex);
-                    goto out;
-            } else  {
-                    int mds = __choose_mds(mdsc, req, NULL);
-                    if (mds >= 0 && mds != req->r_session->s_mds) {
-                            dout("but auth changed, so resending\n");
-                            __do_request(mdsc, req);
-                            mutex_unlock(&mdsc->mutex);
-                            goto out;
-                    }
-            }
-            dout("have to return ESTALE on request %llu\n", req->r_tid);
-    }
-
-
      if (head->safe) {
              set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
              __unregister_request(mdsc, req);
(cc'ing Greg, Sage and Zheng)

This patch sort of contradicts the original design, AFAICT, and I'm not
sure what the correct behavior should be. I could use some
clarification.

The original code (from the 2009 merge) would tolerate 2 ESTALEs before
giving up and returning that to userland. Then in 2010, Greg added this
commit:

      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f

...which would presumably make it retry indefinitely as long as the auth
MDS kept changing. Then, Zheng made this change in 2013:

      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938

...which seems to try to do the same thing, but detected the auth mds
change in a different way.

Is that where livelock detection was broken? Or was there some
corresponding change to __choose_mds that should prevent infinitely
looping on the same request?

In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
and that the server has forgotten about it. Does it mean the same thing
to the ceph MDS?

Has the behavior of the MDS changed such that these retries are no
longer necessary on an ESTALE? If so, when did this change, and does the
client need to do anything to detect what behavior it should be using?