Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE

Gregory Farnum <gfarnum@xxxxxxxxxx> · Tue, 8 Feb 2022 07:16:38 -0800



On Mon, Feb 7, 2022 at 11:56 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 2/7/22 11:12 PM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 13:03 +0800, xiubli@xxxxxxxxxx wrote:
> >> From: Xiubo Li <xiubli@xxxxxxxxxx>
> >>
> >> If MDS return ESTALE, that means the MDS has already iterated all
> >> the possible active MDSes including the auth MDS or the inode is
> >> under purging. No need to retry in auth MDS and will just return
> >> ESTALE directly.
> >>
> > When you say "purging" here, do you mean that it's effectively being
> > cleaned up after being unlinked? Or is it just being purged from the
> > MDS's cache?
>
> There is one case when the client just removes the file or the file is
> force overrode via renaming, the related dentry in MDS will be put to
> stray dir and the Inode will be marked as under purging.

I'm confused by this statement. Renaming inode B over inode A should
still just be a normal unlink for A, and not trigger an immediate
purge. Is there something in the MDS that force-starts that and breaks
clients who still have the inode held open?
-Greg

> So in case when
> the client try to call getattr, for example, after that, or retries the
> pending getattr request, which can cause the MDS returns ESTALE errno in
> theory.
>
> Locally I still haven't reproduced it yet.
>
> >> Or it will cause definite loop for retrying it.
> >>
> >> URL: https://tracker.ceph.com/issues/53504
> >> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> >> ---
> >>   fs/ceph/mds_client.c | 29 -----------------------------
> >>   1 file changed, 29 deletions(-)
> >>
> >> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >> index 93e5e3c4ba64..c918d2ac8272 100644
> >> --- a/fs/ceph/mds_client.c
> >> +++ b/fs/ceph/mds_client.c
> >> @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> >>
> >>      result = le32_to_cpu(head->result);
> >>
> >> -    /*
> >> -     * Handle an ESTALE
> >> -     * if we're not talking to the authority, send to them
> >> -     * if the authority has changed while we weren't looking,
> >> -     * send to new authority
> >> -     * Otherwise we just have to return an ESTALE
> >> -     */
> >> -    if (result == -ESTALE) {
> >> -            dout("got ESTALE on request %llu\n", req->r_tid);
> >> -            req->r_resend_mds = -1;
> >> -            if (req->r_direct_mode != USE_AUTH_MDS) {
> >> -                    dout("not using auth, setting for that now\n");
> >> -                    req->r_direct_mode = USE_AUTH_MDS;
> >> -                    __do_request(mdsc, req);
> >> -                    mutex_unlock(&mdsc->mutex);
> >> -                    goto out;
> >> -            } else  {
> >> -                    int mds = __choose_mds(mdsc, req, NULL);
> >> -                    if (mds >= 0 && mds != req->r_session->s_mds) {
> >> -                            dout("but auth changed, so resending\n");
> >> -                            __do_request(mdsc, req);
> >> -                            mutex_unlock(&mdsc->mutex);
> >> -                            goto out;
> >> -                    }
> >> -            }
> >> -            dout("have to return ESTALE on request %llu\n", req->r_tid);
> >> -    }
> >> -
> >> -
> >>      if (head->safe) {
> >>              set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
> >>              __unregister_request(mdsc, req);
> >
> > (cc'ing Greg, Sage and Zheng)
> >
> > This patch sort of contradicts the original design, AFAICT, and I'm not
> > sure what the correct behavior should be. I could use some
> > clarification.
> >
> > The original code (from the 2009 merge) would tolerate 2 ESTALEs before
> > giving up and returning that to userland. Then in 2010, Greg added this
> > commit:
> >
> >      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f
> >
> > ...which would presumably make it retry indefinitely as long as the auth
> > MDS kept changing. Then, Zheng made this change in 2013:
> >
> >      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
> >
> > ...which seems to try to do the same thing, but detected the auth mds
> > change in a different way.
> >
> > Is that where livelock detection was broken? Or was there some
> > corresponding change to __choose_mds that should prevent infinitely
> > looping on the same request?
> >
> > In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
> > and that the server has forgotten about it. Does it mean the same thing
> > to the ceph MDS?
> >
> > Has the behavior of the MDS changed such that these retries are no
> > longer necessary on an ESTALE? If so, when did this change, and does the
> > client need to do anything to detect what behavior it should be using?
>