On Tue, Feb 8, 2022 at 10:00 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 2/8/22 1:11 AM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> >> On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >>> The tracker bug mentions that this occurs after an MDS is restarted.
> >>> Could this be the result of clients relying on delete-on-last-close
> >>> behavior?
> >> Oooh, I didn't actually look at the tracker.
> >>
> >>> IOW, we have a situation where a file is opened and then unlinked, and
> >>> userland is actively doing I/O to it. The thing gets moved into the
> >>> strays dir, but isn't unlinked yet because we have open files against
> >>> it. Everything works fine at this point...
> >>>
> >>> Then, the MDS restarts and the inode gets purged altogether. Client
> >>> reconnects and tries to reclaim his open, and gets ESTALE.
> >> Uh, okay. So I didn't do a proper audit before I sent my previous
> >> reply, but one of the cases I did see was that the MDS returns ESTALE
> >> if you try to do a name lookup on an inode in the stray directory. I
> >> don't know if that's what is happening here or not? But perhaps that's
> >> the root of the problem in this case.
> >>
> >> Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> >> directly, so it must indeed be coming out of MDCache::path_traverse.
> >>
> >> The MDS shouldn't move an inode into the purge queue on restart unless
> >> there were no clients with caps on it (that state is persisted to disk,
> >> so it knows). Maybe if the clients don't make the reconnect window,
> >> it's dropping them all and *then* moving it into the purge queue? I
> >> think we need to identify what's happening there before we issue kernel
> >> client changes, Xiubo?
> >
> > Agreed. I think we need to understand why he's seeing ESTALE errors in
> > the first place, but it sounds like retrying on an ESTALE error isn't
> > likely to be helpful.
>
> There is one case that could cause the inode to be put into the purge
> queue:
>
> 1. A file is unlinked, and just after the unlink journal log is
> flushed, the MDS is restarted or replaced by a standby MDS. The unlink
> journal log will contain a straydn, and the straydn will link to the
> related CInode.
>
> 2. The newly started MDS will replay this unlink journal log in the
> up:standby_replay state.
>
> 3. The MDCache::upkeep_main() thread will try to trim the MDCache, and
> it may trim the straydn. Since the clients haven't reconnected their
> sessions yet, the CInode won't have any client caps, so when the
> straydn and CInode are trimmed, the CInode will be put into the purge
> queue.
>
> 4. After up:reconnect, when the getattr requests are retried, the MDS
> will return ESTALE.
>
> This should have been fixed recently by
> https://github.com/ceph/ceph/pull/41667, which only enables trim() in
> the up:active state.
>
> I also went through the ESTALE-related code in the MDS; this patch
> still makes sense, and retrying the request when getting an ESTALE
> errno makes no sense.

Thanks for checking; this sounds good to me.

Acked-by: Greg Farnum <gfarnum@xxxxxxxxxx>

>
> BRs
>
> Xiubo
>
>
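
P.S. For readers skimming the thread, below is a rough illustration of
the MDS-side guard Xiubo describes (deferring cache trimming until the
rank is up:active, so stray inodes whose clients simply haven't
reconnected yet aren't pushed into the purge queue). This is only a
minimal sketch under that assumption; every class, enum, and function
name here is a hypothetical stand-in, not Ceph's actual upkeep code.
See the PR linked above for the real change.

// Sketch: skip cache trimming unless the rank is fully active, so stray
// dentries whose inodes still belong to not-yet-reconnected clients are
// not purged during standby_replay/reconnect. All names are hypothetical.
#include <iostream>

enum class MDSState { StandbyReplay, Reconnect, Active };

struct FakeMDCache {
    void trim() {
        // In the problematic ordering, this could run before clients had
        // reconnected, so their caps were invisible and a stray inode
        // looked safe to hand to the purge queue.
        std::cout << "trimming cache (strays may be purged)\n";
    }
};

struct FakeUpkeep {
    MDSState state;
    FakeMDCache cache;

    void upkeep_tick() {
        // The guard corresponding to "only enable trim() in up:active":
        // before the rank is active, clients haven't had a chance to
        // reconnect and reassert their caps, so trimming is deferred.
        if (state != MDSState::Active) {
            std::cout << "not active yet, deferring trim\n";
            return;
        }
        cache.trim();
    }
};

int main() {
    FakeUpkeep upkeep{MDSState::StandbyReplay, {}};
    upkeep.upkeep_tick();                 // deferred: replaying journal
    upkeep.state = MDSState::Reconnect;
    upkeep.upkeep_tick();                 // still deferred: caps not back
    upkeep.state = MDSState::Active;
    upkeep.upkeep_tick();                 // safe to trim now
    return 0;
}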