On Tue, Feb 8, 2022 at 10:00 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 2/8/22 1:11 AM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> >> On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >>> The tracker bug mentions that this occurs after an MDS is restarted.
> >>> Could this be the result of clients relying on delete-on-last-close
> >>> behavior?
> >> Oooh, I didn't actually look at the tracker.
> >>
> >>> IOW, we have a situation where a file is opened and then unlinked, and
> >>> userland is actively doing I/O to it. The thing gets moved into the
> >>> strays dir, but isn't unlinked yet because we have open files against
> >>> it. Everything works fine at this point...
> >>>
> >>> Then, the MDS restarts and the inode gets purged altogether. Client
> >>> reconnects and tries to reclaim his open, and gets ESTALE.
> >> Uh, okay. So I didn't do a proper audit before I sent my previous
> >> reply, but one of the cases I did see was that the MDS returns ESTALE
> >> if you try to do a name lookup on an inode in the stray directory. I
> >> don't know if that's what is happening here or not? But perhaps that's
> >> the root of the problem in this case.
> >>
> >> Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> >> directly, so it must indeed be coming out of MDCache::path_traverse.
> >>
> >> The MDS shouldn't move an inode into the purge queue on restart unless
> >> there were no clients with caps on it (that state is persisted to disk,
> >> so it knows). Maybe if the clients don't make the reconnect window,
> >> it's dropping them all and *then* moving it into the purge queue? I
> >> think we need to identify what's happening there before we issue kernel
> >> client changes, Xiubo?
> >
> > Agreed. I think we need to understand why he's seeing ESTALE errors in
> > the first place, but it sounds like retrying on an ESTALE error isn't
> > likely to be helpful.
>
> There is one case that could cause the inode to be put into the purge
> queue:
>
> 1. A file is unlinked, and just after the unlink journal log is
> flushed, the MDS is restarted or replaced by a standby MDS. The unlink
> journal log will contain a straydn, and the straydn will link to the
> related CInode.
>
> 2. The newly started MDS will replay this unlink journal log in the
> up:standby_replay state.
>
> 3. The MDCache::upkeep_main() thread will try to trim the MDCache, and
> it may trim the straydn. Since the clients haven't reconnected their
> sessions yet, the CInode won't have any client caps, so when the
> straydn and CInode are trimmed, the CInode will be put into the purge
> queue.
>
> 4. After up:reconnect, when the getattr requests are retried, the MDS
> will return ESTALE.
>
> This should have been fixed recently by
> https://github.com/ceph/ceph/pull/41667, which only enables trim() in
> the up:active state.
>
> I also went through the ESTALE-related code in the MDS; this patch
> still makes sense, and retrying the request when getting an ESTALE
> errno makes no sense.

Thanks for checking; this sounds good to me.

Acked-by: Greg Farnum <gfarnum@xxxxxxxxxx>

>
> BRs
>
> Xiubo
>
>
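
P.S. For readers skimming the thread, below is a rough illustration of
the MDS-side guard Xiubo describes (deferring cache trimming until the
rank is up:active, so stray inodes whose clients simply haven't
reconnected yet aren't pushed into the purge queue). This is only a
minimal sketch under that assumption; every class, enum, and function
name here is a hypothetical stand-in, not Ceph's actual upkeep code.
See the PR linked above for the real change.

// Sketch: skip cache trimming unless the rank is fully active, so stray
// dentries whose inodes still belong to not-yet-reconnected clients are
// not purged during standby_replay/reconnect. All names are hypothetical.
#include <iostream>

enum class MDSState { StandbyReplay, Reconnect, Active };

struct FakeMDCache {
    void trim() {
        // In the problematic ordering, this could run before clients had
        // reconnected, so their caps were invisible and a stray inode
        // looked safe to hand to the purge queue.
        std::cout << "trimming cache (strays may be purged)\n";
    }
};

struct FakeUpkeep {
    MDSState state;
    FakeMDCache cache;

    void upkeep_tick() {
        // The guard corresponding to "only enable trim() in up:active":
        // before the rank is active, clients haven't had a chance to
        // reconnect and reassert their caps, so trimming is deferred.
        if (state != MDSState::Active) {
            std::cout << "not active yet, deferring trim\n";
            return;
        }
        cache.trim();
    }
};

int main() {
    FakeUpkeep upkeep{MDSState::StandbyReplay, {}};
    upkeep.upkeep_tick();                 // deferred: replaying journal
    upkeep.state = MDSState::Reconnect;
    upkeep.upkeep_tick();                 // still deferred: caps not back
    upkeep.state = MDSState::Active;
    upkeep.upkeep_tick();                 // safe to trim now
    return 0;
}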