On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > The tracker bug mentions that this occurs after an MDS is restarted.
> > Could this be the result of clients relying on delete-on-last-close
> > behavior?
>
> Oooh, I didn't actually look at the tracker.
>
> >
> > IOW, we have a situation where a file is opened and then unlinked, and
> > userland is actively doing I/O to it. The thing gets moved into the
> > strays dir, but isn't unlinked yet because we have open files against
> > it. Everything works fine at this point...
> >
> > Then, the MDS restarts and the inode gets purged altogether. Client
> > reconnects and tries to reclaim his open, and gets ESTALE.
>
> Uh, okay. So I didn't do a proper audit before I sent my previous
> reply, but one of the cases I did see was that the MDS returns ESTALE
> if you try to do a name lookup on an inode in the stray directory. I
> don't know if that's what is happening here or not? But perhaps that's
> the root of the problem in this case.
>
> Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> directly so it must indeed be coming out of MDCache::path_traverse.
>
> The MDS shouldn't move an inode into the purge queue on restart unless
> there were no clients with caps on it (that state is persisted to disk
> so it knows). Maybe if the clients don't make the reconnect window
> it's dropping them all and *then* moves it into purge queue? I think
> we need to identify what's happening there before we issue kernel
> client changes, Xiubo?

Agreed. I think we need to understand why he's seeing ESTALE errors in
the first place, but it sounds like retrying on an ESTALE error isn't
likely to be helpful.
--
Jeff Layton <jlayton@xxxxxxxxxx>
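
For reference, the open-then-unlink pattern discussed above (userland relying on delete-on-last-close) can be sketched roughly as below. This is only an illustration on a local POSIX filesystem, not a CephFS reproducer: the function name open_then_unlink is made up for this example, and nothing here exercises the MDS, the stray directory, or the reconnect path. It just shows the behavior userland expects, namely that I/O through an already-open descriptor keeps working after the name is removed.

```python
import os
import tempfile


def open_then_unlink(directory):
    """Open a file, unlink it, then keep doing I/O on the open descriptor.

    Hypothetical helper for illustration only.
    Returns (link_count_after_unlink, data_read_back, path_still_exists).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.unlink(path)                 # the name is gone, the inode stays open
        nlink = os.fstat(fd).st_nlink   # typically 0 on local POSIX filesystems
        os.write(fd, b"still writable")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)
        return nlink, data, os.path.exists(path)
    finally:
        os.close(fd)                    # last close: the inode can now be freed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(open_then_unlink(d))
```

If the scenario described in the thread is what is happening, the equivalent of the write/read calls above would be the point where a client sees ESTALE after the MDS restart, since the inode it still holds open has already been purged.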