Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE

On Wed, 2022-02-09 at 14:00 +0800, Xiubo Li wrote:
> On 2/8/22 1:11 AM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> > > On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > The tracker bug mentions that this occurs after an MDS is restarted.
> > > > Could this be the result of clients relying on delete-on-last-close
> > > > behavior?
> > > Oooh, I didn't actually look at the tracker.
> > > 
> > > > IOW, we have a situation where a file is opened and then unlinked, and
> > > > userland is actively doing I/O to it. The thing gets moved into the
> > > > strays dir, but isn't unlinked yet because we have open files against
> > > > it. Everything works fine at this point...
> > > > 
> > > > Then, the MDS restarts and the inode gets purged altogether. Client
> > > > reconnects and tries to reclaim his open, and gets ESTALE.
> > > Uh, okay. So I didn't do a proper audit before I sent my previous
> > > reply, but one of the cases I did see was that the MDS returns ESTALE
> > > if you try to do a name lookup on an inode in the stray directory. I
> > > don't know if that's what is happening here or not? But perhaps that's
> > > the root of the problem in this case.
> > > 
> > > Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> > > directly so it must indeed be coming out of MDCache::path_traverse.
> > > 
> > > The MDS shouldn't move an inode into the purge queue on restart unless
> > > there were no clients with caps on it (that state is persisted to disk
> > > so it knows). Maybe if the clients don't make the reconnect window
> > > it's dropping them all and *then* moves it into purge queue? I think
> > > we need to identify what's happening there before we issue kernel
> > > client changes, Xiubo?
> > 
> > Agreed. I think we need to understand why he's seeing ESTALE errors in
> > the first place, but it sounds like retrying on an ESTALE error isn't
> > likely to be helpful.
> 
> There is one case that could cause the inode to be put into the
> purge queue:
> 
> 1. A file is unlinked and, just after the unlink journal log is
> flushed, the MDS restarts or is replaced by a standby MDS. The unlink
> journal log will contain a straydn, and the straydn will link to the
> related CInode.
> 
> 2. The newly started MDS will replay this unlink journal log in the
> up:standby_replay state.
> 
> 3. The MDCache::upkeep_main() thread will try to trim the MDCache, and
> it may trim the straydn. Since the clients haven't reconnected their
> sessions yet, the CInode won't have any client caps, so when the
> straydn and CInode are trimmed, the CInode will be put into the purge
> queue.
> 
> 4. After up:reconnect, when the getattr requests are retried, the MDS
> will return ESTALE.
> 
> This was recently fixed by https://github.com/ceph/ceph/pull/41667, 
> which only enables trim() in the up:active state.
> 
> I also went through the ESTALE-related code in the MDS. This patch 
> still makes sense, and retrying the request after getting an ESTALE 
> errno makes no sense.
> 

Agreed. I think retrying an operation directly on an ESTALE makes no
sense, and it probably prevents the ESTALE handling in the VFS layer
from working properly.
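
To make the intent concrete, here's a rough userspace model of the
before/after behavior. None of this is the actual fs/ceph/mds_client.c
code; the names are made up for illustration, and the "MDS" here always
answers -ESTALE, which is the purged-inode case from the tracker:

#include <errno.h>
#include <stdio.h>

/* Stand-in for an MDS that has already purged the inode. */
static int send_to_mds(const char *op)
{
	printf("sending %s\n", op);
	return -ESTALE;
}

/* Old behavior: re-drive the request toward the auth MDS on ESTALE. */
static int handle_reply_old(const char *op)
{
	int err = send_to_mds(op);

	if (err == -ESTALE) {
		printf("got ESTALE, retrying against the auth MDS\n");
		err = send_to_mds(op);	/* inode is gone, still ESTALE */
	}
	return err;
}

/* Patched behavior: fail the request directly. */
static int handle_reply_new(const char *op)
{
	return send_to_mds(op);
}

int main(void)
{
	printf("old result: %d\n", handle_reply_old("getattr"));
	printf("new result: %d\n", handle_reply_new("getattr"));
	return 0;
}

Both versions end up handing -ESTALE back to the caller here; the
difference is that the old one burns an extra round trip retrying an
inode that no longer exists before the VFS ever gets a say.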

Usually, when we get back an ESTALE error on NFS for a path-based
operation, it means the client did a lookup (often out of its cache)
and found an inode, but by the time we got around to doing the
operation, the filehandle had vanished from the server. The kernel will
then set LOOKUP_REVAL and redo the pathwalk, since an ESTALE there is a
pretty good indicator that the dcache was wrong.
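
For reference, that retry is driven by a pattern like the one in
vfs_statx() and friends in fs/stat.c; roughly (a simplified sketch, not
a verbatim copy of the kernel source):

retry:
	error = user_path_at(dfd, filename, lookup_flags, &path);
	if (error)
		goto out;

	error = vfs_getattr(&path, stat, request_mask, flags);
	path_put(&path);

	/* retry_estale(): error == -ESTALE && !(lookup_flags & LOOKUP_REVAL) */
	if (retry_estale(error, lookup_flags)) {
		lookup_flags |= LOOKUP_REVAL;
		goto retry;
	}
out:
	return error;

That only works if the filesystem actually returns -ESTALE to the VFS
instead of eating it and retrying internally.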

It's a bit different in ceph since it's a (semi-)path-based protocol
and the lookup cache coherency is tighter, but I think the same sorts
of races are possible. Allowing ESTALE to bubble back up to the VFS
layer would let it redo the lookups and retry the operation.

I'm planning to take this patch into the testing branch and aim for it
to go into v5.18.

Thanks!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>


