Re: [RFC PATCH 10/11] ceph: perform asynchronous unlink if we have sufficient caps

"Yan, Zheng" <ukernel@xxxxxxxxx> · Thu, 11 Apr 2019 10:50:55 +0800

On Thu, Apr 11, 2019 at 8:16 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Wed, Apr 10, 2019 at 3:45 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 10, 2019 at 5:54 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 2:11 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 4:21 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Apr 10, 2019 at 7:21 AM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > Holding caps for a request may cause a deadlock. For example:
> > > > > > > > > >
> > > > > > > > > > - the client holds Fx caps and sends an unlink request
> > > > > > > > > > - the MDS processes a request from another client, changes the
> > > > > > > > > > filelock's state to EXCL_FOO, and revokes the Fx caps
> > > > > > > > > > - the MDS receives the unlink request but can't process it because
> > > > > > > > > > it can't acquire a wrlock on the filelock
> > > > > > > > > >
> > > > > > > > > > The filelock state stays in EXCL_FOO because the client does not release the Fx caps.
> > > > > > > > >
> > > > > > > >
> > > > > > > > The client doing the unlink may have received a revoke for Fx on the
> > > > > > > > dir at that point, but it won't have returned it yet. Shouldn't it
> > > > > > > > still be considered to hold Fx on the dir until that happens?
> > > > > > > >
> > > > > > >
> > > > > > > > The client should release Fx. But there is a problem: the MDS may
> > > > > > > > process another request first after it gets the release of Fx.
> > > > > > >
> > > > > >
> > > > > > As I envisioned it, the client would hold a reference to Fx while the
> > > > > > unlink is in flight, so it would not return Fx until after the unlink
> > > > > > has gotten an unsafe reply.
> > > > >
> > > > > This was my understanding as well. It seems to me that the correct
> > > > > thing to do is to move forward with the understanding that the client
> > > > > has a write lock on the filelock state for the directory inode (for Fx
> > > > > cap) and a write lock on the linklock for the file inode (for the Lx
> > > > > cap). Obtaining those locks should require cap revocation which would
> > > > > cause the client to flush its buffered async unlinks. Importantly --
> > > > > and what actually needs to change (?): the MDS should skip acquiring
> > > > > those locks because the client already has the appropriate caps.
> > > > >
> > > > > Does that work Zheng?
> > > > >
> > > >
> > > > I'm not sure it will. IIUC...
> > > >
> > > > I think part of what Zheng is pointing out is that when we assume that
> > > > the client already holds certain locks, then we are effectively
> > > > changing the order in which they can be acquired. That can leave us
> > > > subject to ABBA style deadlocks (though with all of the complexity
> > > > that class Locker provides).
> > > >
> > > > That in and of itself wouldn't be a problem if the MDS code didn't
> > > > wait synchronously on cap revokes in some cases (which Zheng pointed
> > > > out). Fixing that latter bit seems like it might be a big win for
> > > > parallelism, in addition to making async calls more possible.
> > >
> > > Well it's not literally synchronous; it's just that the MDS holds on
> > > to the locks it's already taken. That's why you can see failure cases
> > > where the MDS is still running and resolving requests but there's a
> > > particular client which is stuck with one operation that never moves
> > > forward.
> > >
> >
> > Got it, thanks.
> >
> > Could we resolve this by just unwinding the held locks in that case
> > before requeueing the request? Then just reacquire them when we
> > reattempt it. We might need a livelock avoidance mechanism but that
> > doesn't sound too conceptually difficult.
>
> I'm hesitant to speak too authoritatively after this long and missing
> that it would apparently be trivially subject to ABBA, but the things
> that concern me about it are:
> 1) the freezing process for snapshots and exports relies on waiting
> until all acquired locks have been dropped. We could try and reference
> count "requests-that-will-ask-for-these-again" but it gets difficult
> 2) Livelock avoidance mechanisms are always difficult?
> 3) The locking and caps infrastructures are supposed to let us handle
> this. I grant you they're annoyingly difficult and undocumented; that
> needs to be fixed (and much of it CAN be!). I don't think designing
> new systems to avoid interacting with them is the right approach to
> take.
>
> I think I mentioned before that we really need to draw out the caps
> needed and how they interact. Why does it need more than the FxLx as
> we've discussed, Zheng? Is it unlikely the client can hold the
> necessary pieces upfront?

For the unlink case, the MDS also needs to wrlock the nestlock and rdlock
the snaplock. These locks are not covered by the cap mechanism.
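
To make the gap concrete, here is a minimal sketch (an illustration only, not code from the MDS). It lists the locks named in this thread for an unlink and reports which ones the client's caps could not stand in for. The Fx/filelock and Lx/linklock pairing follows the discussion quoted above; the table layout and the function name are hypothetical.

```python
# Hypothetical model for illustration; none of these names come from the Ceph source.
# Locks named in this thread for an unlink:
#   lock name -> (mode, cap that could stand in for it, or None)
UNLINK_LOCKS = {
    "filelock": ("wrlock", "Fx"),   # on the directory inode
    "linklock": ("wrlock", "Lx"),   # on the file inode
    "nestlock": ("wrlock", None),   # no corresponding cap
    "snaplock": ("rdlock", None),   # no corresponding cap
}


def locks_mds_must_still_take(held_caps):
    """Return the names of locks the client's caps do not cover, sorted."""
    return sorted(
        name
        for name, (_mode, cap) in UNLINK_LOCKS.items()
        if cap is None or cap not in held_caps
    )


print(locks_mds_must_still_take({"Fx", "Lx"}))
```

Even with both Fx and Lx held, the nestlock and snaplock remain for the MDS to acquire itself.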

> -Greg
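
For reference, the "unwind the held locks before requeueing, then reacquire on retry" idea raised earlier in the thread can be sketched roughly as below. This is a generic try-lock pattern using Python's threading.Lock, assumed purely for illustration. It does not model class Locker, cap revocation, or the freezing concerns Greg lists, and it has no livelock avoidance. The helper names are hypothetical.

```python
import threading


def try_acquire_all(locks):
    """Try to take every lock without blocking; on failure, unwind and return False."""
    held = []
    for lock in locks:
        if lock.acquire(blocking=False):
            held.append(lock)
        else:
            # Unwind: drop everything taken so far so other requests can make progress.
            for h in reversed(held):
                h.release()
            return False
    return True


def release_all(locks):
    """Release locks in the reverse of their acquisition order."""
    for lock in reversed(locks):
        lock.release()


a, b = threading.Lock(), threading.Lock()
b.acquire()                        # some other request currently holds b
first = try_acquire_all([a, b])    # fails on b, so a is released again
a_free = not a.locked()
b.release()                        # the other request finishes
second = try_acquire_all([a, b])   # the requeued attempt now succeeds
release_all([a, b])
print(first, a_free, second)
```

A real implementation would still need to bound the retries somehow, which is the livelock concern raised in the thread.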