On Thu, Apr 11, 2019 at 7:31 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Wed, Apr 10, 2019 at 11:10 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 10, 2019 at 10:21 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 10:11 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 9:22 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Apr 10, 2019 at 9:06 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Apr 10, 2019 at 3:44 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > When performing an unlink, if we have Fx on the parent directory and
> > > > > > > Lx on the inode of the dentry being unlinked, then we don't need to
> > > > > > > wait on the reply before returning.
> > > > > > >
> > > > > > > In that situation, just fix up the dcache and link count and return
> > > > > > > immediately after issuing the call to the MDS. This does mean that we
> > > > > > > need to hold an extra reference to the inode being unlinked, and extra
> > > > > > > references to the caps to avoid races. Put those references in the
> > > > > > > r_callback routine.
> > > > > > >
> > > > > > > If the operation ends up failing, then set a writeback error on the
> > > > > > > directory inode that can be fetched later by an fsync on the dir.
> > > > > > >
> > > > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > > > ---
> > > > > > >  fs/ceph/caps.c  | 22 ++++++++++++++++++++++
> > > > > > >  fs/ceph/dir.c   | 38 +++++++++++++++++++++++++++++++++++---
> > > > > > >  fs/ceph/super.h |  1 +
> > > > > > >  3 files changed, 58 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > > > index d022e15c8378..8fbb09c761a7 100644
> > > > > > > --- a/fs/ceph/caps.c
> > > > > > > +++ b/fs/ceph/caps.c
> > > > > > > @@ -2682,6 +2682,28 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > >
> > > > > > > +bool ceph_get_caps_for_unlink(struct inode *dir, struct dentry *dentry)
> > > > > > > +{
> > > > > > > +       int err, got;
> > > > > > > +       struct ceph_inode_info *ci = ceph_inode(d_inode(dentry));
> > > > > > > +
> > > > > > > +       /* Ensure we have Lx on the inode being unlinked */
> > > > > > > +       err = try_get_cap_refs(ci, 0, CEPH_CAP_LINK_EXCL, 0, true, &got);
> > > > > > > +       dout("Lx on %p err=%d got=%d\n", dentry, err, got);
> > > > > > > +       if (err != 1 || !(got & CEPH_CAP_LINK_EXCL))
> > > > > > > +               return false;
> > > > > > > +
> > > > > > > +       /* Do we have Fx on the dir ? */
> > > > > > > +       err = try_get_cap_refs(ceph_inode(dir), 0, CEPH_CAP_FILE_EXCL, 0,
> > > > > > > +                              true, &got);
> > > > > > > +       dout("Fx on %p err=%d got=%d\n", dir, err, got);
> > > > > > > +       if (err != 1 || !(got & CEPH_CAP_FILE_EXCL)) {
> > > > > > > +               ceph_put_cap_refs(ci, CEPH_CAP_LINK_EXCL);
> > > > > > > +               return false;
> > > > > > > +       }
> > > > > > > +       return true;
> > > > > > > +}
> > > > > > Holding caps for a request may cause deadlock. For example:
> > > > > >
> > > > > > - client holds Fx caps and sends an unlink request
> > > > > > - mds processes a request from another client; it changes the
> > > > > >   filelock's state to EXCL_FOO and revokes the Fx caps
> > > > > > - mds receives the unlink request; it can't process it because it
> > > > > >   can't acquire a wrlock on the filelock
> > > > > >
> > > > > > The filelock state stays in EXCL_FOO because the client does not
> > > > > > release its Fx caps.
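
As an aside for readers following along: the fs/ceph/dir.c hunk that actually
issues the async call is not quoted above. Purely to picture the reference
handling the commit message describes ("put those references in the r_callback
routine", "set a writeback error on the directory inode"), here is a rough,
hypothetical sketch of what such a completion callback might look like. The
ceph_async_unlink_cb name, the r_old_inode field, and the exact error handling
are assumptions for illustration, not the posted patch.

/*
 * Hypothetical sketch only -- NOT the actual fs/ceph/dir.c change (that
 * hunk is not quoted above).  It just illustrates where the extra inode
 * and cap references taken for an async unlink would be put back, and
 * how a failure could be recorded so a later fsync on the dir sees it.
 * The r_old_inode field and the error handling here are assumptions.
 */
static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
				 struct ceph_mds_request *req)
{
	struct inode *dir = req->r_parent;		/* parent directory */
	struct inode *inode = req->r_old_inode;		/* assumed: unlinked inode */
	int err = req->r_err;

	/* If the MDS failed the unlink, let a later fsync(dir) report it */
	if (err && err != -ENOENT)
		mapping_set_error(dir->i_mapping, err);

	/* Drop the cap references pinned by ceph_get_caps_for_unlink() */
	ceph_put_cap_refs(ceph_inode(inode), CEPH_CAP_LINK_EXCL);
	ceph_put_cap_refs(ceph_inode(dir), CEPH_CAP_FILE_EXCL);

	/* Drop the extra reference held on the unlinked inode */
	iput(inode);
}
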
> > > > > >
> > > > > The client doing the unlink may have received a revoke for Fx on the
> > > > > dir at that point, but it won't have returned it yet. Shouldn't it
> > > > > still be considered to hold Fx on the dir until that happens?
> > > > >
> > > > The client should release the Fx. But there is a problem: the mds may
> > > > process other requests first after it gets the release of Fx.
> > > >
> > > As I envisioned it, the client would hold a reference to Fx while the
> > > unlink is in flight, so it would not return Fx until after the unlink
> > > has gotten an unsafe reply.
> > >
> > > It seems like when the MDS tries to get the locks for the request from
> > > the second client and fails, it should then drop any locks it did get
> > > and requeue the request to be run when the caps are returned. If we're
> > > blocking for these locks, then we're tying up a thread while we wait
> > > on a cap release, which seems really inefficient.
> > >
> > > That said, I don't have a great understanding of the MDS code, so maybe
> > > I'm missing something.
> > >
> > > > > > I checked the MDS code; supporting async operations is more complex
> > > > > > than I thought. For the unlink case, the mds needs to acquire locks
> > > > > > other than the filelock and linklock.
> > > > > >
> > > > > > - Request A from client1 has acquired lock X and is waiting for a
> > > > > >   wrlock of the filelock.
> > > > > > - Client2 sends async unlink request 'B' to the mds.
> > > > > > - The MDS processes request 'B'. It needs to lock X and the filelock.
> > > > > >   But it can't get lock X because request A holds it. Request A can't
> > > > > >   get the filelock because Client2 holds Fx caps.
> > > > > >
> > > > > This sounds like a bog-standard ABBA ordering issue.
> > > > >
> > > > > If any request can't get any of the locks it needs, shouldn't it just
> > > > > unwind everything at that point and try again later? In fact, it
> > > > > sort of looks like that's what the MDS does, but the Locker code in
> > > > > the MDS is really hard to understand.
> > > > >
> > > > In some special cases, the mds does that. For normal cases, the mds just
> > > > acquires locks in a fixed order. See Locker::acquire_locks and
> > > > SimpleLock::ptr_lt.
> > > >
> > > I sort of see. Ok.
> > >
> > > > > > I think a way to handle this is delegating locks held by an unsafe
> > > > > > request to the client. For example:
> > > > > >
> > > > > > - The mds processes an unlink request from client A, sends an unsafe
> > > > > >   reply and delegates the acquired locks to the client.
> > > > > > - The client gets the delegation; it can do async unlinks in the same
> > > > > >   directory as long as it holds the delegation.
> > > > > >
> > > > > So this would be a proposed new object that we'd be handing to
> > > > > clients? I'd have to understand what this means in practice.
> > > > >
> > > > Requests that unlink in the same directory should acquire similar locks
> > > > (the only difference is the xlock on the dentry). When sending the
> > > > unsafe reply for an unlink request, the mds keeps the acquired locks
> > > > for later unlinks in the same directory. The unsafe reply includes a
> > > > delegation for unlink operations in the directory. The client can do
> > > > unlinks locally while it holds the delegation. (The mds keeps the locks
> > > > for the client as long as it holds the delegation.) This is my rough
> > > > idea; I'm still thinking about how to handle the xlock on the dentry.
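
A small generic aside on the fixed lock-ordering point above
(Locker::acquire_locks / SimpleLock::ptr_lt): the idea can be sketched in a
few lines of plain C with pthreads. This is not Ceph code, just the textbook
trick, and lock_pair_ordered is a made-up name: if every path takes its locks
in one agreed global order (here, by address), the A-then-B versus B-then-A
interleaving that produces the ABBA deadlock cannot occur.

#include <pthread.h>
#include <stdint.h>

/*
 * Not MDS code -- a minimal illustration of taking two locks in a fixed
 * global order (compare SimpleLock::ptr_lt, which orders by pointer).
 * Because every caller agrees on the order, an ABBA deadlock is impossible.
 */
static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {	/* always take the "lesser" lock first */
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}
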
> > > >
> > > I think we might be better off reworking cap handling semantics rather
> > > than layering a new mechanism on top for this. That said, I keep
> > > finding ways that the cap subsystem doesn't work like I'd think it
> > > would, so maybe I'm wrong here.
> > >
> > please be more specific. what semantics do you want to have?
> >
> Basically, at a high level:
>
> 1) have the MDS avoid taking a lock when processing a request from a
> client that holds corresponding caps. We may need to take steps to
> prevent races here, particularly to deal with clients that may not
> hold references to caps when issuing a request. For instance, a client
> could return Fx caps on the dir while it has an unlink request
> outstanding. In that event, we'd need to ensure that nothing else can
> take the filelock until that request is done.
>
> 2) when the MDS is processing a request and can't get a lock because
> it's waiting on a client to return a corresponding cap, then it should
> release any locks taken so far before requeueing the request. It seems
> like that should be enough to prevent the sort of deadlock you
> outlined yesterday, but I don't know how difficult that would be. We
> may also need to take steps to avoid livelocks. For instance, ensure
> that we don't hand out caps corresponding to locks that a requeued
> call will need until it has been completed.
> FYI: https://github.com/ceph/ceph/pull/27648
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
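
A rough sketch of the shape of point 2 above, with the caveat that this is
generic C and not how the MDS Locker is actually structured (try_acquire_all
is a made-up name): try each needed lock without blocking, and if any attempt
fails, back out completely so the request can be requeued once the outstanding
cap release comes in, instead of tying up a thread.

#include <pthread.h>
#include <stdbool.h>

/*
 * Generic sketch, not MDS code: attempt every lock a request needs; if
 * any of them cannot be taken immediately, release whatever was taken
 * so far and report failure so the caller can requeue the request.
 */
static bool try_acquire_all(pthread_mutex_t **locks, int n)
{
	for (int i = 0; i < n; i++) {
		if (pthread_mutex_trylock(locks[i]) != 0) {
			/* back out: drop everything acquired so far */
			while (--i >= 0)
				pthread_mutex_unlock(locks[i]);
			return false;	/* caller requeues and retries later */
		}
	}
	return true;		/* all locks held; process the request */
}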