On Wed, Apr 10, 2019 at 10:21 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
>
> On Wed, Apr 10, 2019 at 10:11 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 10, 2019 at 9:22 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 9:06 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 3:44 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > >
> > > > > When performing an unlink, if we have Fx on the parent directory and
> > > > > Lx on the inode of the dentry being unlinked, then we don't need to
> > > > > wait on the reply before returning.
> > > > >
> > > > > In that situation, just fix up the dcache and link count and return
> > > > > immediately after issuing the call to the MDS. This does mean that we
> > > > > need to hold an extra reference to the inode being unlinked, and extra
> > > > > references to the caps to avoid races. Put those references in the
> > > > > r_callback routine.
> > > > >
> > > > > If the operation ends up failing, then set a writeback error on the
> > > > > directory inode that can be fetched later by an fsync on the dir.
> > > > >
> > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/ceph/caps.c  | 22 ++++++++++++++++++++++
> > > > >  fs/ceph/dir.c   | 38 +++++++++++++++++++++++++++++++++++---
> > > > >  fs/ceph/super.h |  1 +
> > > > >  3 files changed, 58 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > index d022e15c8378..8fbb09c761a7 100644
> > > > > --- a/fs/ceph/caps.c
> > > > > +++ b/fs/ceph/caps.c
> > > > > @@ -2682,6 +2682,28 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > > +bool ceph_get_caps_for_unlink(struct inode *dir, struct dentry *dentry)
> > > > > +{
> > > > > +       int err, got;
> > > > > +       struct ceph_inode_info *ci = ceph_inode(d_inode(dentry));
> > > > > +
> > > > > +       /* Ensure we have Lx on the inode being unlinked */
> > > > > +       err = try_get_cap_refs(ci, 0, CEPH_CAP_LINK_EXCL, 0, true, &got);
> > > > > +       dout("Lx on %p err=%d got=%d\n", dentry, err, got);
> > > > > +       if (err != 1 || !(got & CEPH_CAP_LINK_EXCL))
> > > > > +               return false;
> > > > > +
> > > > > +       /* Do we have Fx on the dir ? */
> > > > > +       err = try_get_cap_refs(ceph_inode(dir), 0, CEPH_CAP_FILE_EXCL, 0,
> > > > > +                              true, &got);
> > > > > +       dout("Fx on %p err=%d got=%d\n", dir, err, got);
> > > > > +       if (err != 1 || !(got & CEPH_CAP_FILE_EXCL)) {
> > > > > +               ceph_put_cap_refs(ci, CEPH_CAP_LINK_EXCL);
> > > > > +               return false;
> > > > > +       }
> > > > > +       return true;
> > > > > +}
> > > >
> > > > Holding caps for a request may cause deadlock. For example:
> > > >
> > > > - the client holds Fx caps and sends an unlink request
> > > > - the mds processes a request from another client; it changes
> > > >   filelock's state to EXCL_FOO and revokes the Fx caps
> > > > - the mds receives the unlink request, but can't process it because
> > > >   it can't acquire the wrlock on filelock
> > > >
> > > > The filelock state stays in EXCL_FOO because the client does not
> > > > release its Fx caps.
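
(To make the shape of that cycle concrete: here is the sequence above
reduced to two threads and two plain mutexes. This is not ceph code --
every name is made up, "cap" stands in for the client's pinned Fx
reference and "filelock" for the mds lock state. Build with
"cc -pthread"; it deadlocks by design.)

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t cap = PTHREAD_MUTEX_INITIALIZER;      /* pinned Fx */
static pthread_mutex_t filelock = PTHREAD_MUTEX_INITIALIZER; /* mds filelock */

static void *client_thread(void *arg)
{
        pthread_mutex_lock(&cap);       /* pin Fx, send the async unlink */
        sleep(1);                       /* request is in flight */
        pthread_mutex_lock(&filelock);  /* "reply": needs wrlock on filelock */
        pthread_mutex_unlock(&filelock);
        pthread_mutex_unlock(&cap);
        return NULL;
}

static void *mds_thread(void *arg)
{
        pthread_mutex_lock(&filelock);  /* other request moved filelock to EXCL_FOO */
        sleep(1);
        pthread_mutex_lock(&cap);       /* revoke Fx: blocks forever */
        pthread_mutex_unlock(&cap);
        pthread_mutex_unlock(&filelock);
        return NULL;
}

int main(void)
{
        pthread_t c, m;

        pthread_create(&c, NULL, client_thread, NULL);
        pthread_create(&m, NULL, mds_thread, NULL);
        pthread_join(c, NULL);          /* never returns: classic AB-BA cycle */
        pthread_join(m, NULL);
        return 0;
}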
> > >
> > > The client doing the unlink may have received a revoke for Fx on the
> > > dir at that point, but it won't have returned it yet. Shouldn't it
> > > still be considered to hold Fx on the dir until that happens?
> > >
> >
> > The client should release the Fx. But there is a problem: the mds may
> > process other requests first after it gets the release of Fx.
> >
>
> As I envisioned it, the client would hold a reference to Fx while the
> unlink is in flight, so it would not return Fx until after the unlink
> has gotten an unsafe reply.
>
> It seems like when the MDS tries to get the locks for the request from
> the second client and fails, it should then drop any locks it did get
> and requeue the request to be run when the caps are returned. If we're
> blocking for these locks, then we're tying up a thread while we wait
> on a cap release, which seems really inefficient.
>
> That said, I don't have a great understanding of the MDS code, so
> maybe I'm missing something.
>
> > > > I checked the MDS code; supporting async operations is more complex
> > > > than I thought. For the unlink case, the mds needs to acquire locks
> > > > other than filelock and linklock:
> > > >
> > > > - Request A from client1 has acquired lock X and is waiting for the
> > > >   wrlock of filelock.
> > > > - Client2 sends async unlink request 'B' to the mds.
> > > > - The MDS processes request 'B'. It needs lock X and filelock. It
> > > >   can't get lock X because request A holds it, and request A can't
> > > >   get filelock because client2 holds Fx caps.
> > > >
> > >
> > > This sounds like a bog-standard ABBA ordering issue.
> > >
> > > If a request can't get one of the locks it needs, shouldn't it just
> > > unwind everything at that point and try again later? In fact, it sort
> > > of looks like that's what the MDS does, but the Locker code in the
> > > MDS is really hard to understand.
> > >
> >
> > In some special cases, the mds does that. For normal cases, the mds
> > just acquires locks in a fixed order. See Locker::acquire_locks and
> > SimpleLock::ptr_lt.
> >
>
> I sort of see. Ok.
>
> > > > I think a way to handle this is delegating the locks held by an
> > > > unsafe request to the client. For example:
> > > >
> > > > - the mds processes an unlink request from client A, sends the
> > > >   unsafe reply, and delegates the acquired locks to the client
> > > > - the client gets the delegation; it can do async unlinks in the
> > > >   same directory as long as it holds the delegation
> > > >
> > >
> > > So this would be a proposed new object that we'd be handing to
> > > clients? I'd have to understand what this means in practice.
> > >
> >
> > Requests that do unlinks in the same directory should acquire similar
> > locks (the only difference is the xlock on the dentry). When sending
> > the unsafe reply for an unlink request, the mds keeps the acquired
> > locks for later unlinks in the same directory. The unsafe reply
> > includes a delegation for unlink operations in the directory. The
> > client can do unlinks locally while it holds the delegation. (The mds
> > keeps the locks for the client as long as it holds the delegation.)
> > This is my rough idea; I'm still thinking about how to handle the
> > xlock on the dentry.
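
(Purely hypothetical sketch of what such a delegation might carry --
nothing like this exists in the tree, the struct and all field names
are made up for illustration, and it leans on the kernel's u64/u32
types from <linux/types.h>:)

struct ceph_unlink_delegation {
        u64 deleg_id;   /* handle the mds can use to recall/match it */
        u64 dir_ino;    /* directory the delegation covers */
        u32 seq;        /* bumped each time the mds wants the locks back */
        u32 flags;      /* e.g. whether rmdir is covered as well */
};

The client would then be allowed to complete unlinks in dir_ino locally
until the mds recalls the delegation, which is close to the Fx-reference
scheme above but with an explicit server-side handle.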
>
> I think we might be better off reworking cap handling semantics rather
> than layering a new mechanism on top for this. That said, I keep
> finding ways that the cap subsystem doesn't work like I'd think it
> would, so maybe I'm wrong here.
>

Please be more specific. What semantics do you want to have?

Regards
Yan, Zheng

> > > > > +
> > > > >  /*
> > > > >   * Check the offset we are writing up to against our current
> > > > >   * max_size. If necessary, tell the MDS we want to write to
> > > > > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > > > > index 3eb9bc226b77..386c9439a020 100644
> > > > > --- a/fs/ceph/dir.c
> > > > > +++ b/fs/ceph/dir.c
> > > > > @@ -1029,6 +1029,18 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
> > > > >         return err;
> > > > >  }
> > > > >
> > > > > +static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> > > > > +                                struct ceph_mds_request *req)
> > > > > +{
> > > > > +       struct ceph_inode_info *ci = ceph_inode(req->r_old_inode);
> > > > > +
> > > > > +       /* If op failed, set error on parent directory */
> > > > > +       mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > > > > +       ceph_put_cap_refs(ci, CEPH_CAP_LINK_EXCL);
> > > > > +       ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_FILE_EXCL);
> > > > > +       iput(req->r_old_inode);
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * rmdir and unlink are differ only by the metadata op code
> > > > >   */
> > > > > @@ -1065,9 +1077,29 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
> > > > >         req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
> > > > >         req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> > > > >         req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
> > > > > -       err = ceph_mdsc_do_request(mdsc, dir, req);
> > > > > -       if (!err && !req->r_reply_info.head->is_dentry)
> > > > > -               d_delete(dentry);
> > > > > +
> > > > > +       if (op == CEPH_MDS_OP_UNLINK &&
> > > > > +           ceph_get_caps_for_unlink(dir, dentry)) {
> > > > > +               /* Keep LINK caps to ensure continuity over async call */
> > > > > +               req->r_inode_drop &= ~(CEPH_CAP_LINK_SHARED|CEPH_CAP_LINK_EXCL);
> > > > > +               req->r_callback = ceph_async_unlink_cb;
> > > > > +               req->r_old_inode = d_inode(dentry);
> > > > > +               ihold(req->r_old_inode);
> > > > > +               err = ceph_mdsc_submit_request(mdsc, dir, req);
> > > > > +               if (!err) {
> > > > > +                       /*
> > > > > +                        * We have enough caps, so we assume that the unlink
> > > > > +                        * will succeed. Fix up the target inode and dcache.
> > > > > +                        */
> > > > > +                       drop_nlink(inode);
> > > > > +                       d_delete(dentry);
> > > > > +               }
> > > > > +       } else {
> > > > > +               err = ceph_mdsc_do_request(mdsc, dir, req);
> > > > > +               if (!err && !req->r_reply_info.head->is_dentry)
> > > > > +                       d_delete(dentry);
> > > > > +       }
> > > > > +
> > > > >         ceph_mdsc_put_request(req);
> > > > >  out:
> > > > >         return err;
> > > > > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > > > > index 3c5608f2108a..5c361dc1f47f 100644
> > > > > --- a/fs/ceph/super.h
> > > > > +++ b/fs/ceph/super.h
> > > > > @@ -1033,6 +1033,7 @@ extern int ceph_get_caps(struct ceph_inode_info *ci, int need, int want,
> > > > >                          loff_t endoff, int *got, struct page **pinned_page);
> > > > >  extern int ceph_try_get_caps(struct ceph_inode_info *ci,
> > > > >                              int need, int want, bool nonblock, int *got);
> > > > > +extern bool ceph_get_caps_for_unlink(struct inode *dir, struct dentry *dentry);
> > > > >
> > > > >  /* for counting open files by mode */
> > > > >  extern void __ceph_get_fmode(struct ceph_inode_info *ci, int mode);
> > > > > --
> > > > > 2.20.1
> > > > >
> > > >
> > >
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
> >
>
> --
> Jeff Layton <jlayton@xxxxxxxxxxxxxxx>