On Thu, Apr 11, 2019 at 7:31 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Wed, Apr 10, 2019 at 11:10 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 10, 2019 at 10:21 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 10:11 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 9:22 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Apr 10, 2019 at 9:06 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Apr 10, 2019 at 3:44 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > When performing an unlink, if we have Fx on the parent directory and
> > > > > > > Lx on the inode of the dentry being unlinked, then we don't need to
> > > > > > > wait on the reply before returning.
> > > > > > >
> > > > > > > In that situation, just fix up the dcache and link count and return
> > > > > > > immediately after issuing the call to the MDS. This does mean that we
> > > > > > > need to hold an extra reference to the inode being unlinked, and extra
> > > > > > > references to the caps to avoid races. Put those references in the
> > > > > > > r_callback routine.
> > > > > > >
> > > > > > > If the operation ends up failing, then set a writeback error on the
> > > > > > > directory inode that can be fetched later by an fsync on the dir.
> > > > > > >
> > > > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > > > ---
> > > > > > >  fs/ceph/caps.c  | 22 ++++++++++++++++++++++
> > > > > > >  fs/ceph/dir.c   | 38 +++++++++++++++++++++++++++++++++++---
> > > > > > >  fs/ceph/super.h |  1 +
> > > > > > >  3 files changed, 58 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > > > index d022e15c8378..8fbb09c761a7 100644
> > > > > > > --- a/fs/ceph/caps.c
> > > > > > > +++ b/fs/ceph/caps.c
> > > > > > > @@ -2682,6 +2682,28 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > >
> > > > > > > +bool ceph_get_caps_for_unlink(struct inode *dir, struct dentry *dentry)
> > > > > > > +{
> > > > > > > +       int err, got;
> > > > > > > +       struct ceph_inode_info *ci = ceph_inode(d_inode(dentry));
> > > > > > > +
> > > > > > > +       /* Ensure we have Lx on the inode being unlinked */
> > > > > > > +       err = try_get_cap_refs(ci, 0, CEPH_CAP_LINK_EXCL, 0, true, &got);
> > > > > > > +       dout("Lx on %p err=%d got=%d\n", dentry, err, got);
> > > > > > > +       if (err != 1 || !(got & CEPH_CAP_LINK_EXCL))
> > > > > > > +               return false;
> > > > > > > +
> > > > > > > +       /* Do we have Fx on the dir ? */
> > > > > > > +       err = try_get_cap_refs(ceph_inode(dir), 0, CEPH_CAP_FILE_EXCL, 0,
> > > > > > > +                              true, &got);
> > > > > > > +       dout("Fx on %p err=%d got=%d\n", dir, err, got);
> > > > > > > +       if (err != 1 || !(got & CEPH_CAP_FILE_EXCL)) {
> > > > > > > +               ceph_put_cap_refs(ci, CEPH_CAP_LINK_EXCL);
> > > > > > > +               return false;
> > > > > > > +       }
> > > > > > > +       return true;
> > > > > > > +}
> > > > > > Holding caps for a request may cause deadlock. For example:
> > > > > >
> > > > > > - client holds Fx caps and sends an unlink request
> > > > > > - mds processes a request from another client; it changes the
> > > > > >   filelock's state to EXCL_FOO and revokes the Fx caps
> > > > > > - mds receives the unlink request; it can't process it because it
> > > > > >   can't acquire a wrlock on the filelock
> > > > > >
> > > > > > The filelock state stays in EXCL_FOO because the client does not
> > > > > > release its Fx caps.
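
As an aside for readers following along: the fs/ceph/dir.c hunk that actually
issues the async call is not quoted above. Purely to picture the reference
handling the commit message describes ("put those references in the r_callback
routine", "set a writeback error on the directory inode"), here is a rough,
hypothetical sketch of what such a completion callback might look like. The
ceph_async_unlink_cb name, the r_old_inode field, and the exact error handling
are assumptions for illustration, not the posted patch.

/*
 * Hypothetical sketch only -- NOT the actual fs/ceph/dir.c change (that
 * hunk is not quoted above).  It just illustrates where the extra inode
 * and cap references taken for an async unlink would be put back, and
 * how a failure could be recorded so a later fsync on the dir sees it.
 * The r_old_inode field and the error handling here are assumptions.
 */
static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
				 struct ceph_mds_request *req)
{
	struct inode *dir = req->r_parent;		/* parent directory */
	struct inode *inode = req->r_old_inode;		/* assumed: unlinked inode */
	int err = req->r_err;

	/* If the MDS failed the unlink, let a later fsync(dir) report it */
	if (err && err != -ENOENT)
		mapping_set_error(dir->i_mapping, err);

	/* Drop the cap references pinned by ceph_get_caps_for_unlink() */
	ceph_put_cap_refs(ceph_inode(inode), CEPH_CAP_LINK_EXCL);
	ceph_put_cap_refs(ceph_inode(dir), CEPH_CAP_FILE_EXCL);

	/* Drop the extra reference held on the unlinked inode */
	iput(inode);
}
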
> > > > > >
> > > > > The client doing the unlink may have received a revoke for Fx on the
> > > > > dir at that point, but it won't have returned it yet. Shouldn't it
> > > > > still be considered to hold Fx on the dir until that happens?
> > > > >
> > > > The client should release the Fx. But there is a problem: the mds may
> > > > process other requests first after it gets the release of Fx.
> > > >
> > > As I envisioned it, the client would hold a reference to Fx while the
> > > unlink is in flight, so it would not return Fx until after the unlink
> > > has gotten an unsafe reply.
> > >
> > > It seems like when the MDS tries to get the locks for the request from
> > > the second client and fails, it should then drop any locks it did get
> > > and requeue the request to be run when the caps are returned. If we're
> > > blocking for these locks, then we're tying up a thread while we wait
> > > on a cap release, which seems really inefficient.
> > >
> > > That said, I don't have a great understanding of the MDS code, so maybe
> > > I'm missing something.
> > >
> > > > > > I checked the MDS code; supporting async operations is more complex
> > > > > > than I thought. For the unlink case, the mds needs to acquire locks
> > > > > > other than the filelock and linklock.
> > > > > >
> > > > > > - Request A from client1 has acquired lock X and is waiting for a
> > > > > >   wrlock of the filelock.
> > > > > > - Client2 sends async unlink request 'B' to the mds.
> > > > > > - The MDS processes request 'B'. It needs to lock X and the filelock.
> > > > > >   But it can't get lock X because request A holds it. Request A can't
> > > > > >   get the filelock because Client2 holds Fx caps.
> > > > > >
> > > > > This sounds like a bog-standard ABBA ordering issue.
> > > > >
> > > > > If any request can't get any of the locks it needs, shouldn't it just
> > > > > unwind everything at that point and try again later? In fact, it
> > > > > sort of looks like that's what the MDS does, but the Locker code in
> > > > > the MDS is really hard to understand.
> > > > >
> > > > In some special cases, the mds does that. For normal cases, the mds just
> > > > acquires locks in a fixed order. See Locker::acquire_locks and
> > > > SimpleLock::ptr_lt.
> > > >
> > > I sort of see. Ok.
> > >
> > > > > > I think a way to handle this is delegating locks held by an unsafe
> > > > > > request to the client. For example:
> > > > > >
> > > > > > - The mds processes an unlink request from client A, sends an unsafe
> > > > > >   reply and delegates the acquired locks to the client.
> > > > > > - The client gets the delegation; it can do async unlinks in the same
> > > > > >   directory as long as it holds the delegation.
> > > > > >
> > > > > So this would be a proposed new object that we'd be handing to
> > > > > clients? I'd have to understand what this means in practice.
> > > > >
> > > > Requests that unlink in the same directory should acquire similar locks
> > > > (the only difference is the xlock on the dentry). When sending the
> > > > unsafe reply for an unlink request, the mds keeps the acquired locks
> > > > for later unlinks in the same directory. The unsafe reply includes a
> > > > delegation for unlink operations in the directory. The client can do
> > > > unlinks locally while it holds the delegation. (The mds keeps the locks
> > > > for the client as long as it holds the delegation.) This is my rough
> > > > idea; I'm still thinking about how to handle the xlock on the dentry.
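
A small generic aside on the fixed lock-ordering point above
(Locker::acquire_locks / SimpleLock::ptr_lt): the idea can be sketched in a
few lines of plain C with pthreads. This is not Ceph code, just the textbook
trick, and lock_pair_ordered is a made-up name: if every path takes its locks
in one agreed global order (here, by address), the A-then-B versus B-then-A
interleaving that produces the ABBA deadlock cannot occur.

#include <pthread.h>
#include <stdint.h>

/*
 * Not MDS code -- a minimal illustration of taking two locks in a fixed
 * global order (compare SimpleLock::ptr_lt, which orders by pointer).
 * Because every caller agrees on the order, an ABBA deadlock is impossible.
 */
static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {	/* always take the "lesser" lock first */
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}
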
> > > >
> > > I think we might be better off reworking cap handling semantics rather
> > > than layering a new mechanism on top for this. That said, I keep
> > > finding ways that the cap subsystem doesn't work like I'd think it
> > > would, so maybe I'm wrong here.
> > >
> > please be more specific. what semantics do you want to have?
> >
> Basically, at a high level:
>
> 1) have the MDS avoid taking a lock when processing a request from a
> client that holds corresponding caps. We may need to take steps to
> prevent races here, particularly to deal with clients that may not
> hold references to caps when issuing a request. For instance, a client
> could return Fx caps on the dir while it has an unlink request
> outstanding. In that event, we'd need to ensure that nothing else can
> take the filelock until that request is done.
>
> 2) when the MDS is processing a request and can't get a lock because
> it's waiting on a client to return a corresponding cap, then it should
> release any locks taken so far before requeueing the request. It seems
> like that should be enough to prevent the sort of deadlock you
> outlined yesterday, but I don't know how difficult that would be. We
> may also need to take steps to avoid livelocks. For instance, ensure
> that we don't hand out caps corresponding to locks that a requeued
> call will need until it has been completed.
> FYI: https://github.com/ceph/ceph/pull/27648
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
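
A rough sketch of the shape of point 2 above, with the caveat that this is
generic C and not how the MDS Locker is actually structured (try_acquire_all
is a made-up name): try each needed lock without blocking, and if any attempt
fails, back out completely so the request can be requeued once the outstanding
cap release comes in, instead of tying up a thread.

#include <pthread.h>
#include <stdbool.h>

/*
 * Generic sketch, not MDS code: attempt every lock a request needs; if
 * any of them cannot be taken immediately, release whatever was taken
 * so far and report failure so the caller can requeue the request.
 */
static bool try_acquire_all(pthread_mutex_t **locks, int n)
{
	for (int i = 0; i < n; i++) {
		if (pthread_mutex_trylock(locks[i]) != 0) {
			/* back out: drop everything acquired so far */
			while (--i >= 0)
				pthread_mutex_unlock(locks[i]);
			return false;	/* caller requeues and retries later */
		}
	}
	return true;		/* all locks held; process the request */
}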