Re: reducing s_mutex coverage in kcephfs client

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 27 Mar 2020 15:47:24 -0400

On Fri, 2020-03-27 at 22:31 +0800, Yan, Zheng wrote:
> On Fri, Mar 27, 2020 at 12:58 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > I had mentioned this in standup this morning, but it's a bit of a
> > complex topic and Zheng asked me to send email instead. I'm also cc'ing
> > ceph-devel for posterity...
> > 
> > The locking in the cap handling code is extremely hairy, with many
> > places where we need to take sleeping locks while we're in atomic
> > context (under spinlock, mostly). A lot of the problem is due to the
> > need to take the session->s_mutex.
> > 
> > For instance, there's this in ceph_check_caps:
> > 
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 1999)              if (session && session != cap->session) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2000)                      dout("oops, wrong session %p mutex\n", session);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2001)                      mutex_unlock(&session->s_mutex);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2002)                      session = NULL;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2003)              }
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2004)              if (!session) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2005)                      session = cap->session;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2006)                      if (mutex_trylock(&session->s_mutex) == 0) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2007)                              dout("inverting session/ino locks on %p\n",
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2008)                                   session);
> > be655596b3de5 (Sage Weil           2011-11-30 09:47:09 -0800 2009)                              spin_unlock(&ci->i_ceph_lock);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2010)                              if (took_snap_rwsem) {
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2011)                                      up_read(&mdsc->snap_rwsem);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2012)                                      took_snap_rwsem = 0;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2013)                              }
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2014)                              mutex_lock(&session->s_mutex);
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2015)                              goto retry;
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2016)                      }
> > a8599bd821d08 (Sage Weil           2009-10-06 11:31:12 -0700 2017)              }
> > 
> > At this point, we're walking the inode's caps rbtree, while holding the
> > inode->i_ceph_lock. We're eventually going to need to send a cap message
> > to the MDS for this cap, but that requires the cap->session->s_mutex. We
> > try to take it without blocking first, but if that fails, we have to
> > unwind all of the locking and start over. Gross. That also makes the
> > handling of snap_rwsem much more complex than it really should be too.
> > 
> > It does this, despite the fact that the cap message doesn't actually
> > need much from the session (just the session->s_con, mostly). Most of
> > the info in the message comes from the inode and cap objects.
> > 
> > My question is: What is the s_mutex guaranteeing at this point?
> > 
> > More to the point, is it strictly required that we hold that mutex as we
> > marshal up the outgoing request? It would be much cleaner to be able to
> > just drop the spinlock after getting the ceph_msg_args ready to send,
> > then take the session mutex and send the request.
> > 
> > The state of the MDS session is not checked in this codepath before the
> > send, so it doesn't seem like ordering vs. session state messages is
> > very important. This _is_ ordered vs. regular MDS requests, but a
> > per-session mutex seems like a very heavyweight way to do that.
> > 
> > If we're concerned about reordering cap messages that involve the same
> > inode, then there are other ways to ensure that ordering that don't
> > require a coarse-grained mutex.
> > 
> > It's just not clear to me what data this mutex is protecting in this
> > case.
> 
> I think it's mainly for message ordering. For example,  a request may
> release multiple inodes' caps (by ceph_encode_inode_release).  Before
> sending the request out, we need to prevent ceph_check_caps() from
> touch these inodes' caps and sending cap messages.

I don't get it.

AFAICT, ceph_encode_inode_release is called while holding the
mdsc->mutex, not the s_mutex. That is serialized on the i_ceph_lock, but
I don't think there's any guarantee what order (e.g.) a racing cap
update and release would be sent.

--
Jeff Layton <jlayton@xxxxxxxxxx>