I had mentioned this in standup this morning, but it's a bit of a complex topic and Zheng asked me to send email instead. I'm also cc'ing ceph-devel for posterity...

The locking in the cap handling code is extremely hairy, with many places where we need to take sleeping locks while we're in atomic context (under spinlock, mostly). A lot of the problem is due to the need to take the session->s_mutex. For instance, there's this in ceph_check_caps:

a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 1999)         if (session && session != cap->session) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2000)                 dout("oops, wrong session %p mutex\n", session);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2001)                 mutex_unlock(&session->s_mutex);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2002)                 session = NULL;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2003)         }
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2004)         if (!session) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2005)                 session = cap->session;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2006)                 if (mutex_trylock(&session->s_mutex) == 0) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2007)                         dout("inverting session/ino locks on %p\n",
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2008)                              session);
be655596b3de5 (Sage Weil 2011-11-30 09:47:09 -0800 2009)                         spin_unlock(&ci->i_ceph_lock);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2010)                         if (took_snap_rwsem) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2011)                                 up_read(&mdsc->snap_rwsem);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2012)                                 took_snap_rwsem = 0;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2013)                         }
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2014)                         mutex_lock(&session->s_mutex);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2015)                         goto retry;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2016)                 }
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2017)         }

At this point, we're walking the inode's caps rbtree while holding the inode->i_ceph_lock. We're eventually going to need to send a cap message to the MDS for this cap, but that requires the cap->session->s_mutex. We try to take it without blocking first, but if that fails, we have to unwind all of the locking and start over. Gross. That also makes the handling of snap_rwsem much more complex than it should be.

We do all of this despite the fact that the cap message doesn't actually need much from the session (just the session->s_con, mostly). Most of the info in the message comes from the inode and cap objects.

My question is: what is the s_mutex guaranteeing at this point? More to the point, is it strictly required that we hold that mutex as we marshal up the outgoing request? It would be much cleaner to be able to just drop the spinlock after getting the ceph_msg_args ready to send, then take the session mutex and send the request (rough sketch in the P.S. below).

The state of the MDS session is not checked in this codepath before the send, so it doesn't seem like ordering vs. session state messages is very important. This _is_ ordered vs. regular MDS requests, but a per-session mutex seems like a very heavyweight way to do that.

If we're concerned about reordering cap messages that involve the same inode, then there are other ways to ensure that ordering that don't require a coarse-grained mutex. It's just not clear to me what data this mutex is protecting in this case.

Any hints would be welcome!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
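
P.S. To make the proposal above concrete, here's roughly the shape I have in mind. This is only a sketch: __prep_cap_msg() and __send_cap_msg() are made-up names standing in for "copy what the message needs out of the inode/cap" and "build the ceph_msg and queue it on session->s_con", and the exact shape of the marshaled-args struct is hand-waved:

static void send_cap_msg_sketch(struct ceph_inode_info *ci,
                                struct ceph_cap *cap)
{
        struct ceph_msg_args arg;       /* the marshaled args, whatever
                                         * shape they end up taking */
        struct ceph_mds_session *session;

        spin_lock(&ci->i_ceph_lock);
        /* Marshal everything the message needs from the inode and cap
         * objects while the spinlock is held -- none of this needs to
         * sleep. */
        __prep_cap_msg(&arg, ci, cap);
        session = ceph_get_mds_session(cap->session);
        spin_unlock(&ci->i_ceph_lock);

        /* Only now take the sleeping lock, with no spinlocks held, so
         * there's no trylock/unwind/goto-retry dance and no juggling
         * of snap_rwsem on the failure path. */
        mutex_lock(&session->s_mutex);
        __send_cap_msg(session, &arg);
        mutex_unlock(&session->s_mutex);
        ceph_put_mds_session(session);
}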
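
And on the same-inode ordering point: one equally hypothetical alternative to a coarse mutex would be to establish the ordering under i_ceph_lock itself, e.g. by stamping each marshaled message with a per-inode sequence number. i_cap_msg_seq and inode_seq below are invented names, not anything in the tree:

        spin_lock(&ci->i_ceph_lock);
        __prep_cap_msg(&arg, ci, cap);
        /* The stamp is taken under i_ceph_lock, so it fixes the
         * relative order of all cap messages for this inode. */
        arg.inode_seq = ++ci->i_cap_msg_seq;
        spin_unlock(&ci->i_ceph_lock);

The transmit side would then just have to emit messages for a given inode in inode_seq order (say, by draining a small per-inode list), with no session-wide serialization needed.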