On Fri, Mar 27, 2020 at 12:58 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > I had mentioned this in standup this morning, but it's a bit of a > complex topic and Zheng asked me to send email instead. I'm also cc'ing > ceph-devel for posterity... > > The locking in the cap handling code is extremely hairy, with many > places where we need to take sleeping locks while we're in atomic > context (under spinlock, mostly). A lot of the problem is due to the > need to take the session->s_mutex. > > For instance, there's this in ceph_check_caps: > > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 1999) if (session && session != cap->session) { > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2000) dout("oops, wrong session %p mutex\n", session); > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2001) mutex_unlock(&session->s_mutex); > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2002) session = NULL; > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2003) } > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2004) if (!session) { > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2005) session = cap->session; > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2006) if (mutex_trylock(&session->s_mutex) == 0) { > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2007) dout("inverting session/ino locks on %p\n", > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2008) session); > be655596b3de5 (Sage Weil 2011-11-30 09:47:09 -0800 2009) spin_unlock(&ci->i_ceph_lock); > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2010) if (took_snap_rwsem) { > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2011) up_read(&mdsc->snap_rwsem); > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2012) took_snap_rwsem = 0; > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2013) } > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2014) mutex_lock(&session->s_mutex); > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2015) goto retry; > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2016) } > a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2017) } > > At this point, we're walking the inode's caps rbtree, while holding the > inode->i_ceph_lock. We're eventually going to need to send a cap message > to the MDS for this cap, but that requires the cap->session->s_mutex. We > try to take it without blocking first, but if that fails, we have to > unwind all of the locking and start over. Gross. That also makes the > handling of snap_rwsem much more complex than it really should be too. > > It does this, despite the fact that the cap message doesn't actually > need much from the session (just the session->s_con, mostly). Most of > the info in the message comes from the inode and cap objects. > > My question is: What is the s_mutex guaranteeing at this point? > > More to the point, is it strictly required that we hold that mutex as we > marshal up the outgoing request? It would be much cleaner to be able to > just drop the spinlock after getting the ceph_msg_args ready to send, > then take the session mutex and send the request. > > The state of the MDS session is not checked in this codepath before the > send, so it doesn't seem like ordering vs. session state messages is > very important. This _is_ ordered vs. regular MDS requests, but a > per-session mutex seems like a very heavyweight way to do that. > > If we're concerned about reordering cap messages that involve the same > inode, then there are other ways to ensure that ordering that don't > require a coarse-grained mutex. > > It's just not clear to me what data this mutex is protecting in this > case. I think it's mainly for message ordering. For example, a request may release multiple inodes' caps (by ceph_encode_inode_release). Before sending the request out, we need to prevent ceph_check_caps() from touch these inodes' caps and sending cap messages. > > Any hints would be welcome! > -- > Jeff Layton <jlayton@xxxxxxxxxx> >