I had mentioned this in standup this morning, but it's a bit of a complex topic and Zheng asked me to send email instead. I'm also cc'ing ceph-devel for posterity...

The locking in the cap handling code is extremely hairy, with many places where we need to take sleeping locks while we're in atomic context (under spinlock, mostly). A lot of the problem is due to the need to take the session->s_mutex. For instance, there's this in ceph_check_caps:

a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 1999)         if (session && session != cap->session) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2000)                 dout("oops, wrong session %p mutex\n", session);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2001)                 mutex_unlock(&session->s_mutex);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2002)                 session = NULL;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2003)         }
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2004)         if (!session) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2005)                 session = cap->session;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2006)                 if (mutex_trylock(&session->s_mutex) == 0) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2007)                         dout("inverting session/ino locks on %p\n",
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2008)                              session);
be655596b3de5 (Sage Weil 2011-11-30 09:47:09 -0800 2009)                         spin_unlock(&ci->i_ceph_lock);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2010)                         if (took_snap_rwsem) {
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2011)                                 up_read(&mdsc->snap_rwsem);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2012)                                 took_snap_rwsem = 0;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2013)                         }
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2014)                         mutex_lock(&session->s_mutex);
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2015)                         goto retry;
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2016)                 }
a8599bd821d08 (Sage Weil 2009-10-06 11:31:12 -0700 2017)         }

At this point, we're walking the inode's caps rbtree while holding the inode->i_ceph_lock. We're eventually going to need to send a cap message to the MDS for this cap, but that requires the cap->session->s_mutex. We try to take it without blocking first, but if that fails, we have to unwind all of the locking and start over. Gross. That also makes the handling of snap_rwsem much more complex than it should be.

We do all of this despite the fact that the cap message doesn't actually need much from the session (just the session->s_con, mostly). Most of the info in the message comes from the inode and cap objects.

My question is: what is the s_mutex guaranteeing at this point? More to the point, is it strictly required that we hold that mutex as we marshal up the outgoing request? It would be much cleaner to be able to just drop the spinlock after getting the ceph_msg_args ready to send, then take the session mutex and send the request (rough sketch in the P.S. below).

The state of the MDS session is not checked in this codepath before the send, so it doesn't seem like ordering vs. session state messages is very important. This _is_ ordered vs. regular MDS requests, but a per-session mutex seems like a very heavyweight way to do that.

If we're concerned about reordering cap messages that involve the same inode, then there are other ways to ensure that ordering that don't require a coarse-grained mutex. It's just not clear to me what data this mutex is protecting in this case.

Any hints would be welcome!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
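
P.S. To make the proposal above concrete, here's roughly the shape I have in mind. This is only a sketch: __prep_cap_msg() and __send_cap_msg() are made-up names standing in for "copy what the message needs out of the inode/cap" and "build the ceph_msg and queue it on session->s_con", and the exact shape of the marshaled-args struct is hand-waved:

static void send_cap_msg_sketch(struct ceph_inode_info *ci,
                                struct ceph_cap *cap)
{
        struct ceph_msg_args arg;       /* the marshaled args, whatever
                                         * shape they end up taking */
        struct ceph_mds_session *session;

        spin_lock(&ci->i_ceph_lock);
        /* Marshal everything the message needs from the inode and cap
         * objects while the spinlock is held -- none of this needs to
         * sleep. */
        __prep_cap_msg(&arg, ci, cap);
        session = ceph_get_mds_session(cap->session);
        spin_unlock(&ci->i_ceph_lock);

        /* Only now take the sleeping lock, with no spinlocks held, so
         * there's no trylock/unwind/goto-retry dance and no juggling
         * of snap_rwsem on the failure path. */
        mutex_lock(&session->s_mutex);
        __send_cap_msg(session, &arg);
        mutex_unlock(&session->s_mutex);
        ceph_put_mds_session(session);
}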
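
And on the same-inode ordering point: one equally hypothetical alternative to a coarse mutex would be to establish the ordering under i_ceph_lock itself, e.g. by stamping each marshaled message with a per-inode sequence number. i_cap_msg_seq and inode_seq below are invented names, not anything in the tree:

        spin_lock(&ci->i_ceph_lock);
        __prep_cap_msg(&arg, ci, cap);
        /* The stamp is taken under i_ceph_lock, so it fixes the
         * relative order of all cap messages for this inode. */
        arg.inode_seq = ++ci->i_cap_msg_seq;
        spin_unlock(&ci->i_ceph_lock);

The transmit side would then just have to emit messages for a given inode in inode_seq order (say, by draining a small per-inode list), with no session-wide serialization needed.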