On 2019/12/16 21:35, Jeff Layton wrote:
On Mon, 2019-12-16 at 09:57 +0800, Xiubo Li wrote:
On 2019/12/13 1:31, Jeff Layton wrote:
I believe it's possible that we could end up with racing calls to
__ceph_remove_cap for the same cap. If that happens, the cap->ci
pointer will be zeroed out and we can hit a NULL pointer dereference.
Once we acquire the s_cap_lock, check for the ci pointer being NULL,
and just return without doing anything if it is.
URL: https://tracker.ceph.com/issues/43272
Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
fs/ceph/caps.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
This is the only scenario that made sense to me in light of Ilya's
analysis on the tracker above. I could be off here though -- the locking
around this code is horrifically complex, and I could be missing what
should guard against this scenario.
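For the record, the interleaving I have in mind is roughly this (a
sketch of the suspected race, not taken from an actual trace):

    /*
     * Task A                              Task B
     * ------                              ------
     * __ceph_remove_cap(cap)
     *   ci = cap->ci;
     *   ...
     *   spin_lock(&session->s_cap_lock);
     *   cap->ci = NULL;
     *   spin_unlock(&session->s_cap_lock);
     *                                     __ceph_remove_cap(cap)
     *                                       ci = cap->ci;     <- NULL
     *                                       ci->vfs_inode...  <- oops
     */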
I checked the downstream 3.10.0-862.14.4 code; it seems that if
trim_caps_cb() runs while the inode is being destroyed, we could hit
this.
Yes, RHEL7 kernels clearly have this race. We can probably pull in
d6e47819721ae2d9 to fix it there.
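If I remember right, d6e47819721ae2d9 ("ceph: hold i_ceph_lock when
removing caps for freeing inode") essentially just takes the lock
around the cap teardown loop in the inode destruction path. Roughly
this (a sketch from memory, not the verbatim patch):

    spin_lock(&ci->i_ceph_lock);
    while ((p = rb_first(&ci->i_caps)) != NULL) {
            struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);

            __ceph_remove_cap(cap, true);
    }
    spin_unlock(&ci->i_ceph_lock);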
Yeah, it is.
All the __ceph_remove_cap() calls are protected by ci->i_ceph_lock,
except when destroying the inode.
And upstream doesn't seem to have this problem now.
The only way the i_ceph_lock helps here is if you don't drop it after
you've found the cap in the inode's rbtree.
The only callers that don't hold the i_ceph_lock continuously are the
ceph_iterate_session_caps callbacks. Those however are iterating over
the session->s_caps list, so they shouldn't have a problem there either.
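To spell that out: the iterator only dereferences cap->ci under
s_cap_lock and pins the inode before dropping the lock, roughly like
this (paraphrasing upstream ceph_iterate_session_caps() from memory,
with error handling and the deferred-removal details elided):

    spin_lock(&session->s_cap_lock);
    p = session->s_caps.next;
    while (p != &session->s_caps) {
            cap = list_entry(p, struct ceph_cap, session_caps);
            /* cap->ci is only cleared under s_cap_lock, so this read
             * is safe; igrab() fails once the inode is already being
             * evicted */
            inode = igrab(&cap->ci->vfs_inode);
            session->s_cap_iterator = cap;  /* holds off cap removal */
            spin_unlock(&session->s_cap_lock);

            ret = cb(inode, cap, arg);      /* callback runs unlocked */
            iput(inode);

            spin_lock(&session->s_cap_lock);
            p = p->next;
            session->s_cap_iterator = NULL;
    }
    spin_unlock(&session->s_cap_lock);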
So I agree -- upstream doesn't have this problem and we can drop this
patch. We'll just have to focus on fixing this in RHEL7 instead.
Longer term, I think we need to consider a substantial overhaul of the
cap handling code. The locking in here is much more complex than is
warranted for what this code does. I'll probably start looking at that
once the dust settles on some other efforts.
Yeah, we should make sure that the ci is not released while we are
still accessing it. From this case it seems that could happen; maybe
iget/iput could help.
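Something like this, perhaps (a hypothetical sketch, untested):

    /* pin the inode before touching ci so that it can't be freed
     * underneath us; igrab() returns NULL once the inode is already
     * being evicted */
    struct inode *inode = igrab(&ci->vfs_inode);

    if (!inode)
            return;     /* inode is on its way out; nothing to do */
    /* ... safe to use ceph_inode(inode) from here ... */
    iput(inode);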
BRs
Thoughts?
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 9d09bb53c1ab..7e39ee8eff60 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1046,11 +1046,22 @@ static void drop_inode_snap_realm(struct ceph_inode_info *ci)
void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
{
struct ceph_mds_session *session = cap->session;
- struct ceph_inode_info *ci = cap->ci;
- struct ceph_mds_client *mdsc =
- ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
+ struct ceph_inode_info *ci;
+ struct ceph_mds_client *mdsc;
int removed = 0;
+ spin_lock(&session->s_cap_lock);
+ ci = cap->ci;
+ if (!ci) {
+ /*
+ * Did we race with a competing __ceph_remove_cap call? If
+ * ci is zeroed out, then just unlock and don't do anything.
+ * Assume that it's on its way out anyway.
+ */
+ spin_unlock(&session->s_cap_lock);
+ return;
+ }
+
dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
/* remove from inode's cap rbtree, and clear auth cap */
@@ -1058,13 +1069,12 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
if (ci->i_auth_cap == cap)
ci->i_auth_cap = NULL;
- /* remove from session list */
- spin_lock(&session->s_cap_lock);
if (session->s_cap_iterator == cap) {
/* not yet, we are iterating over this very cap */
dout("__ceph_remove_cap delaying %p removal from session %p\n",
cap, cap->session);
} else {
+ /* remove from session list */
list_del_init(&cap->session_caps);
session->s_nr_caps--;
cap->session = NULL;
@@ -1072,6 +1082,7 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
}
/* protect backpointer with s_cap_lock: see iterate_session_caps */
cap->ci = NULL;
+ mdsc = ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
/*
* s_cap_reconnect is protected by s_cap_lock. no one changes