Jeff Layton <jlayton@xxxxxxxxxx> writes:

> On Thu, 2020-11-12 at 13:15 -0500, Jeff Layton wrote:
>> Geng Jichao reported a rather complex deadlock involving several
>> moving parts:
>>
>> 1) readahead is issued against an inode and some of its pages are
>>    locked while the read is in flight
>>
>> 2) the same inode is evicted from the cache, and this task gets stuck
>>    waiting for the page lock because of the above readahead
>>
>> 3) another task is processing a reply trace, and looks up the inode
>>    being evicted while holding the s_mutex. That ends up waiting for
>>    the eviction to complete
>>
>> 4) a write reply for an unrelated inode is then processed in the
>>    ceph_con_workfn job. It calls ceph_check_caps after putting
>>    wrbuffer caps, and that gets stuck waiting on the s_mutex held
>>    by 3.
>>
>> The reply to "1" is stuck behind the write reply in "4", so we
>> deadlock at that point.
>>
>> This patch changes the trace processing to call ceph_get_inode
>> outside of the s_mutex and snap_rwsem, which should break the cycle
>> above.
>>
>> URL: https://tracker.ceph.com/issues/47998
>> Reported-by: Geng Jichao <gengjichao@xxxxxx>
>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>> ---
>>  fs/ceph/inode.c      | 13 ++++---------
>>  fs/ceph/mds_client.c | 15 +++++++++++++++
>>  2 files changed, 19 insertions(+), 9 deletions(-)
>>
>
> My belief is that this patch probably won't hurt anything and might
> fix this specific deadlock. I do however wonder if we might have a
> whole class of deadlocks of this nature. I can't quite prove it by
> inspection though, so testing with this would be a good datapoint.

Yeah, I keep staring at this and can't decide either.  But I agree with
you: this may or may not fix the issue, but it shouldn't hurt.

Just a minor comment below:

>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 88bbeb05bd27..9b85d86d8efb 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -1315,15 +1315,10 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>>  	}
>>
>>  	if (rinfo->head->is_target) {
>> -		tvino.ino = le64_to_cpu(rinfo->targeti.in->ino);
>> -		tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
>> -
>> -		in = ceph_get_inode(sb, tvino);
>> -		if (IS_ERR(in)) {
>> -			err = PTR_ERR(in);
>> -			goto done;
>> -		}
>> +		/* Should be filled in by handle_reply */
>> +		BUG_ON(!req->r_target_inode);
>>
>> +		in = req->r_target_inode;
>>  		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
>>  				      NULL, session,
>>  				      (!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
>> @@ -1333,13 +1328,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>>  		if (err < 0) {
>>  			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
>>  			       in, ceph_vinop(in));
>> +			req->r_target_inode = NULL;
>>  			if (in->i_state & I_NEW)
>>  				discard_new_inode(in);
>>  			else
>>  				iput(in);
>>  			goto done;
>>  		}
>> -		req->r_target_inode = in;
>>  		if (in->i_state & I_NEW)
>>  			unlock_new_inode(in);
>>  	}
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index b3c941503f8c..48db1f593d81 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3179,6 +3179,21 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>>  	err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
>>  	mutex_unlock(&mdsc->mutex);
>>
>> +	/* Must find target inode outside of mutexes to avoid deadlocks */
>> +	if (rinfo->head->is_target) {

This condition should probably be

	if ((err >= 0) && rinfo->head->is_target) {
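i.e. something like the below -- completely untested, just a sketch of
the same hunk from your patch with the extra check added:

	/* Must find target inode outside of mutexes to avoid deadlocks */
	if ((err >= 0) && rinfo->head->is_target) {
		struct inode *in;
		struct ceph_vino tvino = { .ino = le64_to_cpu(rinfo->targeti.in->ino),
					   .snap = le64_to_cpu(rinfo->targeti.in->snapid) };

		in = ceph_get_inode(mdsc->fsc->sb, tvino);
		if (IS_ERR(in)) {
			err = PTR_ERR(in);
			/* re-take s_mutex: the out_err path expects it held */
			mutex_lock(&session->s_mutex);
			goto out_err;
		}
		req->r_target_inode = in;
	}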
This would allow a faster exit path, and would avoid masking this 'err'
value with another one from ceph_get_inode().

Cheers,
--
Luis

>> +		struct inode *in;
>> +		struct ceph_vino tvino = { .ino = le64_to_cpu(rinfo->targeti.in->ino),
>> +					   .snap = le64_to_cpu(rinfo->targeti.in->snapid) };
>> +
>> +		in = ceph_get_inode(mdsc->fsc->sb, tvino);
>> +		if (IS_ERR(in)) {
>> +			err = PTR_ERR(in);
>> +			mutex_lock(&session->s_mutex);
>> +			goto out_err;
>> +		}
>> +		req->r_target_inode = in;
>> +	}
>> +
>>  	mutex_lock(&session->s_mutex);
>>  	if (err < 0) {
>>  		pr_err("mdsc_handle_reply got corrupt reply mds%d(tid:%lld)\n", mds, tid);
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>