----- Original Message -----
> From: "heming zhao" <heming.zhao@xxxxxxxx>
> To: "Xiao Ni" <xni@xxxxxxxxxx>, linux-raid@xxxxxxxxxxxxxxx, song@xxxxxxxxxx, "guoqing jiang" <guoqing.jiang@xxxxxxxxxxxxxxx>
> Cc: "lidong zhong" <lidong.zhong@xxxxxxxx>, neilb@xxxxxxx, colyli@xxxxxxx
> Sent: Thursday, November 12, 2020 7:27:46 PM
> Subject: Re: [PATCH v2] md/cluster: fix deadlock when doing reshape job
>
> Hello,
>
> On 11/12/20 1:08 PM, Xiao Ni wrote:
> >
> > On 11/11/2020 11:51 PM, Zhao Heming wrote:
> >> There is a similar deadlock in commit 0ba959774e93
> >> ("md-cluster: use sync way to handle METADATA_UPDATED msg").
> >> The issue my patch fixes is very similar to the one in commit
> >> 0ba959774e93, except for step <c>: in 0ba959774e93 step <c> is
> >> "update sb", in my fix it is "mdadm --remove".
> >>
> >> ... ...
> >> +			!test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
> >> +			msecs_to_jiffies(5000));
> >> +	if (rv)
> >> +		return lock_token(cinfo, mddev_locked);
> >> +	else
> >> +		return -1;
> >>  }
> >
> > Hi Heming
> >
> > Can we handle this problem like metadata_update_start? lock_comm and
> > metadata_update_start are the two users that want to take the token lock.
> > lock_comm can do the same thing metadata_update_start does. If so,
> > process_recvd_msg can call md_reload_sb without waiting. All threads can
> > work well once the initiating node releases the token lock. Resync can
> > send its message and clear MD_CLUSTER_SEND_LOCK, and then lock_comm can
> > go on working. In this way, all threads work successfully without failure.
> >
>
> Currently MD_CLUSTER_SEND_LOCKED_ALREADY is used only for adding a new disk
> (please refer to Documentation/driver-api/md/md-cluster.rst, section
> "5. Adding a new Device").
> During the ADD_NEW_DISK process, md-cluster blocks all other message sending
> until metadata_update_finish calls unlock_comm.
>
> With your idea, md-cluster would allow messages to be sent concurrently. I'm
> not very familiar with all raid1 scenarios, but at least it breaks the rule
> of handling ADD_NEW_DISK.
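[Editor's aside: the idea in the lock_comm hunk quoted above - wait a bounded time for MD_CLUSTER_SEND_LOCK instead of blocking forever, and fail so an error can reach userspace - can be sketched in plain userspace C. Everything below is illustrative: `lock_comm_timeout` and the retry budget standing in for `wait_event_timeout()`'s jiffies are not kernel APIs.]

```c
#include <stdbool.h>

/* Single-threaded sketch of the timeout-based lock_comm: the "clock" is a
 * retry budget instead of jiffies, and the bool stands in for the
 * MD_CLUSTER_SEND_LOCK bit. */
static bool send_lock_bit;

/* Mimics test_and_set_bit(): returns the previous value, then sets the bit. */
static bool test_and_set(bool *bit)
{
	bool old = *bit;
	*bit = true;
	return old;
}

/* Returns 0 when the bit was acquired within `retries` attempts,
 * -1 on timeout -- the error that would finally reach userspace. */
int lock_comm_timeout(int retries)
{
	for (int i = 0; i < retries; i++) {
		if (!test_and_set(&send_lock_bit))
			return 0;	/* acquired: would now call lock_token() */
		/* in the kernel this is where wait_event_timeout() sleeps */
	}
	return -1;			/* give up instead of deadlocking forever */
}

/* Mimics unlock_comm(): clear the bit so waiters can proceed. */
void unlock_comm_sketch(void)
{
	send_lock_bit = false;
}
```

A second caller that arrives while the bit is held exhausts its retry budget and gets -1 back, which is exactly the behavior the patch wants: fail the operation rather than hang the node.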
> It would allow sending other messages during ADD_NEW_DISK.

Hi Heming

It doesn't need to change MD_CLUSTER_SEND_LOCKED_ALREADY. Is it OK to do
something like this?

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf482..f6f576b 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -664,29 +664,12 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
+static int lock_token(struct md_cluster_info *cinfo)
 {
-	int error, set_bit = 0;
+	int error;
 	struct mddev *mddev = cinfo->mddev;
 
-	/*
-	 * If resync thread run after raid1d thread, then process_metadata_update
-	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
-	 * since another node already got EX on Token and waitting the EX of Ack),
-	 * so let resync wake up thread in case flag is set.
-	 */
-	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
-				      &cinfo->state)) {
-		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
-					      &cinfo->state);
-		WARN_ON_ONCE(error);
-		md_wakeup_thread(mddev->thread);
-		set_bit = 1;
-	}
 	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-	if (set_bit)
-		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
-
 	if (error)
 		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
 				__func__, __LINE__, error);
@@ -701,10 +684,30 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
  */
 static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
 {
+	int ret, error, set_bit = 0;
+
+	/*
+	 * If resync thread run after raid1d thread, then process_metadata_update
+	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
+	 * since another node already got EX on Token and waitting the EX of Ack),
+	 * so let resync wake up thread in case flag is set.
+	 */
+	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				      &cinfo->state)) {
+		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+					      &cinfo->state);
+		WARN_ON_ONCE(error);
+		md_wakeup_thread(cinfo->mddev->thread);
+		set_bit = 1;
+	}
+
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
-	return lock_token(cinfo, mddev_locked);
+	ret = lock_token(cinfo);
+	if (ret && set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return ret;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)

> >> static void unlock_comm(struct md_cluster_info *cinfo)
> >> @@ -784,9 +789,11 @@ static int sendmsg(struct md_cluster_info *cinfo,
> >> 		   struct cluster_msg *cmsg,
> >> {
> >> 	int ret;
> >>
> >> -	lock_comm(cinfo, mddev_locked);
> >> ... ...
> >> +	if (mddev_is_clustered(mddev)) {
> >> +		if (md_cluster_ops->remove_disk(mddev, rdev))
> >> +			goto busy;
> >> +	}
> >> 	md_kick_rdev_from_array(rdev);
> >> 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> >
> > This code is not related to the deadlock problem. It's better to file a
> > new patch to fix the return-value checking problem.
> >
>
> In my opinion we should include this code in this patch: to fully fix the
> deadlock, the md layer should return an error to userspace.
>
> But thanks to your comments, I found other potential issues in md-cluster.
> There are still some cluster_ops APIs whose callers don't check the error
> return:
> .resync_finish - may cause other nodes to never clear MD_RESYNCING_REMOTE.
> .resync_info_update - this error can be safely ignored.
> .metadata_update_finish - may leave other nodes' in-kernel md info
> inconsistent with the on-disk metadata.
>
> Maybe I or others will fix them in the future.
>
> > Best Regards
> > Xiao
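[Editor's aside: the return-value point discussed above - propagate a cluster op's failure back to userspace instead of kicking the rdev regardless - can be sketched as follows. `cluster_remove_disk` and `hot_remove_disk_sketch` are hypothetical stand-ins, not the real md functions.]

```c
#include <errno.h>

/* cluster_remove_disk() stands in for md_cluster_ops->remove_disk(); it
 * fails with -EBUSY while another node holds the Token lock. */
static int cluster_remove_disk(int token_busy)
{
	return token_busy ? -EBUSY : 0;
}

/* Sketch of the fixed hot-remove path: on success the disk is "kicked"
 * (standing in for md_kick_rdev_from_array()); on failure -EBUSY is
 * propagated so userspace (mdadm --remove) sees the error instead of the
 * array state silently diverging from the other nodes. */
int hot_remove_disk_sketch(int clustered, int token_busy, int *kicked)
{
	*kicked = 0;
	if (clustered) {
		int err = cluster_remove_disk(token_busy);
		if (err)
			return err;	/* previously this error was ignored */
	}
	*kicked = 1;			/* md_kick_rdev_from_array(rdev) */
	return 0;
}
```

The same check-and-propagate shape would apply to the unchecked .resync_finish and .metadata_update_finish callers listed above.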