Re: [PATCH v2] md/cluster: fix deadlock when doing reshape job

"heming.zhao@xxxxxxxx" <heming.zhao@xxxxxxxx> · Fri, 13 Nov 2020 21:23:39 +0800

On 11/13/20 10:50 AM, Xiao Ni wrote:
> 
> 
> ----- Original Message -----
>> From: "heming zhao" <heming.zhao@xxxxxxxx>
>> To: "Xiao Ni" <xni@xxxxxxxxxx>, linux-raid@xxxxxxxxxxxxxxx, song@xxxxxxxxxx, "guoqing jiang"
>> <guoqing.jiang@xxxxxxxxxxxxxxx>
>> Cc: "lidong zhong" <lidong.zhong@xxxxxxxx>, neilb@xxxxxxx, colyli@xxxxxxx
>> Sent: Thursday, November 12, 2020 7:27:46 PM
>> Subject: Re: [PATCH v2] md/cluster: fix deadlock when doing reshape job
>>
>> Hello,
>>
>> On 11/12/20 1:08 PM, Xiao Ni wrote:
>>>
>>>
>>> On 11/11/2020 11:51 PM, Zhao Heming wrote:
>>>> There is a similar deadlock in commit 0ba959774e93
>>>> ("md-cluster: use sync way to handle METADATA_UPDATED msg")
>>>> My patch fixed issue is very like commit 0ba959774e93, except <c>.
>>>> 0ba959774e93 step <c> is "update sb", my fix is "mdadm --remove"
>>>>
>>>> ... ...
>>>> +               !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
>>>> +               msecs_to_jiffies(5000));
>>>> +    if (rv)
>>>> +        return lock_token(cinfo, mddev_locked);
>>>> +    else
>>>> +        return -1;
>>>>    }
>>> Hi Heming
>>>
>>> Can we handle this problem like metadata_update_start? lock_comm and
>>> metadata_update_start are two users that
>>> want to get token lock. lock_comm can do the same thing as
>>> metadata_update_start does. If so, process_recvd_msg
>>> can call md_reload_sb without waiting. All threads can work well when the
>>> initiated node release token lock. Resync
>>> can send message and clear MD_CLUSTER_SEND_LOCK, then lock_comm can go on
>>> working. In this way, all threads
>>> work successfully without failure.
>>>
>>
>> Currently MD_CLUSTER_SEND_LOCKED_ALREADY only for adding a new disk.
>> (please refer Documentation/driver-api/md/md-cluster.rst section: 5. Adding a
>> new Device")
>> During ADD_NEW_DISK process, md-cluster will block all other msg sending
>> until metadata_update_finish calls
>> unlock_comm.
>>
>> With your idea, md-cluster will allow to concurrently send msg. I'm not very
>> familiar with all raid1 scenarios.
>> But at least, you break the rule of handling ADD_NEW_DISK. it will allow send
>> other msg during ADD_NEW_DISK.
> 
> Hi Heming
> 
> It doesn't need to change MD_CLUSTER_SEND_LOCKED_ALREADY. Is it ok to do something like this:
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 4aaf482..f6f576b 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -664,29 +664,12 @@ static void recv_daemon(struct md_thread *thread)
>    * Takes the lock on the TOKEN lock resource so no other
>    * node can communicate while the operation is underway.
>    */
> -static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
> +static int lock_token(struct md_cluster_info *cinfo)
>   {
> -	int error, set_bit = 0;
> +	int error;
>   	struct mddev *mddev = cinfo->mddev;
>   
> -	/*
> -	 * If resync thread run after raid1d thread, then process_metadata_update
> -	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
> -	 * since another node already got EX on Token and waitting the EX of Ack),
> -	 * so let resync wake up thread in case flag is set.
> -	 */
> -	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> -				      &cinfo->state)) {
> -		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> -					      &cinfo->state);
> -		WARN_ON_ONCE(error);
> -		md_wakeup_thread(mddev->thread);
> -		set_bit = 1;
> -	}
>   	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
> -	if (set_bit)
> -		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
> -
>   	if (error)
>   		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
>   				__func__, __LINE__, error);
> @@ -701,10 +684,30 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
>    */
>   static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
>   {
> +	int ret, set_bit = 0;
> +
> +	/*
> +	 * If resync thread run after raid1d thread, then process_metadata_update
> +	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
> +	 * since another node already got EX on Token and waitting the EX of Ack),
> +	 * so let resync wake up thread in case flag is set.
> +	 */
> +	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> +				      &cinfo->state)) {
> +		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> +					      &cinfo->state);
> +		WARN_ON_ONCE(error);
> +		md_wakeup_thread(mddev->thread);
> +		set_bit = 1;
> +	}
> +
>   	wait_event(cinfo->wait,
>   		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
>   
> -	return lock_token(cinfo, mddev_locked);
> +	ret = lock_token(cinfo, mddev_locked);
> +	if (ret && set_bit)
> +		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
> +	return ret;
>   }
>   
>   static void unlock_comm(struct md_cluster_info *cinfo)
> 

thank you for your comments & idea.
I totally understand your solution, it's better than mine.
I will use this idea to file v3 patch.