Re: [PATCH 2/2] md/cluster: fix deadlock when doing reshape job

"heming.zhao@xxxxxxxx" <heming.zhao@xxxxxxxx> · Tue, 10 Nov 2020 10:24:36 +0800

On 11/10/20 2:06 AM, Song Liu wrote:
> On Sun, Nov 8, 2020 at 6:02 PM heming.zhao@xxxxxxxx
> <heming.zhao@xxxxxxxx> wrote:
>>
>> Please note, I gave two solutions for this bug in cover-letter.
>> This patch uses solution 2. For detail, please check cover-letter.
>>
>> Thank you.
>>
> 
> [...]
> 
>>>
>>> How to fix:
>>>
>>> There are two sides to fix (or break the dead loop):
>>> 1. on sending msg side, modify lock_comm, change it to return
>>>      success/failed.
>>>      This will make mdadm cmd return error when lock_comm is timeout.
>>> 2. on receiving msg side, process_metadata_update need to add error
>>>      handling.
>>>      currently, other msg types won't trigger error or error doesn't need
>>>      to return sender. So only process_metadata_update need to modify.
>>>
>>> Either of 1 & 2 can fix the hanging issue, but I prefer to fix both sides.
>>>
> 
> Similar comments on how to make the commit log easy to understand.
> Besides that, please split the change into two commits, for fix #1 and #2
> respectively.
> 

What I meant is that solution 2 itself has two sub-solutions: fix the sending side or fix the receiving side.
(Strictly, there are three options: sending side only, receiving side only, or both sides.)

Sending side: the functions touched by patch 2 are sendmsg & lock_comm
 (code flow: sendmsg => lock_comm)

Receiving side: the functions touched by patch 2 are process_recvd_msg & process_metadata_update
 (code flow: process_recvd_msg => process_metadata_update)

Breaking the wait on either side breaks the deadlock. In patch 2, my fix covers both sides.
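
To make the two sides concrete, here is a rough user-space sketch of the idea. It is not the kernel code and not the patch itself: every toy_* name, the retry counter standing in for a timeout, and the choice of errno values are assumptions for illustration only. The point is that each side waits for a bounded time and then returns an error to its caller instead of waiting forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock that the other node may be holding. */
struct toy_lock {
	bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	l->held = false;
}

/* Bounded wait: max_tries models a timeout. Returns 0 or a negative errno. */
static int toy_lock_bounded(struct toy_lock *l, int max_tries, int err)
{
	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		if (toy_trylock(l))
			return 0;
	}
	return err;
}

/* Sending side, sub-solution 1: lock_comm reports success or failure. */
static int toy_lock_comm(struct toy_lock *comm, int max_tries)
{
	return toy_lock_bounded(comm, max_tries, -ETIMEDOUT);
}

/* Code flow: sendmsg => lock_comm. The error goes back to the caller. */
static int toy_sendmsg(struct toy_lock *comm, int max_tries)
{
	int ret = toy_lock_comm(comm, max_tries);

	if (ret)
		return ret;	/* give up instead of blocking forever */
	/* ... the message would be sent here ... */
	toy_unlock(comm);
	return 0;
}

/* Receiving side, sub-solution 2: the metadata update can fail. */
static int toy_process_metadata_update(struct toy_lock *reconfig, int max_tries)
{
	int ret = toy_lock_bounded(reconfig, max_tries, -EAGAIN);

	if (ret)
		return ret;
	/* ... the metadata would be reloaded here ... */
	toy_unlock(reconfig);
	return 0;
}

/* Code flow: process_recvd_msg => process_metadata_update. */
static int toy_process_recvd_msg(struct toy_lock *reconfig, int max_tries)
{
	return toy_process_metadata_update(reconfig, max_tries);
}
```

If either toy_sendmsg or toy_process_recvd_msg gives up and returns, the lock it would otherwise keep pinned can be released by its caller, so the other side can make progress. That is why fixing one side is enough and fixing both is the safer choice.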