Re: [PATCH 1/6] md-cluster: Protect communication with mutexes

NeilBrown <neilb@xxxxxxxx> · Fri, 13 Nov 2015 08:59:08 +1100

On Tue, Nov 10 2015, Goldwyn Rodrigues wrote:

> On 11/09/2015 05:31 PM, NeilBrown wrote:
>> On Fri, Nov 06 2015, rgoldwyn@xxxxxxx wrote:
>>
>>> From: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
>>>
>>> Communication can happen through multiple threads. It is possible that
>>> one thread steps over another threads sequence. So, we use mutexes to
>>> protect both the send and receive sequences.
>>>
>>> We use a new flag CLUSTER_MD_COMM_LOCKED which is used to detect
>>> if communication is already locked. This is useful for cases where we need to
>>> take and hold the token DLM lock for a while. This bit is set only
>>> after locking communication.
>>
>> I don't understand what this flag is trying to achieve, but I'm fairly
>> sure it doesn't achieve it.
>
> Lets consider three specific cases of locking communication channels to 
> show the conflict:
>
> 1. resync_info_update(): communication is locked and release for sending 
> a message.
> 2. A regular md_update_sb(): communication is locked in 
> metadata_update_start() and unlocked in metadata_update_finish() after 
> writing to disk. In metadata_update_finish(), the sendmsg is called to 
> send METADATA_UPDATED.
> 3. An add operation which culminates in a md_update_sb(): Here the 
> communication is locked before initiating add. If the add is successful, 
> it results in md_update_sb(). In md_update_sb(), metadata_update_start() 
> has to check if the communication is already locked. If locked, it 
> should not lock again.

Oh, I get it now - thanks.

Going back and looking at the original commit I can now see that it does
say that, but I didn't understand the implication at the time.

I might be wrong again, but I think this approach is broken.
The 'add disk' sequence does:
  1/ ->add_new_disk which takes the lock
  2/ other stuff, protected by the lock
  3/ schedule a metadata update
  4/ metadata update is initiated, which doesn't take the lock
        because the flag is set
  5/ metadata update completes, lock is dropped.

What if some other event causes the metadata update to happen during
'2'?

Maybe if you delayed setting MD_CLUSTER_COMM_LOCKED until after '2'.
i.e. set it when scheduling the update.  So the flag means:
  "lock has been taken for metadata update"

One tricky bit there would be if metadata_update_start() found the bit
wasn't set, and then entered mutex_lock() in lock_comm().
When MD_CLUSTER_COMM_LOCKED gets set we want that code to stop waiting
and start doing useful things.  But it won't.

It might be easiest to make our own 'mutex' with a flag bit and a wait
queue.

Then calls to mutex_lock become
 wait_event(mddev->wait, !test_and_set_bit(bit, &cinfo->state));
mutex_unlock becoems
 clear_bit(bit, &cinfo->state);

and the mutex_lock used from metadata_update_start is
  wait_event(mddev->wait, !test_and_set_bit(bit, &cinfo->state) ||
                          test_and_clear_bit(MD_CLUSTER_COMM_LOCKED,
  ....));

That might work.... providing it is well documented.

(and you are right - mutex_trylock wouldn't have helped, I was
completely misunderstanding).

Thanks,
NeilBrown

>
> The flag is used only for case 3. If the communication is already 
> locked, it should not lock again. This flag is set only after 
> lock_comm() has executed, but is checked for in metadata_update_start(). 
> This should insure that any of the operations 1, 2 or 3 do not interfere 
> with each other.
>
> I am not sure if I have made the best effort to explain this. I had a 
> tough time getting it right (which may or may not be complete).
>
>>
>> Maybe if it was test_and_set_bit in metadata_update_start, it might make
>> sense.  But then I would suggest that clearing the bit be moved to
>> unlock_comm()
>>
>> Do you just want to use mutex_try_lock() to detect if communication is
>> already locked?
>
> Consider a race between md_update_sb() and sendmsg() [Case 1. and 2]
>
> mutex_try_lock() may not work in this situation because lock_comm() 
> could have been called by sendmsg(), which will release it as soon as 
> the message is sent. In the meantime (while the lock is locked), if a 
> metadata_update_sb() operation executes. It will not relock the 
> communication. This will result in the WARN_ON in unlock_comm() since 
> sendmsg() sequence had already unlocked it.
>
>
>>
>>>
>>> Also, Remove stray/ununsed sb_mutex.
>>
>> I already removed that it mainline - I should have mentioned.
>>
>> Thanks,
>> NeilBrown
>>
>>
>>>
>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
>>> ---
>>>   drivers/md/md-cluster.c | 26 +++++++++++++++++++++-----
>>>   1 file changed, 21 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>>> index b73729b..a93734e 100644
>>> --- a/drivers/md/md-cluster.c
>>> +++ b/drivers/md/md-cluster.c
>>> @@ -47,6 +47,7 @@ struct resync_info {
>>>   #define		MD_CLUSTER_WAITING_FOR_NEWDISK		1
>>>   #define		MD_CLUSTER_SUSPEND_READ_BALANCING	2
>>>   #define		MD_CLUSTER_BEGIN_JOIN_CLUSTER		3
>>> +#define		MD_CLUSTER_COMM_LOCKED			4
>>>
>>>
>>>   struct md_cluster_info {
>>> @@ -54,7 +55,8 @@ struct md_cluster_info {
>>>   	dlm_lockspace_t *lockspace;
>>>   	int slot_number;
>>>   	struct completion completion;
>>> -	struct mutex sb_mutex;
>>> +	struct mutex recv_mutex;
>>> +	struct mutex send_mutex;
>>>   	struct dlm_lock_resource *bitmap_lockres;
>>>   	struct dlm_lock_resource *resync_lockres;
>>>   	struct list_head suspend_list;
>>> @@ -503,6 +505,7 @@ static void recv_daemon(struct md_thread *thread)
>>>   	struct cluster_msg msg;
>>>   	int ret;
>>>
>>> +	mutex_lock(&cinfo->recv_mutex);
>>>   	/*get CR on Message*/
>>>   	if (dlm_lock_sync(message_lockres, DLM_LOCK_CR)) {
>>>   		pr_err("md/raid1:failed to get CR on MESSAGE\n");
>>> @@ -529,6 +532,7 @@ static void recv_daemon(struct md_thread *thread)
>>>   	ret = dlm_unlock_sync(message_lockres);
>>>   	if (unlikely(ret != 0))
>>>   		pr_info("unlock msg failed return %d\n", ret);
>>> +	mutex_unlock(&cinfo->recv_mutex);
>>>   }
>>>
>>>   /* lock_comm()
>>> @@ -542,20 +546,22 @@ static int lock_comm(struct md_cluster_info *cinfo)
>>>   {
>>>   	int error;
>>>
>>> -	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
>>> -		return 0;
>>> +	mutex_lock(&cinfo->send_mutex);
>>>
>>>   	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
>>>   	if (error)
>>>   		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
>>>   				__func__, __LINE__, error);
>>> +	mutex_lock(&cinfo->recv_mutex);
>>>   	return error;
>>>   }
>>>
>>>   static void unlock_comm(struct md_cluster_info *cinfo)
>>>   {
>>>   	WARN_ON(cinfo->token_lockres->mode != DLM_LOCK_EX);
>>> +	mutex_unlock(&cinfo->recv_mutex);
>>>   	dlm_unlock_sync(cinfo->token_lockres);
>>> +	mutex_unlock(&cinfo->send_mutex);
>>>   }
>>>
>>>   /* __sendmsg()
>>> @@ -709,7 +715,8 @@ static int join(struct mddev *mddev, int nodes)
>>>   	init_completion(&cinfo->completion);
>>>   	set_bit(MD_CLUSTER_BEGIN_JOIN_CLUSTER, &cinfo->state);
>>>
>>> -	mutex_init(&cinfo->sb_mutex);
>>> +	mutex_init(&cinfo->send_mutex);
>>> +	mutex_init(&cinfo->recv_mutex);
>>>   	mddev->cluster_info = cinfo;
>>>
>>>   	memset(str, 0, 64);
>>> @@ -839,7 +846,12 @@ static int slot_number(struct mddev *mddev)
>>>
>>>   static int metadata_update_start(struct mddev *mddev)
>>>   {
>>> -	return lock_comm(mddev->cluster_info);
>>> +	struct md_cluster_info *cinfo = mddev->cluster_info;
>>> +	int ret;
>>> +	if (test_and_clear_bit(MD_CLUSTER_COMM_LOCKED, &cinfo->state))
>>> +		return 0;
>>> +	ret = lock_comm(cinfo);
>>> +	return ret;
>>>   }
>>>
>>>   static int metadata_update_finish(struct mddev *mddev)
>>> @@ -864,6 +876,7 @@ static int metadata_update_finish(struct mddev *mddev)
>>>   		ret = __sendmsg(cinfo, &cmsg);
>>>   	} else
>>>   		pr_warn("md-cluster: No good device id found to send\n");
>>> +	clear_bit(MD_CLUSTER_COMM_LOCKED, &cinfo->state);
>>>   	unlock_comm(cinfo);
>>>   	return ret;
>>>   }
>>> @@ -871,6 +884,7 @@ static int metadata_update_finish(struct mddev *mddev)
>>>   static void metadata_update_cancel(struct mddev *mddev)
>>>   {
>>>   	struct md_cluster_info *cinfo = mddev->cluster_info;
>>> +	clear_bit(MD_CLUSTER_COMM_LOCKED, &cinfo->state);
>>>   	unlock_comm(cinfo);
>>>   }
>>>
>>> @@ -945,6 +959,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
>>>   	memcpy(cmsg.uuid, uuid, 16);
>>>   	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
>>>   	lock_comm(cinfo);
>>> +	set_bit(MD_CLUSTER_COMM_LOCKED, &cinfo->state);
>>>   	ret = __sendmsg(cinfo, &cmsg);
>>>   	if (ret)
>>>   		return ret;
>>> @@ -964,6 +979,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
>>>   static void add_new_disk_cancel(struct mddev *mddev)
>>>   {
>>>   	struct md_cluster_info *cinfo = mddev->cluster_info;
>>> +	clear_bit(MD_CLUSTER_COMM_LOCKED, &cinfo->state);
>>>   	unlock_comm(cinfo);
>>>   }
>>>
>>> --
>>> 1.8.5.6
>
> -- 
> Goldwyn
Attachment:
signature.asc

Description: PGP signature