Re: [PATCH 14/53] FIX: Cannot exit monitor after takeover

Neil Brown <neilb@xxxxxxx> · Wed, 1 Dec 2010 09:06:50 +1100

On Tue, 30 Nov 2010 16:03:16 +0000 "Kwolek, Adam" <adam.kwolek@xxxxxxxxx>
wrote:

> The problem is that when a raid0 array is about to be unfrozen and it is the single/last array in the container,
> a ping to this container prevents mdmon from exiting.
> In this situation managemon receives the message and, in the ping case of handle_message(), calls wakeup_monitor()
> and then goes into a loop waiting for monitor_loop_cnt to be updated:
> 1. this occurs only after a timeout
> 2. when this happens managemon stops on pselect() and, as there is nothing to monitor, it never wakes up.
> 3. monitor waits to be allowed to exit on open handlers.
> 
> How can this be resolved:
> 1. do not ping for the last raid0 array during unfreezing (I've reworked the patch to meet this condition)
> 2. guard waiting for monitor_loop_cnt change in handle_message() with:
> 	if (container->arrays)
> 
> 3. change in manage member condition:
> 	if (sigterm)
> 		wakeup_monitor();
> 
> To
> 	if (sigterm || (container->arrays == NULL))
> 		wakeup_monitor();
> 
> This causes additional monitor wakeup.
> 
> Any of these methods causes mdmon to exit as expected.
> In cases 2 and 3 it takes a while (we are waiting on communication timeouts).
> Method 1 is fast and we are not blocking mdmon exit by communication.
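
For reference, the hang and the guard from option 2 can be modelled roughly as
below. This is only a sketch under stated assumptions, not mdmon's code:
struct container, monitor_pass(), handle_ping() and the max_polls bound are
hypothetical names made up for illustration, and the real wait has no such
bound, which is why it never returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not mdmon's real definitions. */
struct active_array {
	struct active_array *next;
};

struct container {
	struct active_array *arrays;	/* NULL once the last array is gone */
	int monitor_loop_cnt;		/* advanced by the monitor on each pass */
};

/* Model of the monitor: it only makes progress while arrays exist. */
static void monitor_pass(struct container *c)
{
	if (c->arrays)
		c->monitor_loop_cnt++;
}

/*
 * Model of the ping case.  Returns the number of polls spent waiting,
 * or -1 if max_polls expired without the counter moving (the hang).
 * With 'guarded' set, the wait is skipped when there are no arrays.
 */
static int handle_ping(struct container *c, int guarded, int max_polls)
{
	int target = c->monitor_loop_cnt + 1;
	int polls = 0;

	assert(max_polls >= 0);

	if (guarded && c->arrays == NULL)
		return 0;	/* option 2: nothing to wait for */

	while (c->monitor_loop_cnt - target < 0) {
		if (polls++ >= max_polls)
			return -1;	/* counter never changed */
		monitor_pass(c);
	}
	return polls;
}
```

With an empty container, the unguarded call runs out of polls and returns -1,
while the guarded call returns at once. With at least one array, the counter
moves and the wait ends normally in both variants.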

Thanks for the explanation!
I definitely want to fix the managemon/monitor interaction so that it doesn't
hang as you describe.  I might end up with something a lot more heavy-weight
than the changes you suggest.

It might still be OK to include your option '1' as well - I'll decide when you
post the patch.

thanks,
NeilBrown
