On 28 Aug 2014, at 3:09 am, Ferenc Wagner <wferi@xxxxxxx> wrote:

> Andrew Beekhof <andrew@xxxxxxxxxxx> writes:
>
>> On 27 Aug 2014, at 3:40 am, Ferenc Wagner <wferi@xxxxxxx> wrote:
>>
>>> However, it got restarted seamlessly, without the node being fenced,
>>> so I did not even notice this until now. Should this have resulted
>>> in the node being fenced?
>>
>> Depends how fast the node can respawn.
>
> You mean how fast crmd can respawn? How much time does it have to
> respawn to avoid being fenced?

Until a new node can be elected DC, invoke the policy engine, and start fencing.

>>> crmd: [13794]: ERROR: verify_stopped: Resource vm-web5 was active at shutdown. You may ignore this error if it is unmanaged.
>>
>> In maintenance mode, everything is unmanaged. So that would be expected.
>
> Is maintenance mode the same as unmanaging all resources? I think the
> latter does not cancel the monitor operations here...

Right. One cancels monitor operations too.

>>>> The discovery usually happens at the point the cluster is started
>>>> on a node.
>>>
>>> A local discovery did happen, but it could not find anything, as the
>>> cluster was started by the init scripts, well before any resource
>>> could have been moved to the freshly rebooted node (manually, to
>>> free the next node for rebooting).
>>
>> That's your problem then: you've started resources outside of the
>> control of the cluster.
>
> Some of them, yes, and moved the rest between the nodes. All this
> circumventing the cluster.
>
>> Two options... recurring monitor actions with role=Stopped would
>> have caught this
>
> Even in maintenance mode? Wouldn't they have been cancelled just like
> the ordinary recurring monitor actions?

Good point. Perhaps they wouldn't.

> I guess adding them would run a recurring monitor operation for every
> resource on every node, only with different expectations, right?
>
>> or you can run crm_resource --cleanup after you've moved resources
>> around.
>
> I actually ran some crm resource cleanups for a couple of resources,
> and those really were not started on exiting maintenance mode.
>
>>>> Maintenance mode just prevents the cluster from doing anything
>>>> about it.
>>>
>>> Fine. So I should have restarted Pacemaker on each node before
>>> leaving maintenance mode, right? Or is there a better way?
>>
>> See above
>
> So crm_resource -r whatever -C is the way, for each resource
> separately. Is there no way to do this for all resources at once?

I think you can just drop the -r.
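For example (untested here; vm-web5 is just the resource name from your log, and option spellings may differ slightly between Pacemaker versions):

    # clean up the operation history of one resource
    crm_resource --resource vm-web5 --cleanup

    # or drop -r/--resource to clean up every resource at once
    crm_resource --cleanup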
>>> You say in the above thread that resource definitions can be changed:
>>> http://thread.gmane.org/gmane.linux.highavailability.user/39121/focus=39437
>>> Let me quote from there (starting with the words of Ulrich Windl):
>>>
>>>>>>> I think it's a common misconception that you can modify cluster
>>>>>>> resources while in maintenance mode:
>>>>>>
>>>>>> No, you _should_ be able to. If that's not the case, it's a bug.
>>>>>
>>>>> So the end of maintenance mode starts with a "re-probe"?
>>>>
>>>> No, but it doesn't need to.
>>>> The policy engine already knows if the resource definitions changed,
>>>> and the recurring monitor ops will find out if any are not running.
>>>
>>> My experiences show that you may not *move around* resources while
>>> in maintenance mode.
>>
>> Correct
>>
>>> That would indeed require a cluster-wide re-probe, which does not
>>> seem to happen (unless forced some way). Probably there was some
>>> misunderstanding in the above discussion; I guess Ulrich meant
>>> moving resources when he wrote "modifying cluster resources". Does
>>> this make sense?
>>
>> No, I'm reasonably sure he meant changing their definitions in the
>> cib. Or at least that's what I thought he meant at the time.
>
> Nobody could blame you for that, because that's what it means. But
> then he inquired about a "re-probe", which fits the problem of
> changing the status of resources rather than their definition.
> Actually, I was so firmly stuck in this mindset that at first I
> wanted to ask you to reconsider; your response felt so much out of
> place. That's all history now...
>
> After all this, I suggest clarifying this issue in the fine manual.
> I've read it a couple of times and still got the wrong impression.

Which specific section do you suggest?
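For reference, the two approaches mentioned above would look roughly like this in crm shell syntax. The VirtualDomain agent and its config path are only an illustration of what vm-web5 might be, the role=Stopped monitor must use an interval different from the regular monitor's (Pacemaker keys operations by name and interval), and the availability of --reprobe is an assumption about your crm_resource version:

    # option 1: a second recurring monitor that also checks the resource
    # stays stopped on the nodes where it is not supposed to run
    crm configure primitive vm-web5 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm-web5.xml" \
        op monitor interval="30s" \
        op monitor interval="45s" role="Stopped"

    # option 2: after moving resources behind the cluster's back,
    # force a cluster-wide re-probe before leaving maintenance mode
    crm_resource --reprobe
    crm configure property maintenance-mode=false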