Re: [PATCH] votequorum: add API to clear the wait_for_all status

Jan Friesse <jfriesse@xxxxxxxxxx> · Mon, 11 Aug 2014 10:17:40 +0200

Christine Caulfield wrote:
On 11/08/14 08:59, Jan Friesse wrote:
Chrissie,

On 11/08/14 07:29, Jan Friesse wrote:
Chrissie,
patch looks generally good, but is there a reason to add new library
call instead of tracking "quorum.wait_for_all" and if set to 0, execute
code very similar to
message_handler_req_lib_votequorum_cancel_wait_for_all?

Yes. The point is not to clear wait_for_all itself, that's a
configuration option and we are not changing it - just the runtime wait
state. The config option needs to remain enabled for the next time nodes

Yes. But the user can call corosync-cmapctl to change this variable. We
don't need to (or want to) change it via reload. A very similar thing
is happening with expected votes. Take a look at
ed63c812afc15fc68ebd3363845a63f5c945623e (this was actually the
inspiration for what I'm suggesting). wait_for_all is exactly the same.
Allow natural selection. Dynamic change of "config" (but not stored to
the config file).

No, it's not changing the config - even dynamically. It's changing a
state inside corosync, not even a dynamic configuration parameter.

Sure

wait_for_all_status is NOT the same thing as quorum.wait_for_all - not
even slightly. wait_for_all needs to remain set after this command
(whatever it turns out to be) for the next time a node goes down, we do
not want to have to wait for a reload for that to happen.
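
To make the distinction concrete, here is a toy model (Python, not corosync code). The class and method names are hypothetical, and the assumption that the runtime state is re-armed from the config option on startup is my reading of this thread, not something taken from the votequorum source.

```python
from dataclasses import dataclass


@dataclass
class ToyVotequorum:
    """Toy model only; the two fields mirror the two keys discussed above."""

    wait_for_all: bool = True          # config option (quorum.wait_for_all)
    wait_for_all_status: bool = False  # runtime wait state

    def start(self) -> None:
        # Assumption: startup arms the runtime state from the config option.
        self.wait_for_all_status = self.wait_for_all

    def all_nodes_seen(self) -> None:
        # Every node has been visible at the same time: the wait is over.
        self.wait_for_all_status = False

    def cancel_wait_for_all(self) -> None:
        # The operation under discussion: clear the runtime state only,
        # leaving the config option untouched.
        self.wait_for_all_status = False
```

In this model, calling cancel_wait_for_all() and then start() again arms the wait state once more, because the config option never changed.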

It will. cmap is NOT stored back into the config, so wait_for_all WILL
remain set after this command for the next time a node goes down (you
are talking about the local node, right?). No reload needs to happen.

If this is to be done using cmapctl (and I'm happy for that to be the
case) then altering runtime.votequorum.wait_for_all_status is the thing
to do.

You will then get pretty ugly recursion (you are calling 
update_wait_for_all_status).

That's why I believe it's just much cleaner to track wait_for_all.
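
A toy illustration of the recursion concern (Python, not corosync code). ToyMap is a hypothetical stand-in for a key/value store with change notifications, and the assumption that update_wait_for_all_status writes the status key back into that store is an interpretation of the remark above, not a statement about the real function.

```python
from collections import defaultdict
from typing import Callable, Dict, List

STATUS_KEY = "runtime.votequorum.wait_for_all_status"
CONFIG_KEY = "quorum.wait_for_all"


class ToyMap:
    """Hypothetical key/value store that notifies trackers on every write."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}
        self.trackers: Dict[str, List[Callable[[str, int], None]]] = defaultdict(list)

    def track(self, key: str, fn: Callable[[str, int], None]) -> None:
        self.trackers[key].append(fn)

    def set(self, key: str, value: int) -> None:
        self.values[key] = value
        for fn in list(self.trackers[key]):
            fn(key, value)


def track_status_key() -> int:
    """Track the status key itself; return how often the handler runs."""
    m = ToyMap()
    calls = 0
    busy = False

    def update_wait_for_all_status(value: int) -> None:
        m.set(STATUS_KEY, value)  # assumed: the update writes the key back

    def on_change(key: str, value: int) -> None:
        nonlocal calls, busy
        calls += 1
        if busy:  # re-entrancy guard; without it the handler re-triggers itself
            return
        busy = True
        try:
            update_wait_for_all_status(value)
        finally:
            busy = False

    m.track(STATUS_KEY, on_change)
    m.set(STATUS_KEY, 0)  # an external write, e.g. from a CLI tool
    return calls


def track_config_key() -> int:
    """Track the config key instead; the handler is never re-entered."""
    m = ToyMap()
    calls = 0

    def on_change(key: str, value: int) -> None:
        nonlocal calls
        calls += 1
        if value == 0:
            m.set(STATUS_KEY, 0)  # a different key, so no notification loop

    m.track(CONFIG_KEY, on_change)
    m.set(CONFIG_KEY, 0)
    return calls
```

In the first variant the handler is entered twice for a single external write and needs a guard to stop; in the second it runs exactly once.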

Regards,
  Honza

Chrissie

Regards,
   Honza

are rebooted. This call is meant to be a temporary fix to a particular
node-outage, not a reconfiguration of the cluster.

If there were a key to watch, it would be
runtime.votequorum.wait_for_all_status - I'll investigate how
practical that is. At the time I was wary of
watching/changing runtime.* keys from userspace.

But if we decide to go with a library call, a few things must be
fixed:
- The version can be 7.1.0. We are adding a call, not changing an
existing one (so it's backwards compatible).
- We have to have support in cfgtool/quorumtool/... Keep in mind that
the main user (pcs) is not calling the corosync API directly; it uses
the CLI tools.

Ugh, I didn't realise that. Thanks

Chrissie

- There should be a check that wait_for_all is really activated.

All these things would be solved by tracking "quorum.wait_for_all" for
free.
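
The third item in the list above (checking that wait_for_all is really activated) could look roughly like this. It is a sketch in Python rather than corosync C, and the Result names and the function signature are hypothetical.

```python
from enum import Enum
from typing import Tuple


class Result(Enum):
    OK = "ok"
    NOT_ENABLED = "wait_for_all is not enabled"
    NOT_WAITING = "not currently waiting for all nodes"


def cancel_wait_for_all(config_enabled: bool, waiting: bool) -> Tuple[Result, bool]:
    """Return (result, new runtime wait state) for a cancel request."""
    if not config_enabled:
        # Nothing to cancel: the option was never turned on.
        return Result.NOT_ENABLED, waiting
    if not waiting:
        # The option is on, but all nodes have already been seen.
        return Result.NOT_WAITING, waiting
    # Enabled and still waiting: clear only the runtime state.
    return Result.OK, False
```

Whichever mechanism delivers the request (library call or tracked key), rejecting it when there is nothing to cancel keeps the operation from silently succeeding on a node that was never waiting.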

Regards,
   Honza

Christine Caulfield wrote:
It's possible in a two_node cluster (and others, but it's more likely
with just two) that a node could be booted up after downtime or
failure and the other node is not available for some reason. In this
case it would not be allowed to proceed because wait_for_all is
enforced.

This patch provides an API call to clear this flag in the desperate
situation where that becomes necessary. It should only be used with
extreme caution and will be wrapped up in pcs which should also check
that fencing has been run.
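
The scenario described above can be sketched as a simplified quorum check (Python, not the votequorum implementation). The function and parameter names are hypothetical, and the model ignores everything except vote counts, two_node and the runtime wait state.

```python
def is_quorate(votes_seen: int, expected_votes: int,
               two_node: bool, waiting_for_all: bool) -> bool:
    """Simplified quorum decision for illustration only."""
    if waiting_for_all:
        # While the wait state is set, the node may not become quorate,
        # however many votes it can see.
        return False
    # two_node lowers the quorum to 1; otherwise use a simple majority.
    quorum = 1 if two_node else expected_votes // 2 + 1
    return votes_seen >= quorum
```

A node that boots alone in a two_node cluster sees one vote and stays inquorate while the wait state is set; once that state is cleared, the same single vote is enough.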

Signed-Off-By: Christine Caulfield <ccaulfie@xxxxxxxxxx>

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
