On 21 April 2015 at 04:21, Dennis <dennisr@xxxxxxxx> wrote:
OK, thanks. One of the motivations for asking these questions is that we are investigating ways to implement automated node removal from a VIP pool. We would like the VIP management software (currently a dumb load balancer) to be able to query the health of a particular node directly and, if that node reports back to the VIP manager that it is lagging too much, have the VIP manager take the node out of its pool of backend servers.
The challenge there is that the lagging node may have changes it has not replicated to its peers yet.
If you remove it from the BDR group without letting its downstreams replay up to its current write position, those changes will be "cut away" from the BDR group. They'll still be on the node you removed, but will never be replayed to the rest of the systems.
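To make that concrete, here is a rough Python sketch of the check a removal script would need to pass first: compare the write position of the node being removed against the position each downstream has replayed to. The function names and the sample LSNs are hypothetical and not part of BDR; how you obtain the positions is left out here.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def unreplayed_bytes(node_write_lsn, peer_replay_lsns):
    """Bytes of the node's changes that each peer has not replayed yet.

    node_write_lsn   -- current write position of the node being removed
    peer_replay_lsns -- {peer name: position that peer has replayed up to}
    """
    write_pos = parse_lsn(node_write_lsn)
    return {
        peer: max(0, write_pos - parse_lsn(lsn))
        for peer, lsn in peer_replay_lsns.items()
    }


def safe_to_remove(node_write_lsn, peer_replay_lsns):
    """True only if every known peer has replayed everything the node wrote.

    An empty peer map is treated as unsafe, since nothing is known.
    """
    pending = unreplayed_bytes(node_write_lsn, peer_replay_lsns)
    return bool(pending) and all(n == 0 for n in pending.values())


if __name__ == "__main__":
    peers = {"node_b": "0/3000", "node_c": "0/2000"}  # sample values only
    print(unreplayed_bytes("0/3000", peers))
    print(safe_to_remove("0/3000", peers))
```

The result only means something once the node has stopped accepting writes; otherwise its write position keeps moving while you check.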
Your strategy is very reasonable for nodes where you only do reads. It's only an issue when every node is an active master accepting writes that all nodes must see.
One possible way to mitigate this would be adding support for synchronous_standby_names = 'all' in PostgreSQL and allowing a node to be switched into sync-write mode, where nothing commits locally until synced to all peers. It would thus be safe to remove the node at any time, even if it's badly lagging behind on its replay from upstream peers. (This would be significant feature development that is not currently targeted for BDR's roadmap.)
Another, which is a current development target, is to force a node that's being removed into read-only mode and flush its replication queues before removing it. The read-only mode would preferably only restrict replicated changes, so you could still use TEMPORARY and UNLOGGED tables, etc, thus making it useful for enforcing read-only nodes in horizontal read-scaling use cases. There is no ETA on this planned feature yet.
Currently it appears I will have to query the other nodes in the cluster to determine the replication healthiness status of a particular node, and figure out a way to send that status back to the VIP manager in a way it can act on it.
If you're in a design where all nodes are write masters, yes, that is correct.
Any suggestions on how to accomplish that would be appreciated.
Just make direct libpq connections to each node from the monitoring host. You should generally be doing that anyway for your node health monitoring.
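As a rough sketch of what that could look like from the monitoring host, the Python below connects to each node, asks it how far its downstreams are behind, and then works out a status for one node from what its peers report. Several things here are assumptions rather than anything BDR guarantees: the query uses the PostgreSQL 9.4 function and column names (later releases renamed them), psycopg2 is just one libpq-based driver you could use, and the code pretends that application_name in pg_stat_replication is the peer's node name, which you would need to map yourself. The function names and thresholds are made up for illustration.

```python
# Assumed query, PostgreSQL 9.4 naming. Adjust for your server version.
LAG_QUERY = """
SELECT application_name,
       pg_xlog_location_diff(pg_current_xlog_insert_location(),
                             flush_location) AS lag_bytes
FROM pg_stat_replication
"""


def fetch_lag(dsn):
    """Return {downstream name: lag in bytes, or None if not reported}."""
    import psycopg2  # third-party libpq driver, imported only when used

    conn = psycopg2.connect(dsn)  # put connect_timeout=... in the DSN itself
    try:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            return {
                name: (None if lag is None else int(lag))
                for name, lag in cur.fetchall()
            }
    finally:
        conn.close()


def collect(dsns, fetch=fetch_lag):
    """Query every node; a node that cannot be queried is recorded as None."""
    reports = {}
    for node, dsn in dsns.items():
        try:
            reports[node] = fetch(dsn)
        except Exception:
            reports[node] = None
    return reports


def node_status(node, reports, max_lag_bytes):
    """Judge one node from what the other reachable nodes report about it."""
    if reports.get(node) is None:
        return "unreachable"
    worst = 0
    for upstream, lags in reports.items():
        if upstream == node or lags is None:
            continue
        lag = lags.get(node)
        if lag is None:
            return "unknown"  # this upstream does not report a position for it
        worst = max(worst, lag)
    return "lagging" if worst > max_lag_bytes else "ok"
```

The VIP manager would then act on node_status() for each backend. Passing fetch as a parameter is only there so the decision logic can be exercised without a live cluster.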
If you can't make inbound connections, do it on a push model, e.g. nsca-ng and Icinga's passive mode. This is something that's routinely done for clients and works well.
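For the push model, a small script on each node can turn its measured lag into a passive check result and hand it to the NSCA client. The sketch below only builds the line; it assumes the client's usual tab-separated input of host, service, return code and plugin output, and it uses the standard Nagios/Icinga return codes. The host name, service name and thresholds are hypothetical.

```python
# Standard Nagios/Icinga plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def lag_to_state(lag_bytes, warn_bytes, crit_bytes):
    """Map a replication lag in bytes to a check state (None means no data)."""
    if lag_bytes is None:
        return UNKNOWN
    if lag_bytes >= crit_bytes:
        return CRITICAL
    if lag_bytes >= warn_bytes:
        return WARNING
    return OK


def passive_result(host, service, state, output):
    """Build one passive check line: host TAB service TAB code TAB output."""
    # Tabs and newlines inside the message would break the line format.
    clean = output.replace("\t", " ").replace("\n", " ")
    return "{}\t{}\t{}\t{}\n".format(host, service, state, clean)


if __name__ == "__main__":
    import sys

    lag = 1500000  # sample value; in practice, measured on the node
    state = lag_to_state(lag, warn_bytes=1000000, crit_bytes=10000000)
    # Write the line to stdout so it can be piped into the NSCA client.
    sys.stdout.write(
        passive_result("db1", "bdr_lag", state, "lag {} bytes".format(lag))
    )
```

The monitoring server then holds the state, and the VIP manager can read it from there instead of connecting to the nodes.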