Re: Postgres PAF setup

"Jehan-Guillaume (ioguix) de Rorthais" <ioguix@xxxxxxx> · Tue, 24 Apr 2018 17:08:55 +0200

On Mon, 23 Apr 2018 18:09:43 +0000
Andrew Edenburn <andrew.edenburn@xxxxxx> wrote:

> I am having issues with my PAF setup.  I am new to Postgres and have setup
> the cluster as seen below. I am getting this error when trying to start my
> cluster resources.
> [...]
> 
> cleanup and clear is not fixing any issues and I am not seeing anything in
> the logs.  Any help would be greatly appreciated.

This lacks a lot of information.

According to the PAF resource agent, your instances are in an "unexpected
state" on both nodes while PAF was actually trying to stop them.

Pacemaker might decide to stop a resource if the start operation fails.
Stopping it after a failed start gives the resource agent a chance to stop the
resource gracefully, if that is still possible.

I suspect you have some setup mistake on both nodes, maybe the exact same one...

You should probably provide your full logs from pacemaker/corosync with timing
information so we can check all the messages coming from PAF from the very
beginning of the startup attempt.

>         have-watchdog=false \

You should probably consider setting up a watchdog in your cluster.

>         stonith-enabled=false \

This is really bad. Your cluster will NOT work as expected. PAF **requires**
STONITH to be enabled and working properly. Without it, sooner or later, you
will experience some unexpected reaction from the cluster (freezing all
actions, etc).

>         no-quorum-policy=ignore \

You should not ignore quorum, even in a two-node cluster. See the "two_node"
parameter in the corosync.conf manual.
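
As a minimal sketch, and assuming corosync 2.x with the votequorum provider,
the quorum section of corosync.conf for a two-node cluster could look like the
fragment below. The rest of your corosync.conf is not shown and I am not
assuming anything about it:

```text
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

Note that "two_node: 1" enables "wait_for_all" by default, see votequorum(5).
With this in place, there should be no need to set no-quorum-policy=ignore.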

>         migration-threshold=1 \
> rsc_defaults rsc_defaults-options: \
>         migration-threshold=5 \

The latter is the supported way to set migration-threshold. Your
"migration-threshold=1" should not be a cluster property but a default
resource option.
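
In other words, a sketch of the relevant part of your configuration, in the
same syntax as the output you pasted, would drop the line from the properties
and keep only the resource default. The "[...]" stands for the other options
you already have, which I am leaving untouched here:

```text
property cib-bootstrap-options: \
        [...]
rsc_defaults rsc_defaults-options: \
        migration-threshold=5 \
        [...]
```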

> My pcs Config
> Corosync Nodes:
> dcmilphlum223 dcmilphlum224
> Pacemaker Nodes:
> dcmilphlum223 dcmilphlum224
> 
> Resources:
> Master: pgsql-ha
>   Meta Attrs: notify=true target-role=Stopped

This target-role might have been set by the cluster because it cannot fence
nodes (which might be easier to deal with in your situation, btw). That means
the cluster will keep this resource down because of previous errors.

> recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk

You should probably not put your recovery.conf.pcmk in your PGDATA. This file
differs from one node to the other. As you might want to rebuild the standby or
the old master after a failure, and a rebuild copies the PGDATA from the other
node, you would have to correct it each time. Keep it outside of the PGDATA to
avoid this useless step.

> dcmilphlum224: pgsqld-data-status=LATEST

I suppose this comes from the "pgsql" resource agent, definitely not from PAF...

Regards,