Dear Jerry,
So, yes, it turns out that some kind of loop must be involved here, as
you described:
1. ensure cluster is running
2. stop primary
3. wait some time
4. stop replicas
5. check whether the checkpoint locations match; repeat from step 1 if
out of sync.
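
Something like this minimal shell sketch is what I have in mind (the
data directory path, host names, and wait time are placeholders, and I
assume passwordless ssh and the same PGDATA on every node):

    #!/bin/sh
    # Sketch only -- PGDATA, hosts and the sleep interval are assumptions.
    PGDATA=/var/lib/postgresql/data
    REPLICAS="replica1 replica2"

    while :; do
        # 1. ensure cluster is running ("|| true" tolerates already-running)
        pg_ctl -D "$PGDATA" -w start || true
        for r in $REPLICAS; do
            ssh "$r" "pg_ctl -D $PGDATA -w start || true"
        done

        # 2. stop primary; a clean stop writes the shutdown checkpoint
        pg_ctl -D "$PGDATA" -m fast -w stop

        # 3. wait some time for the replicas to replay it (the shaky part)
        sleep 30

        # 4. stop replicas
        for r in $REPLICAS; do
            ssh "$r" "pg_ctl -D $PGDATA -m fast -w stop"
        done

        # 5. compare "Latest checkpoint location" across all nodes
        want=$(pg_controldata "$PGDATA" | grep 'Latest checkpoint location')
        ok=yes
        for r in $REPLICAS; do
            got=$(ssh "$r" "pg_controldata $PGDATA" |
                  grep 'Latest checkpoint location')
            [ "$got" = "$want" ] || ok=no
        done
        [ "$ok" = yes ] && break    # all in sync; otherwise repeat
    done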
My question is that the unreliable step here is the 3rd one. Can we
query the replica at runtime to see whether it has caught up? I mean,
after stopping the primary we can obtain the checkpoint location from
pg_controldata; can we then somehow query the running replica about
that?
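
Roughly, I imagine something like this, assuming pg_last_wal_replay_lsn()
is the right thing to compare against (host name and path are
placeholders):

    # On the stopped primary: note the latest checkpoint location.
    pg_controldata /var/lib/postgresql/data |
        grep 'Latest checkpoint location'
    #   Latest checkpoint location:           0/3000060   (example)

    # On the still-running replica: how far has replay progressed?
    psql -h replica1 -Atc 'SELECT pg_last_wal_replay_lsn();'

    # If this LSN has reached (or passed) the checkpoint location above,
    # does that mean the replica has replayed the shutdown record?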
Thanks in advance,
Richard
On 2025-02-20 08:49, Jerry Sievers wrote:
richard@xxxxxxxxx writes:
Dear Alvaro,
Thanks for your answers. Unfortunately, I was unaware of the shutdown
record; that makes a difference then. So, I definitely must stop the
primary first, then use pg_controldata to obtain the checkpoint
info. Then, can I query the replicas while they are up and running to
see whether they've received the shutdown record or not? That is, after
shutting down the primary, how will I know that a replica has received
the mentioned record and is safe to shut down?
Hmmm, not sure about that, but what we do is stop the primary, wait a
$short time, then stop the replicas...
Then run pg_controldata on all nodes, filter out only the line
indicating the latest checkpoint, and sort -u the output. Expect only a
single line if all match.
You may also wish to first ensure that you got the same number of
lines as the total node count before doing the sorting and uniquing.
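In shell terms, roughly this (node names and the data directory are
site-specific):

    # Collect the latest-checkpoint line from every stopped node.
    for n in primary replica1 replica2; do
        ssh "$n" pg_controldata /var/lib/postgresql/data
    done | grep 'Latest checkpoint location' > /tmp/ckpt

    # Sanity check: one line per node (3 nodes here).
    [ "$(wc -l < /tmp/ckpt)" -eq 3 ] || echo 'some node gave no output'

    # Expect exactly one distinct line if all nodes match.
    sort -u /tmp/ckpt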
Very rarely, on our huge systems, we'd have a mismatch after the
verification; in those cases, our automated upgrade procedure restarts
all nodes and then does the shutdown-and-verify check again.
HTH
Thanks for the clarifications.
Best regards,
Richard
On 2025-02-19 16:54, Álvaro Herrera wrote:
On 2025-Feb-19, richard@xxxxxxxxx wrote:
With this I have a question: after the shutdown of the primary, what is
the guarantee that the replicas have the same checkpoint location? Why
does the order of shutting down the servers matter? What would be the
really exact and reliable way to ensure that the replicas end up with
the same checkpoint location as the primary?
The replicas can't write WAL by themselves, but they will replay
whatever the primary has sent; by shutting down the primary first and
letting the replicas catch up, you ensure that the replicas will
actually receive the shutdown record and replay it. If you shut down
the replicas first, they can obviously never catch up with the
shutdown checkpoint of the primary.
As I recall, if you do shut down the primary first, one potential
danger is that the primary fails to send the checkpoint record before
shutting down, so the replicas won't receive it and obviously will not
replay it; or simply that they are behind enough that they receive it
but don't replay it.
You could use pg_controldata to read the last checkpoint info from all
nodes. You can run it on the primary after shutting it down, and then
on each replica while it's still running to ensure that the correct
restartpoint has been created.
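For instance, something like this (assuming the same data directory on
every node):

    # On the primary, after it has been shut down:
    pg_controldata /var/lib/postgresql/data |
        grep 'Latest checkpoint location'

    # On each replica, while it is still running; if I remember
    # correctly, its "Latest checkpoint location" reflects the last
    # restartpoint, so once it matches the primary's value the shutdown
    # checkpoint has been replayed.
    ssh replica1 pg_controldata /var/lib/postgresql/data |
        grep 'Latest checkpoint location'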