Re: How to Qualifying or quantify risk of loss in asynchronous replication

Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> · Wed, 16 Mar 2016 12:14:05 +1300

On Wed, Mar 16, 2016 at 6:26 AM, otheus uibk <otheus.uibk@xxxxxxxxx> wrote:
> I've been working with PG 9.1.8 for two years now, mainly asynchronous
> replication. Recently, an IT admin of another group contested that the PG's
> asynchronous replication can result in loss of data in a 1-node failure.
> After re-readinG the documentation, I cannot determine to what extent this
> is true.

It is true.  If the primary server is destroyed by a meteor, it is
entirely possible for recently written WAL records to be lost, because
they haven't even been sent to an asynchronous standby node yet, let
alone written.

> Back in 2008, Robert Haas made this post
> http://postgresql.nabble.com/Sync-Rep-First-Thoughts-on-Code-tp1998339p1998433.html
> in which he delineates between different levels of replication. 1-safe is
> guaranteed with PG WALs. Other possibilities include group-safe, both group
> safe and 1-safe, 2-safe.
>
> How do we qualify PG when WALs are written (and archived) on the master, and
> streaming replication to a hot standby, albeit asynchronous, is used? Is it
> Group-safe?
>
> My understanding is "Strictly speaking, No".

No.  There is no guarantee that any other node knows about your transaction.

> But what precisely is the algorithm and timing involved with streaming WALs?
>
> Is it:
>   * client issues COMMIT
>   * master receives commit
>   * master processes transaction internally
>   * master creates WAL record
>   | master appends WAL to local WAL file, flushes disk
>   | master sends WAL record to all streaming clients
>   * master sends OK to client
>   * master applies WAL
>
> So is this correct? Is it correct to say: PG async guarantees that the WAL
> is *sent* to the receivers, but not that they are received, before the
> client receives acknowledgement?

Async replication doesn't guarantee anything at all about receivers,
or even that there is one right at this moment.  Did you mean to write
"synchronous" instead of "asynchronous"?  In asynchronous replication,
the primary writes to the WAL and flushes the disk.  Then, for any
standbys that happen to be connected, a WAL sender process trundles
along behind feeding new WAL doesn the socket as soon as it can, but
it can be running arbitrarily far behind or not running at all (the
network could be down or saturated, the standby could be temporarily
down or up but not reading the stream fast enough, etc etc).

> Can we make a case stronger than that? Assuming the T0 is round trip time
> between master and client, and T1 is round trip time between master and
> slave, as long as T1 <= T0, and provided both Slave and Master do not fail,
> the system is Group-safe?

You might use that kind of thinking to reason about the probability
that a transaction has reached the standby in async mode by the time
your client gets a commit response back, but it's not any kind of
useful guarantee, and in any case only applies if your standby is
currently connected and keeping up.  That's why we have synchronous
replication.

If you turn on synchronous replication (using the
synchronous_standby_names GUC which takes a list of standby
'application' names, or * for any) then you can stop COMMIT from
returning until one standby from that list has written out the WAL
record.  (In future we will probably support more than one).  That
works using the same approach as asynchronous replication, except that
there is a wait inserted into the committing transaction: it waits for
the current synchronous standby to report back that it has processed
the commit record.  There are two levels: synchronous_commit =
remote_write, meaning that the chosen standby has written the WAL but
not necessarily flushed it to disk yet (I think this may be called
"1-safe and group safe" in the terminology you referenced: it's
flushed locally AND betting that the other machine(s) won't (all)
crash), and synchronous_commit = on, meaning that the chosen standby
has written it actually flushed it to disk (with fsync, fdatasync etc,
"2-safe").  The former might be faster, but could lose writes that are
in the OS page cache on the standby if power is lost before those
pages eventually hit the disk, so "on" is probably what most people
mean when they talk about synchronous replication.  Asynchronous
replication doesn't wait for anything except the local disk (so it is
"1-safe").

Waiting for the transaction to be durably stored (flushed to disk) on
two servers before COMMIT returns means that you can avoid this
situation:

1.  You commit a transaction, and COMMIT returns as soon as the WAL is
flushed to disk on the primary.
2.  You communicate a fact based on that transaction to a third party
("Thank you Dr Bowman, you are booked in seat A4, your reservation
number is JUPITER123").
3.  Your primary computer is destroyed by a meteor, and its WAL sender
hadn't yet got around to sending that transaction to the standby.
4.  You recover using the standby.
5.  The transaction has been forgotten ("I'm sorry Dave, I'm afraid we
have no record of booking JUPITER123, and the rocket is full.  The
next rocket leaves in 7 years, would you like to book a seat?").

If you enable synchronous replication, and you are careful to recover
in step 4 using the correct standby, then you can't lose a transaction
that you reported to external systems *after* (because) COMMIT
returned.  If your primary is destroyed after you executed COMMIT, but
*before* it returned, it is possible that the current synchronous
standby's WAL contains the transaction or doesn't contain the
transaction, but not for you to have taken any external action based
on the commit having returned, because it didn't.  (If your primary
crashes and restarts before COMMIT returns, and it had got as far as
flushing locally but not yet heard from the standby, then things may
be slightly more complicated).

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general