On Wednesday, March 16, 2016, Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> wrote:
> In asynchronous replication, the primary writes to the WAL and flushes it to disk. Then, for any standbys that happen to be connected, a WAL sender process trundles along behind feeding new WAL down the socket as soon as it can, but it can be running arbitrarily far behind or not running at all (the network could be down or saturated, the standby could be temporarily down or up but not reading the stream fast enough, etc etc).
Thanks for your help on finding the code. To be more precise, in the 9.1.8 code, I see this:
1. [backend] WAL is flushed to disk
2. [backend] WAL-senders are sent SIGUSR1 to wake up
3. [backend] waits for responses from any SyncRep receivers; effectively skipped if there are none
[wal-sender] wakes up
4. [backend] end-of-xact cycle
[wal-sender] reads WAL (XLogRead) up to MAX_SEND_SIZE (or less) bytes
5. [backend] (?) is an ACK sent to the client at this point?
[wal-sender] sends chunk to WAL-receiver using the pq_putmessage_noblock call
6. [wal-sender] repeats reading-sending loop
So if the WAL record is bigger than whatever MAX_SEND_SIZE is (in my source, I see 8k * 16 = 128 kB, so roughly 1 Mbit), the WAL sender may end up sleeping (between iterations of 5 and 6) before the whole record has gone out.
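To make the chunking arithmetic concrete, here is a small sketch of the read/send loop as I understand it from the steps above. It is a toy model in Python, not PostgreSQL code: the function name send_chunks and the use of plain byte offsets in place of real WAL positions are assumptions for illustration, and it ignores any boundary alignment the real sender may do. Only the 8k * 16 figure comes from the source quoted above.

```python
# Toy model of the WAL sender's read/send loop; NOT PostgreSQL code.
XLOG_BLCKSZ = 8 * 1024             # 8 kB, as quoted from the source above
MAX_SEND_SIZE = XLOG_BLCKSZ * 16   # 128 kB per loop iteration


def send_chunks(sent_upto, flushed_upto, max_send=MAX_SEND_SIZE):
    """Yield one (start, end) byte range per iteration of the loop.

    sent_upto and flushed_upto are hypothetical byte offsets standing in
    for "how far the sender has got" and "how far the WAL is flushed".
    """
    while sent_upto < flushed_upto:
        end = min(sent_upto + max_send, flushed_upto)
        yield (sent_upto, end)
        sent_upto = end


# 300 kB of flushed-but-unsent WAL needs three iterations.
chunks = list(send_chunks(0, 300 * 1024))
print(len(chunks), [end - start for start, end in chunks])
# prints: 3 [131072, 131072, 45056]
```

Anything the sender may do between iterations (sleeping, waiting on the socket) happens with the later chunks still unsent, which is the window I am asking about.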
On Wed, Mar 16, 2016 at 10:21 AM, otheus uibk <otheus.uibk@xxxxxxxxx> wrote:
Section 25.2.5. "The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled."
Section 25.2.6 "If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover."
Both these statements, then, from the documentation perspective, are incorrect, at least to a pedant. For 25.2.5, the primary streams WAL records to the standby after they've been flushed to disk, but without waiting for the file to be filled. For 25.2.6 it's less clear: it is possible for some transactions to have been *written* to the local WAL and reported as committed but not yet *sent* to the standby server.
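The distinction between *written* and *sent* can be sketched with a toy model. This is Python for illustration only, not PostgreSQL code: the class Primary, its fields, and its methods are all hypothetical names, and positions in the WAL are modelled as plain byte counts. It assumes only what is described in this thread, namely that under asynchronous replication the commit path flushes locally and does not wait for the sender.

```python
# Toy model only; every name here is hypothetical, not a PostgreSQL API.
from dataclasses import dataclass


@dataclass
class Primary:
    flushed_lsn: int = 0   # bytes of WAL flushed to the local disk
    sent_lsn: int = 0      # bytes of WAL the sender has put on the socket

    def commit(self, nbytes):
        """Commit under async replication: flush locally, then acknowledge."""
        self.flushed_lsn += nbytes
        return "COMMIT"    # returned without looking at sent_lsn

    def sender_step(self, max_send):
        """One sender iteration, running independently of commit()."""
        self.sent_lsn = min(self.sent_lsn + max_send, self.flushed_lsn)

    def at_risk(self):
        """Bytes reported as committed but not yet sent to the standby."""
        return self.flushed_lsn - self.sent_lsn


p = Primary()
p.commit(200_000)          # client is told "COMMIT" here
print(p.at_risk())         # everything is still unsent at this point
p.sender_step(128 * 1024)
print(p.at_risk())         # part of the commit is still unsent
```

In this model, a crash of the primary at any moment when at_risk() is non-zero loses WAL that the client was already told was committed, which is the case the wording of 25.2.6 leaves unclear.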
Somehow, the documentation misled me into believing that the async replication algorithm at least guarantees WAL records are *sent* before responding "committed" to the client. I now know this is not the case. *grumble*.
How can I help make the documentation clearer on this point?