streaming replication: one problem & several questions

Lonni J Friedman <netllama@xxxxxxxxx> · Wed, 10 Aug 2011 15:19:28 -0700

Greetings,
I've got three Linux systems (each with Fedora15-x86_64 running
PostgreSQL-9.0.4).  I'm attempting to get a basic streaming
replication setup going with one master & two standby servers.  At
this point, the replication portion appears to be working.  I can run
an 'update' statement on the master, and view the result with a
'SELECT' on both standby servers.

I've mostly been working off the information presented here:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
http://wiki.postgresql.org/wiki/Streaming_Replication

plus the official Postgresql website documentation, with copious
googling for other sources when something didn't make sense.

First the problem.  On *only* one of the two standby servers, I'm
seeing errors like the following whenever I issue any SQL commands on
the master which write (insert, update, etc) to the database:
LOG:  invalid record length at 8/7A000020
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 8/7A0000B0
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 8/7B000020
FATAL:  terminating walreceiver process due to administrator command
LOG:  record with zero length at 8/7B0000B0
LOG:  streaming replication successfully connected to primary
LOG:  record with incorrect prev-link 8/79000058 at 8/7D0000B0
LOG:  streaming replication successfully connected to primary

The thing that makes this even more confusing is that the data seems
to remain synchronized on both standby servers even with that error,
so I can't tell if I'm merely missing the implication of the errors,
of if they're somehow harmless?  Maybe its able to resume streaming
replication only because i have wal_keep_segments=128 ?  I googled a
bit on this, and found a few other references to these errors,
including this recent one which suggested that too much network
latency might be the problem:
http://permalink.gmane.org/gmane.comp.db.postgresql.general/153445

I should note that the standby that is exhibiting this problem is
running inside of a virtual machine, while the standby without the
problem is running on real HW.  Whenever I get to the point where I
want to push the entire setup into production, it will all be running
on real HW.

Now a few unrelated questions:
0) I've successfully setup the WAL archiving on the master, and set
archive_timeout=61.  However, what I'm seeing is that new files are
not getting generated every 61 seconds, but instead only when some
kind of SQL is invoked which writes to the database, or every 305
seconds (whichever comes first).  I know this is the case because I
wrote a script which is being invoked via the archive_command option,
and that script is logging both the timestamp and filenames each time
it is invoked.  Can I debug this?
1) Both of the wiki links above comment that the restore_command may
not be necessary if wal_keep_segments is large enough (mine is set to
128).  I was going to setup the restore_command anyway, as I'm not yet
confident enough about streaming replication and failover with
postgresql to take chances, although the fact that i have two standby
servers makes this setup a bit more complex.  However, can anyone
comment about whether its ever truly safe 100% of the time to run
without a restore_command ?

thanks!

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general