Re: Trouble with replication

Jeff Janes <jeff.janes@xxxxxxxxx> · Thu, 6 Jun 2013 09:51:58 -0700

On Wed, Jun 5, 2013 at 1:39 PM, David Greco <David_Greco@xxxxxxxxxxxxxxx> wrote:

I’ve set up two 9.2.4 servers to serve as master and slave in a streaming replication scenario. I started with a fresh database on the master, set up the replication, then imported about 30GB of data using pg_restore. The master and slave are geographically separated, so replication of this amount of data can/should take hours. I saw from pg_last_xlog_receive_location and pg_last_xlog_replay_location that the slave began to receive the replication information, but it eventually quit with the following errors in the log:

2013-06-05 16:28:43.198 EDT,,,19978,,51af9f7a.4e0a,2,,2013-06-05 16:28:42 EDT,,0,FATAL,XX000,"could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000000000022 has already been removed

",,,,,,,,,""
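
For reference, the segment name in that message can be split into its parts. This is a minimal Python sketch, assuming the usual 24-hex-digit WAL file naming (8 digits of timeline, 8 of log file number, 8 of segment number); the function name parse_wal_segment_name is made up for illustration and is not part of PostgreSQL.

```python
import re

def parse_wal_segment_name(name):
    """Split a 24-hex-digit WAL segment file name into its three fields.

    Assumes the conventional layout: timeline, log file number and
    segment number, 8 hex digits each.
    """
    if not re.fullmatch(r"[0-9A-Fa-f]{24}", name):
        raise ValueError("not a WAL segment file name: %r" % name)
    return {
        "timeline": int(name[0:8], 16),
        "log": int(name[8:16], 16),
        "segment": int(name[16:24], 16),
    }

if __name__ == "__main__":
    # The segment named in the error message above.
    print(parse_wal_segment_name("000000010000000000000022"))
```

For the file in the error this gives timeline 1, log file 0, segment 0x22, which makes it easier to tell whether a file sitting in pg_xlog or in the archive is older or newer than the one being requested.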

What are the messages before and after this?

Checking the master, I see that file has in fact been removed from the pg_xlog directory. The master has archive_command set up to ship the WAL files to the slave, and the slave is set up with a restore_command to read them from that directory.

Are you sure that these are set up correctly?  What happens if you comment out primary_conninfo, so that the archive directory is the only way to deliver the files?

In fact, that WAL segment exists in the slave’s pg_xlog directory as well.

But is the existing file identical to the one on the master (and the one in the archive directory)?  It is probably a recycled file that has not yet been overwritten with received contents.  That is, it has the contents of some past log file, but the name of some future one.
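
One way to check is to compare checksums of the two copies. Below is a minimal Python sketch, assuming both files have been made readable from one machine (for example by copying the archived segment next to the slave's copy); the helper names and the paths in the usage note are hypothetical.

```python
import hashlib

def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_contents(path_a, path_b):
    """True if the two files have byte-identical contents."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Usage would look something like same_contents("/path/to/pg_xlog/000000010000000000000022", "/path/to/archive/000000010000000000000022"), with the paths replaced by your own. If it returns False, the file in the slave's pg_xlog is most likely a recycled segment rather than the real one.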

Now, from what I can tell, the master archived this WAL file out of its xlog directory (based on the wal_keep_segments setting). Then why did the slave not pick it up from the directory that it was archived to? It is my understanding that the log shipping via archive_command from master to slave is precisely there to prevent this scenario. What am I doing wrong? Below are some of the pertinent settings.

In my hands, this is what happens: after losing contact with the primary, the standby pulls files from the archive until it runs out of those, then tries to reconnect to the primary.

Cheers,

Jeff