Streaming replication with sync slave, but disconnects due to missing WAL segments

Mads.Tandrup@xxxxxxxxxxxxxxxxxxxxxx · Tue, 4 Jun 2013 15:25:47 +0200

Hi all

I have a question about sync streaming replication.

I have 2 PostgreSQL 9.1 servers set up with streaming replication. On the
master node the slave is configured as a synchronous standby. I've verified
that pg_stat_replication shows sync_state = sync for the slave node.

It all seems to work fine, but I have noticed that sometimes, when I restore
backups created by pg_dump, the slave node disconnects with the following
messages in the PostgreSQL log:
2013-06-03 13:13:48 GMT 4271  FATAL:  could not receive data from WAL
stream: SSL connection has been closed unexpectedly
2013-06-03 13:13:53 GMT 4270  LOG:  invalid magic number 0000 in log file
15, segment 65, offset 11665408
2013-06-03 13:13:54 GMT 36428  LOG:  streaming replication successfully
connected to primary
2013-06-03 13:13:54 GMT 36428  FATAL:  could not receive data from WAL
stream: FATAL:  requested WAL segment 000000010000000F00000041 has already
been removed
2013-06-03 13:13:58 GMT 36458  LOG:  streaming replication successfully
connected to primary
2013-06-03 13:13:58 GMT 36458  FATAL:  could not receive data from WAL
stream: FATAL:  requested WAL segment 000000010000000F00000041 has already
been removed
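
As far as I can tell, the segment named in the FATAL lines is the same one
as "log file 15, segment 65" in the LOG line above. A minimal sketch of the
mapping, assuming timeline 1 and the 9.1 file-name layout of three 8-digit
hex fields (timeline, log, segment); the helper name wal_segment_name is
only for illustration:

```python
def wal_segment_name(timeline: int, log: int, seg: int) -> str:
    """Build a WAL segment file name from its three numeric parts.

    Assumes the pre-9.3 layout: timeline, log file and segment number,
    each printed as 8 upper-case hex digits.
    """
    return f"{timeline:08X}{log:08X}{seg:08X}"


# Values taken from the slave log: "log file 15, segment 65", timeline 1 assumed.
print(wal_segment_name(1, 15, 65))  # 000000010000000F00000041
```

So the slave stopped partway through that segment (offset 11665408) and then
asked the master for the same segment again.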

On the master I get this in the log file in the same timespan:
2013-06-03 13:13:47 GMT 1471  LOG:  checkpoints are occurring too
frequently (2 seconds apart)
2013-06-03 13:13:47 GMT 1471  HINT:  Consider increasing the configuration
parameter "checkpoint_segments".
2013-06-03 13:13:48 GMT 6189 [unknown] FATAL:  requested WAL segment
000000010000000F00000041 has already been removed
2013-06-03 13:13:48 GMT 6189 [unknown] LOG:  disconnection: session time:
77:37:37.684 user=root database= host=10.216.80.38 port=56114
2013-06-03 13:13:49 GMT 1471  LOG:  checkpoints are occurring too
frequently (2 seconds apart)
2013-06-03 13:13:49 GMT 1471  HINT:  Consider increasing the configuration
parameter "checkpoint_segments".
2013-06-03 13:13:51 GMT 1471  LOG:  checkpoints are occurring too
frequently (2 seconds apart)
2013-06-03 13:13:51 GMT 1471  HINT:  Consider increasing the configuration
parameter "checkpoint_segments".
2013-06-03 13:13:51 GMT 1468  LOG:  received SIGHUP, reloading
configuration files
2013-06-03 13:13:51 GMT 1468  LOG:  parameter "synchronous_standby_names"
removed from configuration file, reset to default
2013-06-03 13:13:53 GMT 1471  LOG:  checkpoints are occurring too
frequently (2 seconds apart)
2013-06-03 13:13:53 GMT 1471  HINT:  Consider increasing the configuration
parameter "checkpoint_segments".
2013-06-03 13:13:53 GMT 44063 [unknown] LOG:  connection received:
host=10.216.80.38 port=34038
2013-06-03 13:13:54 GMT 44063 [unknown] LOG:  replication connection
authorized: user=root
2013-06-03 13:13:54 GMT 44063 [unknown] FATAL:  requested WAL segment
000000010000000F00000041 has already been removed
2013-06-03 13:13:54 GMT 44063 [unknown] LOG:  disconnection: session time:
0:00:00.090 user=root database= host=10.216.80.38 port=34038

What I don't understand is how the slave node can miss a WAL segment when
it is configured as a synchronous standby.

Shouldn't synchronous replication prevent the master from continuing if the
slave is not able to receive WAL segments fast enough?

I have only noticed it while restoring a database. But the general load on
the DB has not been that high, so I'm not sure if it can occur with other
workloads.

Best regards,
Mads

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)