streaming replication timeout error

高健 <luckyjackgao@xxxxxxxxx> · Wed, 9 Oct 2013 10:58:37 +0800

Hello:

My customer encountered some connection timeout, while using one primary-one standby streaming replication.

The original log is japanese, because there are no error-code like oracle's ora-xxx,
I tried to translate the japanese information into English, But that might be not correct English for PG.

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory

I was asked about:
In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.

I found: 
Senario I:

If the wal file wanted is removed manually：

Both in primary and standby, log will be like this:
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 000000010000000000000011 has already been removed

Senario II:

If replication_timeout=60, wal_receiver_status_interval=0,
log will be like this:

On primary side：
LOG:  terminating walsender process due to replication timeout

On standby side：
FATAL: could not receive data from WAL stream: 
LOG:  streaming replication successfully connected to primary

But they are not the same to my customer's case.
And I also noticed that my customer's recovery.conf:

In it it has primary_conninfo as this:
host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1 

That is to say:
      keepalives_idle=5 
      keepalives_interval=1
      keepalives_count=3

With that setting, I think that For every 5 seconds,
In order to see whether the connection is alive, 3 keepalives package will be sent,
And if there is no repsonse in 1 seconds, then he connection will be invalid.

If my unserstanding is right, 
when master is busy at network,
the three keepalives_  parameter will also cause the connection time out error.

But I haven't found a  good explanation fitting the logs' FATAL error.
Can anybody give me some info?

2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG:  Recovery replay at point 1F1/45EF8E20
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:40:42.673062+09
2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG:  restartpoint starting: time
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG:  restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s, total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG:  Recovery replay at point 1F1/45EF8E20
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:45:44.77683+09
2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG:  restartpoint starting: time
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG:  restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s; sync files=98, longest=0.002 s, average=0.000 s
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG:  Recovery replay at point 1F1/46CE8EA0
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:50:36.564966+09
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL:  Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG:  Log file 497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or directory
2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG:  Streaming replication connected to primary server successfully
2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG:  restartpoint starting: time
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG:  restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s, total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG:  Recovery replay at point 1F1/48E9D238
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL:  Last transaction log time 2013-09-22 09:56:29.015514+09

Thanks in advance

jian gao