Hello:
My customer encountered some connection timeout, while using one primary-one standby streaming replication.
The original log is japanese, because there are no error-code like oracle's ora-xxx,
I tried to translate the japanese information into English, But that might be not correct English for PG.
The most important part is:
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
I was asked about:
In what occasion will the above fatal error occur?
I looked into the postgresql.conf file for the primary and standby server.
And made some experiences.
I found:
Senario I:
If the wal file wanted is removed manually:
Both in primary and standby, log will be like this:
FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000011 has already been removed
Senario II:
If replication_timeout=60, wal_receiver_status_interval=0,
log will be like this:
On primary side:
LOG: terminating walsender process due to replication timeout
On standby side:
FATAL: could not receive data from WAL stream:
LOG: streaming replication successfully connected to primary
But they are not the same to my customer's case.
And I also noticed that my customer's recovery.conf:
In it it has primary_conninfo as this:
host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1
That is to say:
keepalives_idle=5
keepalives_interval=1
keepalives_count=3
With that setting, I think that For every 5 seconds,
In order to see whether the connection is alive, 3 keepalives package will be sent,
And if there is no repsonse in 1 seconds, then he connection will be invalid.
If my unserstanding is right,
when master is busy at network,
the three keepalives_ parameter will also cause the connection time out error.
But I haven't found a good explanation fitting the logs' FATAL error.
Can anybody give me some info?
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG: Recovery replay at point 1F1/45EF8E20
2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL: Last transaction log time 2013-09-22 09:40:42.673062+09
2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG: restartpoint starting: time
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG: restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s, total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG: Recovery replay at point 1F1/45EF8E20
2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL: Last transaction log time 2013-09-22 09:45:44.77683+09
2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG: restartpoint starting: time
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG: restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s; sync files=98, longest=0.002 s, average=0.000 s
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG: Recovery replay at point 1F1/46CE8EA0
2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL: Last transaction log time 2013-09-22 09:50:36.564966+09
2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG: Log file 497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory
scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or directory
2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG: Streaming replication connected to primary server successfully
2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG: restartpoint starting: time
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG: restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s, total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG: Recovery replay at point 1F1/48E9D238
2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL: Last transaction log time 2013-09-22 09:56:29.015514+09
Thanks in advance
jian gao