On 3/11/20 2:54 AM, Nicola Contu wrote:
These are the lines before
2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1]
db=cmdv3,user=zabbix_check ERROR: recovery is in progress
2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1]
db=cmdv3,user=zabbix_check HINT: WAL control functions cannot be
executed during recovery.
2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1]
db=cmdv3,user=zabbix_check STATEMENT: select
greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from
pg_stat_replication where client_addr ='10.150.20.22'
That query is made by Zabbix, so I stopped the Zabbix agent and tested
again. It is still failing, now because of this:
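I guess the check itself could be made replica-safe, something like this
(untested sketch; the CASE keeps the WAL function from running while the
server is in recovery):

```sql
-- Sketch of a replica-safe variant of the Zabbix lag check.
-- pg_current_wal_lsn()/pg_wal_lsn_diff() raise "recovery is in progress"
-- on a standby, so only evaluate them when the server is a primary.
SELECT CASE
         WHEN pg_is_in_recovery() THEN 0
         ELSE greatest(0, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn))
       END
FROM pg_stat_replication
WHERE client_addr = '10.150.20.22';
```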
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_51199"
pg_basebackup: could not receive data from WAL stream: SSL SYSCALL
error: EOF detected
^C4699810/504983062 kB (70%), 0/1 tablespace
(...ql11/data/base/16401/231363544.2)
So you started over with a pg_basebackup?
Also from below:
2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
db=[unknown],user=replicator LOG: terminating walsender process due to
replication timeout
Where are the master and standby in relation to each other, network-wise?
Are there intervening firewalls or network latency issues?
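Also check wal_sender_timeout on the sending server. "terminating
walsender process due to replication timeout" means the client did not
respond within that window (default 60 s), which a slow or congested
inter-DC link can easily cause. You could try raising it, e.g. (value
below is only illustrative, not a recommendation for your hardware):

```
# postgresql.conf on the upstream (sending) server
wal_sender_timeout = '5min'   # default is 60s; raise if the link is slow
```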
Here is the full log, starting right before the last try:
2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG:
restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL file(s)
added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 s,
total=270.036 s; sync files=804, longest=0.001 s, average=0.000 s;
distance=131239 kB, estimate=725998 kB
2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG: recovery
restart point at 643A/D8C05F70
2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:22:44.050084+00.
2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG:
restartpoint starting: time
2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG:
restartpoint complete: wrote 17069 buffers (0.2%); 0 WAL file(s)
added, 0 removed, 17 recycled; write=269.879 s, sync=0.006 s,
total=269.902 s; sync files=811, longest=0.001 s, average=0.000 s;
distance=120469 kB, estimate=665445 kB
2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG: recovery
restart point at 643A/E01AB438
2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:27:43.945485+00.
2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG:
restartpoint starting: force wait
2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1]
db=cmdv3,user=nis LOG: duration: 1402.004 ms statement: SELECT id,
name, parent_id, parent, short_name, sales_rep_id FROM mmx_clients;
2020-03-11 09:29:34 GMT [10.222.8.2(47834)] [50961]: [2-1]
db=cmdv3,user=nis LOG: duration: 9493.259 ms statement: SELECT slid,
gnid, sof_id, client_id, product FROM mmx_slids;
2020-03-11 09:32:14 GMT [] [12598]: [4516-1] db=,user= LOG:
restartpoint complete: wrote 71260 buffers (0.8%); 0 WAL file(s)
added, 0 removed, 13 recycled; write=269.953 s, sync=0.012 s,
total=269.979 s; sync files=760, longest=0.002 s, average=0.000 s;
distance=123412 kB, estimate=611242 kB
2020-03-11 09:32:14 GMT [] [12598]: [4517-1] db=,user= LOG: recovery
restart point at 643A/E7A30498
2020-03-11 09:32:14 GMT [] [12598]: [4518-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:32:13.916101+00.
2020-03-11 09:32:44 GMT [] [12598]: [4519-1] db=,user= LOG:
restartpoint starting: time
2020-03-11 09:37:14 GMT [] [12598]: [4520-1] db=,user= LOG:
restartpoint complete: wrote 27130 buffers (0.3%); 0 WAL file(s)
added, 0 removed, 12 recycled; write=270.026 s, sync=0.007 s,
total=270.052 s; sync files=814, longest=0.001 s, average=0.000 s;
distance=280595 kB, estimate=578177 kB
2020-03-11 09:37:14 GMT [] [12598]: [4521-1] db=,user= LOG: recovery
restart point at 643A/F8C351C8
2020-03-11 09:37:14 GMT [] [12598]: [4522-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:37:14.067443+00.
2020-03-11 09:37:44 GMT [] [12598]: [4523-1] db=,user= LOG:
restartpoint starting: time
2020-03-11 09:42:14 GMT [] [12598]: [4524-1] db=,user= LOG:
restartpoint complete: wrote 26040 buffers (0.3%); 0 WAL file(s)
added, 0 removed, 9 recycled; write=269.850 s, sync=0.019 s,
total=269.886 s; sync files=834, longest=0.002 s, average=0.000 s;
distance=236392 kB, estimate=543999 kB
2020-03-11 09:42:14 GMT [] [12598]: [4525-1] db=,user= LOG: recovery
restart point at 643B/730F3F8
2020-03-11 09:42:14 GMT [] [12598]: [4526-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:42:13.900088+00.
2020-03-11 09:42:44 GMT [] [12598]: [4527-1] db=,user= LOG:
restartpoint starting: time
2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
db=[unknown],user=replicator LOG: terminating walsender process due to
replication timeout
2020-03-11 09:47:14 GMT [] [12598]: [4528-1] db=,user= LOG:
restartpoint complete: wrote 20966 buffers (0.2%); 0 WAL file(s)
added, 0 removed, 9 recycled; write=270.048 s, sync=0.014 s,
total=270.085 s; sync files=852, longest=0.001 s, average=0.000 s;
distance=183749 kB, estimate=507974 kB
2020-03-11 09:47:14 GMT [] [12598]: [4529-1] db=,user= LOG: recovery
restart point at 643B/12680A80
2020-03-11 09:47:14 GMT [] [12598]: [4530-1] db=,user= DETAIL: Last
completed transaction was at log time 2020-03-11 09:47:14.069731+00.
2020-03-11 09:47:44 GMT [] [12598]: [4531-1] db=,user= LOG:
restartpoint starting: time
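Would it help to create a permanent physical replication slot, so the
upstream retains WAL while pg_basebackup stalls, instead of the temporary
slot it creates itself? Something like (slot name made up):

```sql
-- On the upstream server: create a permanent physical slot,
-- then run pg_basebackup with -S standby_rebuild_slot
SELECT pg_create_physical_replication_slot('standby_rebuild_slot');

-- Drop it once the standby is streaming again:
-- SELECT pg_drop_replication_slot('standby_rebuild_slot');
```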
On Wed, 11 Mar 2020 at 01:53, Adrian Klaver
<adrian.klaver@xxxxxxxxxxx> wrote:
On 3/10/20 8:17 AM, Nicola Contu wrote:
Please post to list also.
Ccing list.
What came immediately before the temporary file error?
> 2020-03-10 15:10:17 GMT [[local]] [28171]: [1-1]
> db=postgres,user=postgres LOG: temporary file: path
> "base/pgsql_tmp/pgsql_tmp28171.0", size 382474936
> 2020-03-10 15:10:17 GMT [[local]] [28171]: [4-1]
> db=postgres,user=postgres LOG: could not send data to client:
Broken pipe
> 2020-03-10 15:10:17 GMT [[local]] [28171]: [5-1]
> db=postgres,user=postgres FATAL: connection to client lost
> 2020-03-10 15:10:26 GMT [] [12598]: [3544-1] db=,user= LOG:
> restartpoint complete: wrote 37315 buffers (0.4%); 0 WAL file(s)
> added, 0 removed, 16 recycled; write=269.943 s, sync=0.039 s,
> total=269.999 s; sync files=1010, longest=0.001 s, average=0.000 s;
> distance=175940 kB, estimate=416149 kB
> 2020-03-10 15:10:26 GMT [] [12598]: [3545-1] db=,user= LOG:
recovery
> restart point at 6424/1D7DEDE8
>
> It is a cascade replication
>
> On Tue, 10 Mar 2020 at 15:58, Adrian Klaver
> <adrian.klaver@xxxxxxxxxxx> wrote:
>
> On 3/10/20 2:26 AM, Nicola Contu wrote:
> > Hello,
> > I have two servers connected to the same switch running
postgres 11.5
> >
> > I am trying to replicate one of those servers after a planned
> work on
> > the master, so the replica has been lost. It has always worked
> but now I
> > get this :
> >
> > pg_basebackup: could not receive data from WAL stream: server
> closed the
> > connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >
> > I don't really understand what the issue is.
>
> I would start with the logs from the Postgres server you are
taking the
> backup from.
>
> > I had this issue last week as well in another DC and I had to
> reboot the
> > slave to make it working (not sure why it helped)
> >
> > Do you know what can cause this?
> >
> > Thank you,
> > Nicola
>
>
> --
> Adrian Klaver
> adrian.klaver@xxxxxxxxxxx
>
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx