Re: logical replication problems (10.4)

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Mon, 6 Aug 2018 11:29:31 +0300



    Hello,

      
      yesterday, I increased both wal_sender_timeout and
      wal_receiver_timeout , from '1 min' to '5 min' on publisher /
      subscriber sides, then I dropped the subscription, I truncated all
      tables on the subscriber and recreated the subscription.

      In less than 9 hours all 1.2T worth of contents had been
      transferred to the subscriber, with no errors, I checked row
      counts everything looks good.

      
      Everything in pg_stat_replication, pg_stat_subscription ,
      pg_replication_slots , looks good.

      
      Unfortunately the day before I hadn't yet made myself comfortable
      with all the monitoring involved, therefore I might have missed
      something in my first email, or indeed may have done smth wrong.
      So I am not sure if only the increase in timeout might have such a
      tremendous impact on the reliability of the logical replication
      system. And it case it has an impact, I am sorry that I wasn't
      savvy enough to provide more info (or at least trigger an answer
      in this thread).

      
      Anyway, I am planning to add some tables to the publication.  I
      guess, I should add the tables to the publication, then 

      ALTER SUBSCRIPTION data_for_testsmadb_pub REFRESH PUBLICATION WITH (refresh=true) ;
      

      On 03/08/2018 14:48, Achilleas Mantzios wrote:

    
    Hello,
      

      yesterday we did the first effort to deploy our test database
      server (aimed at being used by developers and testers) using
      logical replication, using a publication for selected tables (563
      in total with some of them being pretty large with bytea atts).
      

      Expected total DB size : about 1.2 TB
      

      I started it (created the subscription) at about 2018-08-02
      11:35:00. By 2018-08-02 22:02:43.778 I got the last logical
      replication-related message on the subscriber side:
      

      2018-08-02 22:02:43.778 EEST  @ line:1 LOG:  logical replication
      table synchronization worker for subscription
      "data_for_testsmadb_pub", table "moc_final_persons" has started
      

      On the publisher side I got :
      

      2018-08-02 22:06:07.726 EEST data_for_testsmadb_pub repmgr@dynacom
      line:7 LOG:  terminating walsender process due to replication
      timeout
      

      2018-08-02 22:06:07.726 EEST data_for_testsmadb_pub repmgr@dynacom
      line:8 LOG:  disconnection: session time: 0:04:34.968 user=repmgr
      database=dynacom host=10.9.0.77 port=47820
      

      and later
      

      2018-08-02 22:10:15.153 EEST
      data_for_testsmadb_pub_33650_sync_22294 repmgr@dynacom line:5
      LOG:  duration: 25428550.886 ms  statement: COPY
      public.marinerpapers_atts TO STDOUT
      

      The nightly barman differential backup took more than twice the
      time and space than usual, totaling to some 500GB (whereas usually
      it is about 200 GB).
      

      In the morning I saw no activity. pg_stat_subscription showed last
      send / receipt time yesterday night (unfortunately I didn't keep
      this info).
      

      Then I reacted by restarting the subscriber database and this
      seemed to "wake up" the logical replication.
      

      Still this is in progress syncing a large table :
      marinerpapers_atts (about 400GB) (hopefully the last?).
      

      I don't know if all this is going according to the
      intended/documented behavior.
      

      Currently I have 2 wal sender / logical replication worker pairs :
      

      one which is doing the streaming and from the looks of it it is
      pretty close to the LSNs of your upstream physical standby and
      

      one which still tries to sync this very large table :
      marinerpapers_atts .
      

      While I was at it I did some tests on a similar table named :
      marinerpapers_atts_tmp , and here is what I get to my surprise :
      

      Publisher :
      

      select count(*) from marinerpapers_atts_tmp;
      

       count
      

      --------
      

       673391
      

      (1 row)
      

      Subscriber :
      

      select count(*) from marinerpapers_atts_tmp;
      

       count
      

      --------
      

       673377
      

      (1 row)
      

      There is a 14 difference in row count, which seems to be
      consistent, if retry those counts in a while I'll get again 14
      difference in row count. I diffed two times and found out the 14
      missing rows. I got no errors regarding this table, neither on the
      publisher nor on the subscriber :
      

      Publisher : egrep -e 'data_for.*22304'
      data/log/postgresql-2018-08-02.log
      

      10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 15:32:41.181
      EEST data_for_testsmadb_pub_33650_sync_22304 repmgr@dynacom line:3
      LOG:  logical decoding found consistent point at FE5/53827A50
      

      10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 15:32:41.181
      EEST data_for_testsmadb_pub_33650_sync_22304 repmgr@dynacom line:4
      DETAIL:  There are no running transactions.
      

      10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 22:01:27.502
      EEST data_for_testsmadb_pub_33650_sync_22304 repmgr@dynacom line:5
      LOG:  duration: 23326314.921 ms  statement: COPY
      public.marinerpapers_atts_tmp TO STDOUT
      

      10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 22:01:27.669
      EEST data_for_testsmadb_pub_33650_sync_22304 repmgr@dynacom line:6
      LOG:  disconnection: session time: 6:28:46.565 user=repmgr
      database=dynacom host=10.9.0.77 port=47819
      

      Subscriber : egrep -e 'marinerpapers_atts_tmp'
      data/log/postgresql-02.log
      

       [34052] 5b62f9e9.8504 2018-08-02 15:32:41.043 EEST  @ line:1
      LOG:  logical replication table synchronization worker for
      subscription "data_for_testsmadb_pub", table
      "marinerpapers_atts_tmp" has started
      

       [34052] 5b62f9e9.8504 2018-08-02 22:01:27.663 EEST  @ line:2
      LOG:  logical replication table synchronization worker for
      subscription "data_for_testsmadb_pub", table
      "marinerpapers_atts_tmp" has finished
      

      No error at all, but I got those 14 rows missing .
      

    -- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt