Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing

On 12/6/24 22:46, Achilleas Mantzios wrote:
On 12/6/24 21:09, Achilleas Mantzios wrote:
On 12/6/24 20:14, Achilleas Mantzios wrote:

Hello

As the subject says, we are testing a migration plan from 10.23 to 16.3 using logical replication, combined with a move to a new OS, data checksums, ICU, and an SQL_ASCII -> UTF-8 conversion (some of which proved to be a challenge). After solving all of those problems and repeating the process, the system now seems to be in a weird state: all of the ~930 tables got synced correctly except the two largest ones. The DB in total (in its 10.23 version) takes up 4.6 TB. In 16.3 it has already reached 4.8 TB and keeps growing. No ERRORs or ROLLBACKs that would justify the growth (to my limited knowledge).
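(For context, the replication setup itself is the stock publication/subscription approach; roughly the following, where the publication name and connection string are placeholders of mine and only the subscription name matches what shows up in the slots further down:)

-- on the 10.23 publisher (publication name is illustrative)
CREATE PUBLICATION pub_dynacom FOR ALL TABLES;

-- on the 16.3 subscriber (connection string abbreviated)
CREATE SUBSCRIPTION data_for_nu_pgsql_dynacom
    CONNECTION 'host=... dbname=dynacom ...'
    PUBLICATION pub_dynacom;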

Those two tables keep growing and growing in the 16.3 (subscriber) DB.

In 10.23 (publisher)

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
pg_total_relation_size | pg_size_pretty |          oid            
------------------------+----------------+------------------------
         1082956537856 | 1009 GB        | marinerpapers_atts
          973440409600 | 907 GB         | marinerpapers_atts_tmp
(2 rows)

postgres@[local]/dynacom=#

In 16.3 (subscriber)

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
pg_total_relation_size | pg_size_pretty |          oid            
------------------------+----------------+------------------------
         1301943779328 | 1213 GB        | marinerpapers_atts
         1295188762624 | 1206 GB        | marinerpapers_atts_tmp
(2 rows)

The weird thing is that for every other big table, the size on the new machine is smaller:

10.23:

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
pg_total_relation_size | pg_size_pretty |          oid           
------------------------+----------------+-----------------------
          634434142208 | 591 GB         | mail_entity
          568166334464 | 529 GB         | fb_reports_files
          538761527296 | 502 GB         | vmanews
          115932397568 | 108 GB         | approval_request_docs
          107413241856 | 100 GB         | status
           86767763456 | 81 GB          | fb_reports_dets
           82120941568 | 76 GB          | items_atts
           81043136512 | 75 GB          | items_atts_tmp
           59618181120 | 56 GB          | logging
           58586382336 | 55 GB          | apm_evidence
(10 rows)

16.3

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
pg_total_relation_size | pg_size_pretty |          oid           
------------------------+----------------+-----------------------
          542910595072 | 506 GB         | fb_reports_files
          515205160960 | 480 GB         | vmanews
          491888680960 | 458 GB         | mail_entity
          115919945728 | 108 GB         | approval_request_docs
           82050457600 | 76 GB          | items_atts
           80889028608 | 75 GB          | items_atts_tmp
           64453533696 | 60 GB          | fb_reports_dets
           57055133696 | 53 GB          | apm_evidence
           55286161408 | 51 GB          | logging
           52352860160 | 49 GB          | status
(10 rows)

For what it's worth, I restarted the target (subscriber) 16.3 DB today after recompiling with JIT support. I know 10.23 is long out of support, but any clues or advice would be great.

The two COPY processes on the publisher have been CPU-active since about 11:27:00 EEST this morning.
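(In case anyone wants to check the same thing: I am spotting the table-sync COPY backends on the publisher with something along these lines, output omitted:)

select pid, backend_start, state, query
from pg_stat_activity
where query ilike 'copy %';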

I guess that since the initial synchronization is not based on WAL traffic, it does not keep track of LSNs; the table-sync replication slots on the publisher appear inactive. Here is what I get on the 10.23 publisher:


postgres@[local]/dynacom=# select * from pg_replication_slots where slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
                  slot_name                   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn  
-----------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
pg_54534123_sync_54522027_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000060 | 4056/C6000098
pg_54534123_sync_54522019_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000028 | 4056/C6000060
data_for_nu_pgsql_dynacom                     | pgoutput | logical   |  16401 | dynacom  | f         | t      |       5530 |      |   1478474084 | 4056/C7014DF0 | 4056/C8000000
(3 rows)

postgres@[local]/dynacom=#

Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files dating back to 12:30 (possibly created via archive_timeout rather than actual traffic).
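(For reference, pg_walfile_name() maps an LSN to the corresponding WAL segment name, which is how I matched those LSNs to the files in the archive; a sketch run on the publisher, output omitted:)

select pg_walfile_name('4056/C6000060'), pg_walfile_name('4056/C6000028');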

postgres@[local]/dynacom=# select * from pg_stat_replication where application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
 pid  | usesysid  | usename |                application_name                | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |   sent_lsn    |   write_lsn   |   flush_lsn   |  replay_lsn   | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------+-----------+---------+------------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+---------------+---------------+---------------+-----------+-----------+------------+---------------+------------
 5530 | 525054105 | repmgr  | data_for_nu_pgsql_dynacom                      | 10.9.230.24 |                 |       47204 | 2024-06-12 11:27:55.919914+03 |              | streaming | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 |           |           |            |             0 | async
 5533 | 525054105 | repmgr  | pg_54534123_sync_54522019_7374405793436622715  | 10.9.230.24 |                 |       47228 | 2024-06-12 11:27:55.944498+03 |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
 5534 | 525054105 | repmgr  | pg_54534123_sync_54522027_7374405793436622715  | 10.9.230.24 |                 |       47242 | 2024-06-12 11:27:55.95233+03  |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
(3 rows)

On the subscription side I get:

postgres@[local]/dynacom=# select * from pg_stat_subscription where subname ~* 'dynacom';
 subid   |          subname          |   pid   | leader_pid |  relid   | received_lsn  |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time         
----------+---------------------------+---------+------------+----------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
54534123 | data_for_nu_pgsql_dynacom | 3143304 |            |          | 4056/C8000000 | 2024-06-12 22:41:23.801094+03 | 2024-06-12 22:41:57.062591+03 | 4056/C8000000  | 2024-06-12 22:41:23.801094+03
54534123 | data_for_nu_pgsql_dynacom | 3143305 |            | 54522019 |               | 2024-06-12 11:28:28.43723+03  | 2024-06-12 11:28:28.43723+03  |                | 2024-06-12 11:28:28.43723+03
54534123 | data_for_nu_pgsql_dynacom | 3143306 |            | 54522027 |               | 2024-06-12 11:28:28.448293+03 | 2024-06-12 11:28:28.448293+03 |                | 2024-06-12 11:28:28.448293+03
(3 rows)

Both COPY processes seem CPU-active on the publisher, and the corresponding logical replication (table sync) workers on the subscriber seem active as well.
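(On the subscriber, the per-table sync state can also be checked in pg_subscription_rel; a sketch, where srsubstate should stay at 'd' while the initial COPY is running and reach 'r' once the table is ready:)

select srrelid::regclass, srsubstate, srsublsn
from pg_subscription_rel
where srsubstate <> 'r'
order by 1;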

Hi again, it seems it finished, with the correct row counts for both tables. The initial sync phase for a table runs in one single transaction, so it seems a restart of the subscriber DB cancels it and the COPY starts all over again; hence the lost space. I will have to try once more from scratch, without any interruptions.
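(If my reading is right, the rows written by the interrupted COPY are simply dead tuples on the subscriber, so the extra space should show up as n_dead_tup there; a sketch of what I plan to check, with VACUUM FULL being the heavy-handed way to actually return the space to the OS:)

select relname, n_live_tup, n_dead_tup
from pg_stat_user_tables
where relname in ('marinerpapers_atts', 'marinerpapers_atts_tmp');

-- plain VACUUM only marks the space reusable inside the files;
-- VACUUM FULL rewrites the table and gives the space back to the OS
VACUUM FULL marinerpapers_atts;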

-- 
Achilleas Mantzios
 IT DEV - HEAD
 IT DEPT
 Dynacom Tankers Mgmt (as agents only)
