Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing

On 12/6/24 22:46, Achilleas Mantzios wrote:
On 12/6/24 21:09, Achilleas Mantzios wrote:
On 12/6/24 20:14, Achilleas Mantzios wrote:

Hello

As the subject says, we are testing a migration plan from 10.23 to 16.3 using logical replication, combined with a move to a new OS, data checksums, ICU, and an SQL_ASCII -> UTF-8 conversion (some of which proved to be a challenge). After solving all of those problems and repeating the process, the system now seems to be in a weird state: all of the ~930 tables got synced correctly except the two largest ones. The DB in total (in its 10.23 version) takes up 4.6 TB. In 16.3 it has already reached 4.8 TB and keeps growing. No ERRORs or ROLLBACKs that would justify the growth (to my limited knowledge).
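(For context, the replication setup itself is the stock publication/subscription approach; roughly the following, where the publication name and connection string are placeholders of mine and only the subscription name matches what shows up in the slots further down:)

-- on the 10.23 publisher (publication name is illustrative)
CREATE PUBLICATION pub_dynacom FOR ALL TABLES;

-- on the 16.3 subscriber (connection string abbreviated)
CREATE SUBSCRIPTION data_for_nu_pgsql_dynacom
    CONNECTION 'host=... dbname=dynacom ...'
    PUBLICATION pub_dynacom;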

Those two tables keep growing and growing in the 16.3 (subscriber) DB.

In 10.23 (publisher)

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
pg_total_relation_size | pg_size_pretty |          oid            
------------------------+----------------+------------------------
         1082956537856 | 1009 GB        | marinerpapers_atts
          973440409600 | 907 GB         | marinerpapers_atts_tmp
(2 rows)

postgres@[local]/dynacom=#

In 16.3 (subscriber)

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
pg_total_relation_size | pg_size_pretty |          oid            
------------------------+----------------+------------------------
         1301943779328 | 1213 GB        | marinerpapers_atts
         1295188762624 | 1206 GB        | marinerpapers_atts_tmp
(2 rows)

The weird thing is that for every other big table, the size on the new machine is smaller:

10.23:

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
pg_total_relation_size | pg_size_pretty |          oid           
------------------------+----------------+-----------------------
          634434142208 | 591 GB         | mail_entity
          568166334464 | 529 GB         | fb_reports_files
          538761527296 | 502 GB         | vmanews
          115932397568 | 108 GB         | approval_request_docs
          107413241856 | 100 GB         | status
           86767763456 | 81 GB          | fb_reports_dets
           82120941568 | 76 GB          | items_atts
           81043136512 | 75 GB          | items_atts_tmp
           59618181120 | 56 GB          | logging
           58586382336 | 55 GB          | apm_evidence
(10 rows)

16.3

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
pg_total_relation_size | pg_size_pretty |          oid           
------------------------+----------------+-----------------------
          542910595072 | 506 GB         | fb_reports_files
          515205160960 | 480 GB         | vmanews
          491888680960 | 458 GB         | mail_entity
          115919945728 | 108 GB         | approval_request_docs
           82050457600 | 76 GB          | items_atts
           80889028608 | 75 GB          | items_atts_tmp
           64453533696 | 60 GB          | fb_reports_dets
           57055133696 | 53 GB          | apm_evidence
           55286161408 | 51 GB          | logging
           52352860160 | 49 GB          | status
(10 rows)

For what it's worth, I restarted the target (subscriber) 16.3 DB today after recompiling with JIT support. I know 10.23 is long out of support, but any clues or advice would be great.

The two COPY processes on the publisher have been CPU-active since about 11:27:00 EEST this morning.
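(In case anyone wants to check the same thing: I am spotting the table-sync COPY backends on the publisher with something along these lines, output omitted:)

select pid, backend_start, state, query
from pg_stat_activity
where query ilike 'copy %';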

I guess that since the initial synchronization is not based on WAL traffic, it does not keep track of LSNs; the table-sync replication slots on the publisher appear inactive. Here is what I get on the 10.23 publisher:


postgres@[local]/dynacom=# select * from pg_replication_slots where slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
                  slot_name                   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn  
-----------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
pg_54534123_sync_54522027_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000060 | 4056/C6000098
pg_54534123_sync_54522019_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000028 | 4056/C6000060
data_for_nu_pgsql_dynacom                     | pgoutput | logical   |  16401 | dynacom  | f         | t      |       5530 |      |   1478474084 | 4056/C7014DF0 | 4056/C8000000
(3 rows)

postgres@[local]/dynacom=#

Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files dating back to 12:30 (possibly created via archive_timeout rather than actual traffic).
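(For reference, pg_walfile_name() maps an LSN to the corresponding WAL segment name, which is how I matched those LSNs to the files in the archive; a sketch run on the publisher, output omitted:)

select pg_walfile_name('4056/C6000060'), pg_walfile_name('4056/C6000028');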

postgres@[local]/dynacom=# select * from pg_stat_replication where application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
 pid  | usesysid  | usename |                application_name                | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |   sent_lsn    |   write_lsn   |   flush_lsn   |  replay_lsn   | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------+-----------+---------+------------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+---------------+---------------+---------------+-----------+-----------+------------+---------------+------------
 5530 | 525054105 | repmgr  | data_for_nu_pgsql_dynacom                      | 10.9.230.24 |                 |       47204 | 2024-06-12 11:27:55.919914+03 |              | streaming | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 |           |           |            |             0 | async
 5533 | 525054105 | repmgr  | pg_54534123_sync_54522019_7374405793436622715  | 10.9.230.24 |                 |       47228 | 2024-06-12 11:27:55.944498+03 |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
 5534 | 525054105 | repmgr  | pg_54534123_sync_54522027_7374405793436622715  | 10.9.230.24 |                 |       47242 | 2024-06-12 11:27:55.95233+03  |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
(3 rows)

On the subscription side I get:

postgres@[local]/dynacom=# select * from pg_stat_subscription where subname ~* 'dynacom';
 subid   |          subname          |   pid   | leader_pid |  relid   | received_lsn  |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time         
----------+---------------------------+---------+------------+----------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
54534123 | data_for_nu_pgsql_dynacom | 3143304 |            |          | 4056/C8000000 | 2024-06-12 22:41:23.801094+03 | 2024-06-12 22:41:57.062591+03 | 4056/C8000000  | 2024-06-12 22:41:23.801094+03
54534123 | data_for_nu_pgsql_dynacom | 3143305 |            | 54522019 |               | 2024-06-12 11:28:28.43723+03  | 2024-06-12 11:28:28.43723+03  |                | 2024-06-12 11:28:28.43723+03
54534123 | data_for_nu_pgsql_dynacom | 3143306 |            | 54522027 |               | 2024-06-12 11:28:28.448293+03 | 2024-06-12 11:28:28.448293+03 |                | 2024-06-12 11:28:28.448293+03
(3 rows)

Both COPY processes seem CPU-active on the publisher, and the corresponding logical replication (table sync) workers on the subscriber seem active as well.
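(On the subscriber, the per-table sync state can also be checked in pg_subscription_rel; a sketch, where srsubstate should stay at 'd' while the initial COPY is running and reach 'r' once the table is ready:)

select srrelid::regclass, srsubstate, srsublsn
from pg_subscription_rel
where srsubstate <> 'r'
order by 1;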

Hi again, it seems it finished, with the correct row counts for both tables. The initial sync phase for a table runs in one single transaction, so it seems a restart of the subscriber DB cancels it and the COPY starts all over again; hence the lost space. I will have to try once more from scratch, without any interruptions.
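(If my reading is right, the rows written by the interrupted COPY are simply dead tuples on the subscriber, so the extra space should show up as n_dead_tup there; a sketch of what I plan to check, with VACUUM FULL being the heavy-handed way to actually return the space to the OS:)

select relname, n_live_tup, n_dead_tup
from pg_stat_user_tables
where relname in ('marinerpapers_atts', 'marinerpapers_atts_tmp');

-- plain VACUUM only marks the space reusable inside the files;
-- VACUUM FULL rewrites the table and gives the space back to the OS
VACUUM FULL marinerpapers_atts;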

-- 
Achilleas Mantzios
 IT DEV - HEAD
 IT DEPT
 Dynacom Tankers Mgmt (as agents only)
