Hi again, it seems it has ended, with the correct counts for both tables. The sync phase runs in one single transaction; it seems a DB (subscriber) restart cancels it and makes the COPY start all over, hence the lost space. I will have to try once again from scratch, without any interruptions.
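For reference, something like the following on the subscriber shows the per-table sync state; the meaning of the srsubstate letters is my reading of the pg_subscription_rel docs (i = init, d = data copy in progress, f = finished copy, s = synchronized, r = ready):

-- subscriber (16.3): per-table sync state for this subscription
select srrelid::regclass as table_name, srsubstate, srsublsn
  from pg_subscription_rel
 where srsubid = (select oid from pg_subscription
                   where subname = 'data_for_nu_pgsql_dynacom');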
On 12/6/24 21:09, Achilleas Mantzios wrote:

The two COPY processes on the publisher have been CPU active since about 11:27:00 EEST this morning.

On 12/6/24 20:14, Achilleas Mantzios wrote:

Hello,
As the subject says, we are trying to test a migration plan from 10.23 to 16.3, using logical replication to a new OS, plus checksums, ICU, and SQL_ASCII -> UTF-8 (some of which proved to be a challenge). After solving all of the problems and repeating the process, the system now seems to have gotten into a weird state, in which all of the ~930 tables got synced correctly except the two largest ones. The DB in total (in its 10.23 version) takes up 4.6 TB. In 16.3 it has already reached 4.8 TB and keeps growing. There are no ERRORs or ROLLBACKs that would justify the growth (to my limited knowledge).
Those two tables keep growing and growing in the 16.3 (subscriber) DB.
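To watch how far the initial copy of those two tables has gotten, I believe the tablesync workers' COPY shows up in pg_stat_progress_copy on the subscriber (the view exists since v14, so only the 16.3 side can be checked this way); something like:

-- subscriber (16.3): progress of the COPY each tablesync worker is running
select relid::regclass as table_name, command,
       pg_size_pretty(bytes_processed) as copied, tuples_processed
  from pg_stat_progress_copy;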
In 10.23 (publisher):

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
 pg_total_relation_size | pg_size_pretty |          oid
------------------------+----------------+------------------------
          1082956537856 | 1009 GB        | marinerpapers_atts
           973440409600 | 907 GB         | marinerpapers_atts_tmp
(2 rows)

postgres@[local]/dynacom=#
In 16.3 (subscriber):

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
 pg_total_relation_size | pg_size_pretty |          oid
------------------------+----------------+------------------------
          1301943779328 | 1213 GB        | marinerpapers_atts
          1295188762624 | 1206 GB        | marinerpapers_atts_tmp
(2 rows)

The weird thing is that for every single one of the other big tables, the size on the new machine is smaller:
10.23:

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
 pg_total_relation_size | pg_size_pretty |          oid
------------------------+----------------+-----------------------
           634434142208 | 591 GB         | mail_entity
           568166334464 | 529 GB         | fb_reports_files
           538761527296 | 502 GB         | vmanews
           115932397568 | 108 GB         | approval_request_docs
           107413241856 | 100 GB         | status
            86767763456 | 81 GB          | fb_reports_dets
            82120941568 | 76 GB          | items_atts
            81043136512 | 75 GB          | items_atts_tmp
            59618181120 | 56 GB          | logging
            58586382336 | 55 GB          | apm_evidence
(10 rows)
16.3:

postgres@[local]/dynacom=# select pg_total_relation_size(oid), pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
 pg_total_relation_size | pg_size_pretty |          oid
------------------------+----------------+-----------------------
           542910595072 | 506 GB         | fb_reports_files
           515205160960 | 480 GB         | vmanews
           491888680960 | 458 GB         | mail_entity
           115919945728 | 108 GB         | approval_request_docs
            82050457600 | 76 GB          | items_atts
            80889028608 | 75 GB          | items_atts_tmp
            64453533696 | 60 GB          | fb_reports_dets
            57055133696 | 53 GB          | apm_evidence
            55286161408 | 51 GB          | logging
            52352860160 | 49 GB          | status
(10 rows)
For what it's worth, I restarted the target (subscriber) 16.3 DB today after recompiling with JIT support. I know 10.23 has long been out of support, but anyway, any clues or advice would be great.
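One thing I am wondering is whether that restart aborted the in-flight COPY transactions and left dead rows behind in the two tables, which would explain the extra space; if so, something like this on the subscriber should show them (at least until autovacuum gets to the tables):

-- subscriber (16.3): dead rows left over from an aborted COPY
select relname, n_live_tup, n_dead_tup, last_autovacuum
  from pg_stat_all_tables
 where relname in ('marinerpapers_atts', 'marinerpapers_atts_tmp');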
I guess the synchronization, since it is not based on WAL traffic, does not keep track of LSNs; the sync workers' replication slots on the publisher appear dead. Here is what I get on the 10.23 publisher:
postgres@[local]/dynacom=# select * from pg_replication_slots where slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
pg_54534123_sync_54522027_7374405793436622715 | pgoutput | logical | 16401 | dynacom | f | f | | | 1478474083 | 4056/C6000060 | 4056/C6000098
pg_54534123_sync_54522019_7374405793436622715 | pgoutput | logical | 16401 | dynacom | f | f | | | 1478474083 | 4056/C6000028 | 4056/C6000060
data_for_nu_pgsql_dynacom | pgoutput | logical | 16401 | dynacom | f | t | 5530 | | 1478474084 | 4056/C7014DF0 | 4056/C8000000
(3 rows)
postgres@[local]/dynacom=#
Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files dating back to 12:30 (maybe created via archive_timeout rather than by actual traffic).
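To confirm which segments the sync slots point at, the LSN-to-file mapping can be checked directly on the publisher, e.g.:

-- publisher (10.23): map the sync slots' restart LSNs to WAL file names
select slot_name, restart_lsn, pg_walfile_name(restart_lsn)
  from pg_replication_slots
 where slot_name ~* '_sync_';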
postgres@[local]/dynacom=# select * from pg_stat_replication where application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
 pid  | usesysid  | usename | application_name                               | client_addr | client_hostname | client_port | backend_start                 | backend_xmin | state     | sent_lsn      | write_lsn     | flush_lsn     | replay_lsn    | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------+-----------+---------+------------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+---------------+---------------+---------------+-----------+-----------+------------+---------------+------------
 5530 | 525054105 | repmgr  | data_for_nu_pgsql_dynacom                      | 10.9.230.24 |                 |       47204 | 2024-06-12 11:27:55.919914+03 |              | streaming | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 |           |           |            |             0 | async
 5533 | 525054105 | repmgr  | pg_54534123_sync_54522019_7374405793436622715 | 10.9.230.24 |                 |       47228 | 2024-06-12 11:27:55.944498+03 |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
 5534 | 525054105 | repmgr  | pg_54534123_sync_54522027_7374405793436622715 | 10.9.230.24 |                 |       47242 | 2024-06-12 11:27:55.95233+03  |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
(3 rows)
On the subscription side I get:
postgres@[local]/dynacom=# select * from pg_stat_subscription where subname ~* 'dynacom';
subid | subname | pid | leader_pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time
----------+---------------------------+---------+------------+----------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
54534123 | data_for_nu_pgsql_dynacom | 3143304 | | | 4056/C8000000 | 2024-06-12 22:41:23.801094+03 | 2024-06-12 22:41:57.062591+03 | 4056/C8000000 | 2024-06-12 22:41:23.801094+03
54534123 | data_for_nu_pgsql_dynacom | 3143305 | | 54522019 | | 2024-06-12 11:28:28.43723+03 | 2024-06-12 11:28:28.43723+03 | | 2024-06-12 11:28:28.43723+03
54534123 | data_for_nu_pgsql_dynacom | 3143306 | | 54522027 | | 2024-06-12 11:28:28.448293+03 | 2024-06-12 11:28:28.448293+03 | | 2024-06-12 11:28:28.448293+03
(3 rows)
Both COPY processes appear CPU active on the publisher, and the corresponding logical replication workers on the subscriber appear active as well.
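To see whether those two publisher backends are actually doing work or waiting on something, their pids (5533 and 5534 from pg_stat_replication above) can be looked up in pg_stat_activity, roughly like this (I am not certain the query column is populated for walsender backends on 10.x, but the wait_event columns should be):

-- publisher (10.23): what are the two sync walsenders doing / waiting on?
select pid, state, wait_event_type, wait_event,
       now() - backend_start as connected_for, left(query, 60) as query
  from pg_stat_activity
 where pid in (5533, 5534);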
--
Achilleas Mantzios
IT DEV - HEAD
IT DEPT
Dynacom Tankers Mgmt (as agents only)