Hello,
we are using logical replication from PostgreSQL 10.8 (Ubuntu 10.8-1.pgdg14.04+1) to PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1). Thre are 1305 replicated tables within 3 schemas. The tables are from small almost static lookup tables to larger tables up to ~40GB holding 44 millions of rows. In production, the replication works like a charm.
But we are experiencing an issue when we create clones of production databases to the development environment. Development is in a separated network, loaded from a testing data feed therefore development publisher database contains data in almost identical volumes and the same structure, but it is being continuously updated from the testing data feed.
To let all test work properly, replicated tables are truncated after subscriber database is cloned to the development environment (all restored subscriptions are dropped) and a new subscription to development publisher database is created with copy_data = true option to ensure data consistency.
Such an approach works well for more than a year using 2 of the present 3 schemas (most of the above-mentioned table count is in the long-used two schemas). While we have added the third schema with only 7 tables from 35 MB to 40 GB in size, we are experiencing an initial synchronization issue. The total volume of the three schemas is several hundreds of GB, so the initial table synchronization lasts usually for several hours. surprisingly, all of the 7 tables in the third schema, despite their size or row count fail due to primary key violation, like - duplicate key value violates unique constraint "party_pkey".
Such behavior makes me very surprised - based on the docs: https://www.postgresql.org/docs/11/logical-replication-architecture.html, 31.5.1. Initial Snapshot, I expect that after some time necessary for the table copy and afterward applied changes from WAL, there can't be a chance for duplicate rows, since both the publisher and the subscriber are using the same primary key as replica identity.
LOG: logical replication table synchronization worker for subscription "sub_anl_dsos_usd", table "party" has started
ERROR: duplicate key value violates unique constraint "party_pkey"
DETAIL: Key (party_id)=(\x0003dd00131add0a) already exists.
CONTEXT: COPY party, line 1
ERROR: duplicate key value violates unique constraint "party_pkey"
DETAIL: Key (party_id)=(\x0003dd00131add0a) already exists.
CONTEXT: COPY party, line 1
If, once this happens (the table was truncated before the subscription was created) I truncate the table on the subscriber, copy succeeded and from that moment replication works like a charm. I'm not aware of using replication different way then than for the two schemas we stared with and for them, the synchronization including the application of changes subsequent to subscription creation works without any issues.
Thanks for any hints on how to diagnose this issue, I can't find where I did a mistake and I did not expect that this might be related to higher change rate on publisher tables than the other tables.
Ales