Hi,
We migrated from PostgreSQL 11 to 12 using logical replication (over the local network). Today
we noticed that one table is missing 1252 rows after the replication
finished and we flipped to the new primary (we still have the old master database, so we
can recover).
We see that these rows were all inserted into the table after the initial copy
of the table had started. Most of the missing rows (1230) seem to come from
inserts that happened **during the initial copy**, and the rest (22) from
inserts **during the period the replication ran** (7 days).
After further investigation, unfortunately more tables turn out to have missing rows; in all cases the missing rows were inserted after the initial table copy had started. We took a per-table approach for the replication, starting with creating an empty publication and adding tables via
ALTER PUBLICATION pg12_migration ADD TABLE FOO;
After that we refreshed the publication on the "new postgres 12 primary" using
ALTER SUBSCRIPTION pg12_migration REFRESH PUBLICATION;
We only added new tables after the initial copy of the previous one was done (the internal state was replicating).
We never stopped the subscriptions during all this and we started with a fresh schema.
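
For reference, the per-table state mentioned above can be read on the subscriber from the pg_subscription_rel catalog. This is only a minimal sketch of such a check, not a record of the exact query that was run:

```sql
-- Run on the subscriber (the new PostgreSQL 12 primary).
-- Lists every table known to each subscription with its sync state.
SELECT s.subname,
       sr.srrelid::regclass AS table_name,
       sr.srsubstate,  -- i = initialize, d = data is being copied,
                       -- s = synchronized, r = ready (normal replication)
       sr.srsublsn     -- end LSN for the s and r states
FROM pg_subscription_rel AS sr
JOIN pg_subscription AS s ON s.oid = sr.srsubid
ORDER BY s.subname, sr.srrelid;
```

A table would normally be considered caught up once srsubstate shows 'r'.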
We did some sanity checks before we switched to the new master, like
comparing max(id) to see if the replica was up to date (including for this
table) and comparing counts on some smaller tables, and that all checked out okay.
We never thought of rows missing somewhere in between.
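
To illustrate why max(id) alone cannot catch this: a gap in the middle of the id range leaves max(id) unchanged. A check that does expose such gaps is a per-range row count, run on both servers and compared. The sketch below is an assumption-laden example, not something from the migration itself: "foo" is a placeholder table name, "id" is assumed to be an integer key, and the bucket size of 100000 is arbitrary.

```sql
-- Run the same query on the old and the new primary, then diff the output.
-- "foo", the integer "id" column and the bucket size are placeholders.
SELECT id / 100000 AS bucket,     -- integer division groups ids into ranges
       count(*)    AS row_count,
       min(id)     AS min_id,
       max(id)     AS max_id
FROM foo
GROUP BY 1
ORDER BY 1;
```

Any bucket whose row_count differs between the two servers narrows down where rows are missing, even when max(id) matches on both sides. The comparison is only meaningful if neither side receives writes in the compared range while the queries run.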
So how can this happen?
Lars