On Wed, Dec 9, 2020 at 2:21 AM Lars Vonk <lars.vonk@xxxxxxxxx> wrote:
Hi,

We are doing a logical Postgres replication from Postgres 11 to 12. Our database is around 700 GB (8 CPUs, 32 GB RAM).

During the replication process, at some point, we see a huge performance penalty on a particular table. This table acts as a queue, with lots of inserts and deletes happening throughout the day. Most of the time this table is empty, but during this performance penalty the number of rows grows to 10,000 and processing is not fast enough to empty it. The main reason for this (as far as we can see) is that the performance of the query that selects the next row to process drops from under 10 ms to around 400 ms. This eventually causes too much CPU load on the primary and we have to cancel the replication process.

We have already tried the initial load three times, and it consistently fails with the same "error". The last try was a per-table approach, excluding this "queue" table.

After cancelling the replication the query is fast again and the load on the primary goes back to normal. We see that this happens when replicating large tables (millions of rows). During the performance penalty, the EXPLAIN of the query selecting the next row from this table shows a sequential scan (there is an index, but it is not used).

- What could cause this performance penalty?
- Have other people experienced this as well during the initial load of a logical replication with large tables?
- We are now thinking of temporarily increasing the number of CPUs and the amount of RAM for the migration. Would this help in this case?

Thanks in advance,
Lars
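To give a rough picture of the pattern involved (the table and column names below are simplified placeholders, not the actual schema, and the exact query shape may differ), the dequeue query is of this general kind, and during the penalty an EXPLAIN like this shows it falling back to a sequential scan instead of using the index:

    -- hypothetical queue table; real names and predicates differ
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id
    FROM job_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED;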
I would guess that you are hitting bad plans (sequential scan instead of an index scan) because autovacuum is not picking up this table for vacuum/analyze as often as needed. You could try a periodic 'VACUUM ANALYZE' on the table while the logical replication is running, or look at your autovacuum settings and tune the system to be more aggressive by turning down the scale factor and cost delay. If the table is primarily insert-only, particularly with this initial load of data, then autovacuum may not pick it up at all. That pain point is mitigated by PG13's new behavior of kicking off autovacuum/analyze based on inserts as well.
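As a rough sketch of what I mean (the table name is a placeholder and the threshold values are only starting points to experiment with, not recommendations tuned for your workload):

    -- run periodically while the initial logical replication load is in progress
    VACUUM (ANALYZE) job_queue;

    -- or make autovacuum much more eager for this one table
    ALTER TABLE job_queue SET (
        autovacuum_vacuum_scale_factor = 0.0,
        autovacuum_vacuum_threshold = 1000,
        autovacuum_analyze_scale_factor = 0.0,
        autovacuum_analyze_threshold = 1000,
        autovacuum_vacuum_cost_delay = 0
    );

Setting the scale factors to zero makes the fixed thresholds the only trigger, so vacuum/analyze kicks in after a fixed number of changed rows rather than a fraction of the (small) table.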