Re: Logical replication - DDL sub transactions for script executed in single transaction?

Pavel Stehule <pavel.stehule@xxxxxxxxx> · Fri, 22 Mar 2019 11:43:54 +0100

pá 22. 3. 2019 v 10:49 odesílatel Aleš Zelený <zeleny.ales@xxxxxxxxx> napsal:
Hello,

I've learned that logical replication might have performance problem if there are lot of sub transactions within transaction (at least because it enforces spill files in pg_replslot and if there are many - like 80mio, EXT4 did not perform  well - in my case it was caused bu misuse of exception handling in trigger function) and once we get rid of this usage pattern logical replication works like a charm.
Until we have applied schema migration patch creating 700+ tables by flyway, therefore whole file was applied in single transaction (like psql -1 -f xxxx) and it effectively blocked our replication for 22 hours. No failures, wal sender processed did not consumed huge amount of memory (what happened in past wiht 80mi sub-trans caused by trigger), all wal senders did IO very rarely, most of the time they dwell on CPU.

The spill files were not so big:
# du -sh xid-924361123-lsn-FBC-C7000000.snap
224K    xid-924361123-lsn-FBC-C7000000.snap

It takes about 1 minute (all the time 100% CPU used by was sender) to move to following file.

Version, OS:
PostgreSQL 10.6 (Ubuntu 10.6-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit 

Perf:
Overhead  Command   Shared Object      Symbol
  57.79%  postgres  postgres           [.] hash_seq_search
  14.70%  postgres  postgres           [.] hash_search_with_hash_value
   5.93%  postgres  postgres           [.] LocalExecuteInvalidationMessage
   3.60%  postgres  postgres           [.] RelfilenodeMapInvalidateCallback
   2.99%  postgres  postgres           [.] hash_uint32
   2.50%  postgres  postgres           [.] hash_search
   1.91%  postgres  postgres           [.] CatCacheInvalidate
   1.83%  postgres  postgres           [.] CallSyscacheCallbacks
   1.37%  postgres  postgres           [.] hash_seq_init
   1.24%  postgres  postgres           [.] hash_seq_term
   1.02%  postgres  postgres           [.] RelationCacheInvalidateEntry
   0.97%  postgres  postgres           [.] ReorderBufferCommit
   0.97%  postgres  pgoutput.so        [.] rel_sync_cache_relation_cb
   0.78%  postgres  postgres           [.] uint32_hash
   0.73%  postgres  postgres           [.] PlanCacheRelCallback
   0.60%  postgres  postgres           [.] InvalidateCatalogSnapshot
   0.58%  postgres  postgres           [.] SysCacheInvalidate
   0.35%  postgres  postgres           [.] GetCurrentTransactionNestLevel
   0.13%  postgres  pgoutput.so        [.] _init
   0.00%  postgres  [kernel.kallsyms]  [k] __do_softirq
   0.00%  postgres  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   0.00%  postgres  postgres           [.] ResourceOwnerEnlargeRelationRefs

My suspect is that psql (of flyway) while processing a file with many create table and grant statements in single transaction mode might create sub transactions, but I haven't found it in docs.

Maybe you have disabled autocommit with ON_ERROR_ROLLBACK on

I don't know about another case, where psql can use subtransactions.

Pavel

Does you have similar experience? What is root cause for this issue?

Workaround is clear - do not create that large amount of DDL in single transaction, but it would be nice to know root cause to eliminate scripts/usage patterns causing this issue generally.

On the other hand, following transaction (100GB of spill files) , almost 150 mio rows deleted (and archived by statement trigger) behaves expected way - wal sender process was reading from disk (something like 130MB/s) and after ~20 minutes replication was in sync, all spill files were removed, so transaction changes volume itself does not seems to be any issue.

Thanks for any advice!

Kind regards Ales Zeleny
P.S.: PostgreSQL log facility is superb, having DDL statements logged and XID in log line prefix I was able to identify the issue source in this case.