Heyo,
I'm attempting to use logical decoding with the streaming replication protocol to perform change-data-capture on PostgreSQL 9.5.4. I'm seeing the replication stream "stall" for long periods during which the walsender process is pinned at 100% CPU utilization, but no data is sent to my client.
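For reference, a minimal stand-in for how my client is attached would look roughly like the following, using pg_recvlogical and the test_decoding plugin (my actual client is custom code speaking the streaming protocol, and the database/slot names here are placeholders):

    # one-time: create a logical slot with the test_decoding output plugin
    # (requires wal_level = logical and max_replication_slots > 0 in postgresql.conf)
    pg_recvlogical -d test -S cdc_slot --create-slot -P test_decoding

    # stream decoded changes to stdout
    pg_recvlogical -d test -S cdc_slot --start -f -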
The stalls occur unpredictably on my production system, but generally seem to be correlated with schema operations. My source database has about 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.
I've reproduced the same symptoms with two different approaches on my local machine. With both, I have a replication client connected via the streaming protocol.
In reproduction approach 1, I created one thread that inserts small batches of data, and another thread that creates a schema with 500 tables and then drops it. With this approach the walsender's CPU usage is pinned, but data does come out of the stream -- just excruciatingly slowly compared to the same test without the schema create & drop.
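A simplified sketch of that test is below (my actual test is a threaded program rather than shell loops, and the table/schema names and batch sizes are illustrative):

    # setup: a small table for the insert workload
    psql -d test -c "CREATE TABLE IF NOT EXISTS cdc_data (id serial PRIMARY KEY, payload text);"

    # "thread" 1: insert small batches of rows in a loop
    while true; do
      psql -d test -c "INSERT INTO cdc_data (payload) SELECT md5(random()::text) FROM generate_series(1, 10);"
    done &

    # "thread" 2: create a schema with 500 tables, then drop it, in a loop
    while true; do
      psql -d test -c "CREATE SCHEMA tenant_tmp;"
      psql -d test -c "DO \$\$ BEGIN FOR i IN 1..500 LOOP EXECUTE format('CREATE TABLE tenant_tmp.t%s (id int)', i); END LOOP; END \$\$;"
      psql -d test -c "DROP SCHEMA tenant_tmp CASCADE;"
    done &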
In reproduction approach 2, I created a database with 100,000 tables and performed a VACUUM FULL. The walsender goes to 100% CPU, and no data comes out of the replication stream for hours.
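The setup for that approach amounts to roughly this (again a sketch; the database name is a placeholder, and 100,000 single-column tables is just meant to match the scale of my production schema):

    # create a database with ~100,000 small tables
    createdb manytables
    psql -d manytables -c "DO \$\$ BEGIN FOR i IN 1..100000 LOOP EXECUTE format('CREATE TABLE t%s (id int)', i); END LOOP; END \$\$;"

    # with the replication client attached, rewrite every table
    psql -d manytables -c "VACUUM FULL;"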
I've captured a CPU sample with the OS X `sample` tool while running reproduction approach #1: https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8. It appears that most of the time is spent in the RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache-invalidation callbacks, both of which appear to invalidate entries based upon the cached values (i.e., by scanning) rather than by a direct lookup.
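For anyone who wants to reproduce the sampling, this is approximately what I ran (the pid below is a placeholder for the walsender backend's pid, taken from pg_stat_replication):

    # find the walsender pid for the replication connection
    psql -d test -c "SELECT pid, application_name FROM pg_stat_replication;"

    # sample that backend for 30 seconds
    sample 12345 30 -file walsender.sample.txt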
Has anyone else run into this kind of performance problem? Any thoughts on how it might be resolved? I don't mind putting in the work if someone can describe what is happening here and discuss with me what kind of changes might be necessary to improve the performance.
Thanks all,