Re: Help diagnosing replication (copy) error

 



On 3/8/24 14:50, Steve Baldwin wrote:

Hi,

I'm in the process of migrating a cluster from 15.3 to 16.2. We have a 'zero downtime' requirement so I'm using logical replication to create the new cluster and then perform the switch in the application.

I have a situation where all but one table have done their initial copy. The remaining table is the largest (of course), and the replication slot that is assigned for the copy (pg_378075177_sync_60067_7343845372910323059) is showing as 'active=false' if I select from pg_replication_slots on the publisher.
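For reference, table-sync slot names in recent PostgreSQL releases appear to follow the pattern pg_<subscription OID>_sync_<relation OID>_<system identifier>, so the name above can be decoded to confirm which table the slot belongs to. A minimal Python sketch, assuming that naming pattern holds; the helper name parse_sync_slot_name is hypothetical:

```python
import re
from typing import NamedTuple, Optional


class SyncSlot(NamedTuple):
    subscription_oid: int
    relation_oid: int
    system_identifier: int


# Assumed pattern: pg_<suboid>_sync_<relid>_<sysid>
_SYNC_SLOT_RE = re.compile(r"^pg_(\d+)_sync_(\d+)_(\d+)$")


def parse_sync_slot_name(name: str) -> Optional[SyncSlot]:
    """Split a table-sync slot name into its numeric parts, or return None."""
    match = _SYNC_SLOT_RE.match(name)
    if match is None:
        return None
    return SyncSlot(*(int(group) for group in match.groups()))


slot = parse_sync_slot_name("pg_378075177_sync_60067_7343845372910323059")
```

Here slot.relation_oid is 60067; running SELECT 60067::regclass on the subscriber should then name the table that is still being synchronized.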

I've checked the recent logs for both the publishing cluster and the subscribing cluster but I can't see any replication errors. I guess I could have missed them, but it doesn't seem like anything is being 'retried' like I've seen in the past with replication errors.
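Besides the logs, the subscriber's pg_subscription_rel catalog records a per-table sync state, which may show whether the remaining table is stuck in the copy phase. A rough Python sketch of interpreting it; the state codes are the documented srsubstate values, while the helper name unfinished_tables and the sample rows are made up for illustration:

```python
# Documented srsubstate codes from pg_subscription_rel.
SRSUBSTATE = {
    "i": "initialize",
    "d": "data is being copied",
    "f": "finished table copy",
    "s": "synchronized",
    "r": "ready (normal replication)",
}

# Run this on the subscriber with any client or driver.
SYNC_STATE_QUERY = """
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel
"""


def unfinished_tables(rows):
    """Return (table, description) for every table not yet in the 'r' state."""
    return [
        (table, SRSUBSTATE.get(state, "unknown"))
        for table, state in rows
        if state != "r"
    ]


# Hypothetical result rows, for illustration only.
sample_rows = [("small_table", "r"), ("big_table", "d")]
pending = unfinished_tables(sample_rows)
```

A table that stays in state 'd' while its sync slot on the publisher is inactive would suggest the table-sync worker is not currently connected.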

I've used this mechanism for zero-downtime upgrades multiple times in the past, and have recently used it to upgrade smaller clusters from 15.x to 16.2 without issue.

The clusters are hosted on AWS RDS, so I have no access to the servers, but if that's the only way to diagnose the issue, I can create a support case.

Does anyone have any suggestions as to where I should look for the issue?

Thanks,

Steve

In our setup we're logically replicating a 450G database hosted on real hardware to an RDS instance.

Multiple times we've had replication simply stop and we could never find any reason for that on either publisher or subscriber.

The *only* solution that ever worked in these cases was dropping the subscription in RDS and re-creating it with (copy_data = false).
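As a sketch of that drop-and-re-create step, the following Python builds the two statements to run on the subscriber. The function names and the example subscription, publication, and connection values are hypothetical, and the literal quoting assumes standard_conforming_strings is on:

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote an SQL string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def recreate_subscription_sql(sub: str, conninfo: str, publication: str) -> list:
    """Build statements that drop a subscription and re-create it without
    the initial table copy."""
    return [
        f"DROP SUBSCRIPTION {quote_ident(sub)};",
        f"CREATE SUBSCRIPTION {quote_ident(sub)} "
        f"CONNECTION {quote_literal(conninfo)} "
        f"PUBLICATION {quote_ident(publication)} "
        f"WITH (copy_data = false);",
    ]


# Hypothetical names and connection string, for illustration only.
statements = recreate_subscription_sql(
    "my_sub", "host=publisher.example dbname=app user=repl", "my_pub"
)
```

By default DROP SUBSCRIPTION also drops the associated slot on the publisher, so any changes retained there are discarded along with it.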

At that point replication picks right up again for new transactions, *but* at the expense of losing all of the WAL that should have been replicated during the outage. I wrote a Python-based "logical replication fixer" to fill in those gaps.

Given that the subscriber is the one that initiates the connection to the publisher, and that replication resumes as soon as the subscription is dropped and restarted, my hunch is that this is squarely on RDS. With both publisher and subscriber on RDS, as in your case, YMMV.

RDS is a black box--who knows what's really going on there?  It would be interesting to see what the response is after you open a support case.  I hope you'll be able to share that with the list.

Jeff