Hi, here's an update on this old thread.
I've found the problem: it was caused by long transactions.
I reproduced the problem when I added a second publication and another postgres cluster with a subscription to it (still on the acceptance environment).
This query came in handy to show me what was happening:
select r.pid, a.state as query_state, a.wait_event, r.application_name, r.state as replication_state,
-- now()-a.query_start as query_time,
r.write_lag, pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.write_lsn)) as write_lag2,
r.flush_lag, pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.flush_lsn)) as flush_lag2,
r.replay_lag, pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn,r.replay_lsn)) as replay_lag2,
r.client_addr, a.backend_start, a.state_change, a.query_start, now()
from pg_stat_replication r
left join pg_stat_activity a on a.pid = r.pid
order by r.application_name;
What I saw was that replication would "start", then enter a "catchup" state and detect about 5 GB of data, which it would then process, so I would see the "lag_..." numbers decline as I repeated that query.
When the number hit 0 or a slightly negative number, it would take a minute (literally) and then the whole process was repeated. BTW I was seeing negative numbers at some point, so I swapped the arguments (e.g. r.write_lsn, r.sent_lsn) and changed them back later. So I think that the 5 GB was actually -5 GB, strangely. During the minute-long wait at near-zero lag, the wait_event would be "ReorderBufferWrite".
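In hindsight the sign flip makes sense: pg_wal_lsn_diff(lsn1, lsn2) returns lsn1 minus lsn2 in bytes, so swapping the arguments just flips the sign. A minimal illustration (the LSN values here are made up):
-- swapping the arguments of pg_wal_lsn_diff only changes the sign of the result
select pg_wal_lsn_diff('0/3000000'::pg_lsn, '0/2000000'::pg_lsn) as diff_positive,  -- 16777216
       pg_wal_lsn_diff('0/2000000'::pg_lsn, '0/3000000'::pg_lsn) as diff_negative;  -- -16777216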
In the log there was an important extra line that I hadn't noticed before, because it doesn't occur as often as all the lines about the workers exiting with an error.
2021-07-14 14:04:58.110 CEST [22782] ERROR: terminating logical replication worker due to timeout
2021-07-14 14:04:58.112 CEST [3720] LOG: background worker "logical replication worker" (PID 22782) exited with exit code 1
I was already suspecting long transactions, so I thought that this timeout might be part of the problem.
At that point I changed the wal_receiver_timeout option to 0 in the postgresql.conf file and the above behaviour ceased.
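For reference, a change like this is enough, since wal_receiver_timeout only needs a reload, not a restart. The ALTER SYSTEM variant below is just a sketch of the same change, assuming superuser access on the subscriber:
-- on the subscriber: 0 disables the timeout entirely
alter system set wal_receiver_timeout = 0;
select pg_reload_conf();
-- or equivalently, edit postgresql.conf: wal_receiver_timeout = 0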
Instead I saw the lag values increase rapidly, by about 100 GB per hour (and this is not a high-traffic database).
There were long transactions on the subscriber database, but merely killing those processes wasn't enough, because they would soon be started again.
I had to disable the job that started them for about 15 minutes; that did the trick. After that, those long transactions were not a problem anymore.
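To spot those long transactions on the subscriber, a query along these lines works (the 5-minute threshold is arbitrary, just what counted as "long" in this case):
-- long-running transactions on the subscriber, oldest first
select pid, now() - xact_start as xact_age, state, left(query, 60) as query
from pg_stat_activity
where xact_start is not null
  and now() - xact_start > interval '5 minutes'
order by xact_start;
-- and then, per offending pid: select pg_terminate_backend(<pid>);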
A guess at what is causing this:
The long transactions on the subscriber node are only a problem during the initial snapshot phase, which runs in a transaction and needs heavy locks on the tables that are to be replicated.
The initial snapshot was cancelled because it didn't end successfully within the wal_receiver_timeout, so it was restarted later.
When I disabled the timeout, all the WAL on the server was retained. There were still periodic retries, but there was no pending request for a lock, or the snapshot would have gotten one the moment I killed that long transaction.
But every time that the snapshot process was retried, there was already a new long transaction preventing the lock for the snapshot.
I had to prevent long transactions for long enough so that the initial snapshot could get a lock on the subscriber. However, I don't know the length of the timeout that defines how long a transaction can be without disturbing the snapshot.
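For what it's worth, the progress of the table synchronization workers can be followed on the subscriber. This is just a sketch using pg_subscription and pg_subscription_rel; it shows which tables are still stuck in the initialize/copy phase:
-- per-table sync state on the subscriber
select s.subname, sr.srrelid::regclass as table_name, sr.srsubstate
from pg_subscription s
join pg_subscription_rel sr on sr.srsubid = s.oid
order by s.subname, sr.srsubstate;
-- srsubstate: i = initialize, d = data copy, s = synchronized, r = ready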
Please correct me where I'm wrong.
Cheers,
Willy Loos
On Fri, Jun 11, 2021 at 9:51 AM Willy-Bas Loos <willybas@xxxxxxxxx> wrote:
Hi, I was going to follow up on this one, sorry for the long silence.
The replication is working fine now, and I have no idea what the problem was. Not cool.
If I find out, I will let you know.

On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek <tpo2@xxxxxxxxxxxxx> wrote:

Hi Willy-Bas Loos,
On 31.05.21 17:32, Willy-Bas Loos wrote:
>
>
> On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
> <vijaykumarjain.github@xxxxxxxxx
> <mailto:vijaykumarjain.github@xxxxxxxxx>> wrote:
>
> So I got it all wrong it seems :)
>
> Thank you for taking the time to help me!
>
> You upgraded to pg13 fine, but while on pg13 you have issues with
> logical replication?
>
> Yes, the upgrade went fine. So here are some details:
> I already had londiste running on postgres 9.3, but londiste wouldn't
> run on Debian 10
> So I first made the new server Debian 9 with postgres 9.6 and I started
> replicating with londiste from 9.3 to 9.6.
> When all was ready, I stopped the replication to the 9.6 server and
> deleted all londiste & pgq content with drop schema cascade.
> Then I upgraded the server to Debian 10. Then I used pg_upgrade to
> upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
> Then I added logical replication and a third server as a subscriber.
>
> I was going to write that replication is working fine (since the table
> contains a lot of data and there are no conflicts in the log), but it
> turns out that it isn't.
> The subscriber is behind and it looks like there hasn't been any
> incoming data after the initial data synchronization.
> So at least now I know that the WAL is being retained for a reason. The
> connection is working properly (via psql, anyway)
I once had what was maybe a similar problem, due to some ports that were
needed for replication being firewalled off, or the master having the
wrong IP address of the old master (now standby server), or some such.
There was absolutely no word anywhere in any log about the problem; I was
just seeing the new postgres master not starting up after hours and
hours of waiting after a failover. I somehow found out about the
required port being blocked (I don't remember how - maybe seeing the
unanswered SYNs in tcpdump? Or via ufw log entries?).
> I will also look into how to diagnose this from the system tables, e.g.
> subtracting LSNs to get some quantitative measure for the lag.
>
>
>
> There is a path in the postgresql source, under the subscription folder
> iirc, which covers various logical replication scenarios.
> That may help you, just in case.
>
> OK, so comments in the source code you mean?
>
--
Willy-Bas Loos
--
Willy-Bas Loos