On Thu, Jul 14, 2022, 23:11 Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxxx> wrote:
On 7/14/22 21:25, Aleš Zelený wrote:
>
> On Wed, Jul 13, 2022 at 2:20 AM Michael Paquier <michael@xxxxxxxxxxx> wrote:
>
> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
> > So far, it has happened three times (during a single week) since the
> > 14.3 -> 14.4 upgrade; before 14.4 we hadn't suffered from such an issue.
> >
> > Questions:
> > 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the
> >    instance and reinstalling the 14.3 PG packages (to prove that the
> >    issue disappears)?
> > 2) What is the best way to diagnose the root cause?
>
> Hmm. 14.4 has nothing in its release notes that would point to a
> change in the vacuum or autovacuum's code paths:
> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>
> There is nothing specific after a look at the changes, and I am not
> seeing anything that would imply a change in memory context handling
> either:
> `git log --stat REL_14_3..REL_14_4`
> `git diff REL_14_3..REL_14_4 -- *.c`
>
> Saying that, you should be able to downgrade safely as there are no
> changes in WAL format or such that would break things. That said,
> the corruption issue caused by CONCURRENTLY is something you'd still
> have to face.
>
>
> Thanks, good to know; we can use it for a test case, since we
> already hit the CONCURRENTLY bug on 14.3.
>
> > 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR: out of memory
> > 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL: Failed on request of size 152094068 in memory context "TopTransactionContext".
> > 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT: automatic vacuum of table "prematch.replication.tab_queue_tmp"
>
> This is the interesting part. Do you happen to use logical
> replication in a custom C++ plugin?
>
>
> We are using logical replication to other instances (pgoutput) and
> decoderbufs (https://github.com/debezium/postgres-decoderbufs) for
> other applications.
>
This is probably just a red herring - std::bad_alloc is what the process
that runs into the overcommit limit gets. But the real issue (e.g.
memory leak) is likely somewhere else - different part of the code,
different process ...
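A quick way to check how close the box is to the commit limit when these
errors appear (just a sketch, assuming a Linux host) is the kernel's
commit accounting:

    # overcommit mode/ratio configured on the host
    sysctl vm.overcommit_memory vm.overcommit_ratio
    # how much memory is committed right now vs. the limit
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo

With vm.overcommit_memory=2 the allocation that pushes Committed_AS over
CommitLimit is the one that fails, and that's not necessarily the process
that is actually leaking.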
> ...
>
> Checking RssAnon in /proc/<pid>/status I've found some points where
> RssAnon memory usage grew very steeply for a minute, but no "suspicious"
> queries/arguments were found in the instance logfile.
>
> Any hint on how to find the root cause would be appreciated, since so far
> I've failed to isolate the issue in a reproducible test case.
> At least I hope that watching the RssAnon process memory is an
> appropriate metric; if not, let me know and I'll try to update the
> monitoring to get to the root cause.
>
> I can imagine a workaround where the client application reconnects
> regularly..., but up to 14.3 it worked, so I'd like to fix the issue
> either on our application side or on the PG side if it is a PG problem.
>
I think it'd be interesting to get memory context stats from the
processes consuming a lot of memory. If you know which processes are
suspect (and it seems you do, because if a reconnect helps it's the
backend handling the connection), you can attach a debugger and do
$ gdb -p $PID
(gdb) call MemoryContextStats(TopMemoryContext)
which will log info about the memory contexts, just like the autovacuum
error above. Hopefully that tells us which memory context is bloated, and
that might point us to a particular part of the code.
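Since the instance is already on 14.x, another option (a sketch, assuming
superuser access) is pg_log_backend_memory_contexts(), added in PostgreSQL
14, which writes the same per-context statistics to the server log without
attaching a debugger:

    # hypothetical example: 12345 is the PID of a suspect backend
    psql -X -c 'SELECT pg_log_backend_memory_contexts(12345);'

Either way the stats end up in the server log, so they can be collected
even from backends that are hard to attach to.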
If the RssAnon memory is a good indicator, I can then determine the suspect backends and dump their memory contexts.
It'll take me some time since I'm out of the office on vacation, but I'll manage it somehow.
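In case it helps, this is roughly how I plan to pick the candidates
(a sketch, assuming the backends run as the postgres OS user):

    # list postgres processes by anonymous RSS, largest first
    for p in $(pgrep -u postgres postgres); do
        awk -v pid="$p" '/^RssAnon/ {print $2 " kB  pid=" pid}' "/proc/$p/status" 2>/dev/null
    done | sort -rn | head

and then attach gdb to the top PIDs and call MemoryContextStats() as you
suggest.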
Thanks for all the hints!
Aleš
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company