On 7/14/22 21:25, Aleš Zelený wrote:
>
> On Wed, Jul 13, 2022 at 2:20, Michael Paquier <michael@xxxxxxxxxxx> wrote:
>
>> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
>>> So far, it has happened three times (during a single week) since the
>>> 14.3 -> 14.4 upgrade; before 14.4 we hadn't suffered from such an issue.
>>>
>>> Questions:
>>> 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the
>>>    instance and reinstalling the 14.3 PG packages (to prove that the
>>>    issue disappears)?
>>> 2) What is the best way to diagnose the root cause?
>>
>> Hmm. 14.4 has nothing in its release notes that would point to a change
>> in the vacuum or autovacuum code paths:
>> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>>
>> There is nothing specific after a look at the changes
>> (`git log --stat REL_14_3..REL_14_4`), and I am not seeing anything
>> that would imply a change in memory context handling either
>> (`git diff REL_14_3..REL_14_4 -- *.c`).
>>
>> That said, you should be able to downgrade safely, as there are no
>> changes in the WAL format or the like that would break things. The
>> corruption issue caused by CONCURRENTLY is, however, something you'd
>> still have to face.
>
> Thanks, good to know; we can use that for a test case, since we already
> hit the CONCURRENTLY bug on 14.3.
>
>>> 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR:
>>>   out of memory
>>> 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL:
>>>   Failed on request of size 152094068 in memory context
>>>   "TopTransactionContext".
>>> 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:
>>>   automatic vacuum of table "prematch.replication.tab_queue_tmp"
>>
>> This is the interesting part. Do you happen to use logical replication
>> in a custom C++ plugin?
>
> We are using logical replication to other instances (pgoutput) and
> decoderbufs (https://github.com/debezium/postgres-decoderbufs) for
> other applications.

This is probably just a red herring - std::bad_alloc is what the process
that runs into the overcommit limit gets. But the real issue (e.g. a
memory leak) is likely somewhere else - a different part of the code, a
different process ...

> ...
>
> Checking RssAnon in /proc/<pid>/status, I've found some points where
> RssAnon memory usage grew very steeply within a minute, but no
> "suspicious" queries/arguments were found in the instance logfile.
>
> Any hint on how to find the root cause would be appreciated, since so
> far I've failed to isolate a reproducible test case for the issue.
> At least I hope that watching the process's RssAnon is an appropriate
> metric; if not, let me know and I'll try to update the monitoring to
> get to the root cause.
>
> I can imagine a workaround where the client application reconnects
> regularly..., but up to 14.3 it worked, so I'd like to fix the issue
> either on our application side or on the PG side if it is a PG problem.

I think it'd be interesting to get memory context stats from the
processes consuming a lot of memory. If you know which processes are
suspect (and it seems you do, because if a reconnect helps it's the
backend handling the connection), you can attach a debugger and do

  $ gdb -p $PID
  (gdb) call MemoryContextStats(TopMemoryContext)

which will log info about the memory contexts, just like autovacuum did.
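FWIW the dump goes to the server log, one line per memory context,
roughly like this (the numbers below are made up, yours will differ):

  TopMemoryContext: 80800 total in 6 blocks; 13984 free (18 chunks); 66816 used
    TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
    CacheMemoryContext: 524288 total in 7 blocks; 82840 free (2 chunks); 441448 used
    ...
  Grand total: 123456789 bytes in 1234 blocks; 1234567 free (1234 chunks); 122222222 used

And since you're on 14, you should also be able to get the same dump
without attaching gdb, by calling the built-in function from a superuser
session (replace the argument with the pid of the suspect backend):

  SELECT pg_log_backend_memory_contexts(<pid of the suspect backend>);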
Hopefully that tells us which memory context is bloated, and that might
point us to the particular part of the code.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company