On Wed, Jul 13, 2022 at 2:20 AM, Michael Paquier <michael@xxxxxxxxxxx> wrote:
On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
> So far, it has happened three times (during a single week) from the 14.3 ->
> 14.4 upgrade, before 14.4 we haven't suffered from such an issue.
>
> Questions:
> 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the instance
> and reinstalling 14.3 PG packages (to prove, that the issue disappear)?
> 2) What is the best way to diagnose what is the root cause?
Hmm. 14.4 has nothing in its release notes that would point to a
change in the vacuum or autovacuum's code paths:
https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
There is nothing specific after a look at the changes, and I am not
seeing anything that would imply a change in memory context handling
either:
`git log --stat REL_14_3..REL_14_4`
`git diff REL_14_3..REL_14_4 -- *.c`
Saying that, you should be able to downgrade safely, as there are no
changes in the WAL format or the like that would break things. That
said, the corruption issue caused by CONCURRENTLY is something you'd
still have to face.
Thanks, good to know. We can use it for a test case, since we already hit the CONCURRENTLY bug on 14.3.
> 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR: out of
> memory
> 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL: Failed
> on request of size 152094068 in memory context "TopTransactionContext".
> 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:
> automatic vacuum of table "prematch.replication.tab_queue_tmp"
This is the interesting part. Do you happen to use logical
replication in a custom C++ plugin?
We are using logical replication to other instances (pgoutput) and decoderbufs
(https://github.com/debezium/postgres-decoderbufs) for other applications.
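For reference, this is roughly how I check which slots and output plugins are in use on the primary (the database name is the one shown in the \dx output below):

    # list logical replication slots and their output plugins (pgoutput / decoderbufs)
    psql -d prematch -c "SELECT slot_name, plugin, slot_type, active, restart_lsn
                           FROM pg_replication_slots;"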
> 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG: could
> not fork worker process: Cannot allocate memory
> terminate called after throwing an instance of 'std::bad_alloc'
> what(): std::bad_alloc
>
> DETAIL: parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
> terminate
> called after throwing an instance of 'std::bad_alloc' terminate called
> after throwing an instance of 'std::bad_alloc' what(): what():
> std::bad_allocstd::bad_alloc 2022-07-08 14:54:23 CEST [4476]: [49-1]
> user=,db=,host=,app= LOG: background worker "parallel worker" (PID 25251)
> was terminated by signal 6: Aborted
> 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG:
> terminating any other active server processes
Looks like something is going wrong in the memory handling of one of
your C++ extensions here. If you can isolate an issue using a query
without any custom code, that would be a Postgres problem, but I think
that you are missing a trick in it.
Here are the extensions installed in the database served by the cluster:
prematch=# \dx
List of installed extensions
Name | Version | Schema | Description
--------------------+---------+------------+------------------------------------------------------------------------
amcheck | 1.3 | public | functions for verifying relation integrity
dblink | 1.2 | public | connect to other PostgreSQL databases from within a database
file_fdw | 1.0 | public | foreign-data wrapper for flat file access
hstore | 1.8 | public | data type for storing sets of (key, value) pairs
hypopg | 1.3.1 | public | Hypothetical indexes for PostgreSQL
pageinspect | 1.9 | public | inspect the contents of database pages at a low level
pg_buffercache | 1.3 | public | examine the shared buffer cache
pg_stat_kcache | 2.2.0 | public | Kernel statistics gathering
pg_stat_statements | 1.9 | public | track planning and execution statistics of all SQL statements executed
pgcrypto | 1.3 | public | cryptographic functions
pgstattuple | 1.5 | public | show tuple-level statistics
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
plpgsql_check | 2.1 | public | extended check for plpgsql functions
postgres_fdw | 1.1 | public | foreign-data wrapper for remote PostgreSQL servers
tablefunc | 1.0 | public | functions that manipulate whole tables, including crosstab
(15 rows)
When we started experiencing these issues, based on the Prometheus node-exporter statistics, I began collecting /proc/<pid>/status for all PG instance processes in an infinite loop.
From this I can extract some data: RssAnon is the memory I can see growing, from several hundred kB to 1-2 GB over some period (depending on the workload, usually about 24 hours).
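The collection loop is essentially something like this (the output directory, the 60 s interval, and the "postgres" OS user below are illustrative, not the exact script I run):

    #!/bin/bash
    # Sample memory-related lines of /proc/<pid>/status for every postgres process once a minute.
    OUTDIR=/var/tmp/pg_mem_samples
    mkdir -p "$OUTDIR"
    while true; do
        ts=$(date +%Y%m%d%H%M%S)
        for pid in $(pgrep -u postgres); do
            # keep only Name, VmRSS, RssAnon, RssFile, RssShmem; prefix with timestamp and PID
            grep -E '^(Name|VmRSS|RssAnon|RssFile|RssShmem):' "/proc/$pid/status" 2>/dev/null \
                | sed "s/^/$ts $pid /" >> "$OUTDIR/status_$(date +%Y%m%d).log"
        done
        sleep 60
    done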
The sessions where RssAnon grows belong to PHP client applications; since there are multiple connections, together they might consume 10 GB of memory. It is the same application as before the upgrade (I've asked about deployed changes, but none was released close to the upgrade time).
Comparing with the Prometheus server data, RssAnon seems to follow the node_memory_Active_anon_bytes node-exporter metric directly. When the "suspected" sessions disconnect, node_memory_Active_anon_bytes drops.
I've checked the node_memory_Active_anon_bytes metric before the upgrade and, with the same application workload, it was quite stable on PG 13.8 and later on 14.3. Immediately after the upgrade, node_memory_Active_anon_bytes started to grow, and once the commit limit was reached, the fork()/out-of-memory issues appeared.
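For clarity, by "commit limit" I mean the kernel overcommit limit; the raw values behind the node-exporter metrics can be read directly from the kernel, e.g.:

    # CommitLimit vs. Committed_AS, the same values node_exporter reports
    # as node_memory_CommitLimit_bytes / node_memory_Committed_AS_bytes
    grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo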
So, for the application user running the PHP code, I've set log_min_duration_statement = 0, but no unexpected statements or arguments (some functions take an int array argument) were found in the logfile.
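I set it at the role level rather than instance-wide, roughly like this (the role name is anonymized/illustrative):

    # takes effect for sessions opened after the change
    psql -d prematch -c "ALTER ROLE app_php SET log_min_duration_statement = 0;"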
Checking the RssAnon values from /proc/<pid>/status, I've found some points where RssAnon usage grew very steeply within a minute, but no "suspicious" queries/arguments were found in the instance logfile.
Any hint on how to find the root cause would be appreciated, since so far I've failed to isolate the issue into a reproducible test case.
At least I hope that watching the RssAnon process memory is an appropriate metric; if not, let me know and I'll try to update the monitoring to get to the root cause.
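If it would help, next time a backend starts growing I can also dump its memory context statistics; as far as I understand the PostgreSQL 14 docs, something like this should show where the memory sits (the PID below is illustrative):

    # substitute the backend whose RssAnon keeps growing; the function is superuser-only
    # and writes the per-context breakdown to the server log
    PID=12345
    psql -d prematch -c "SELECT pg_log_backend_memory_contexts($PID);"
    # for the current session the same information is available as a view
    psql -d prematch -c "SELECT name, parent, total_bytes, used_bytes
                           FROM pg_backend_memory_contexts
                          ORDER BY total_bytes DESC
                          LIMIT 20;"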
I can imagine a workaround where the client application reconnects regularly..., but up to 14.3 it worked fine, so I'd like to fix the issue either on our application side or on the PG side if it is a PG problem.
Kind regards, Ales Zeleny