Hi,

I'm trying to understand how to estimate and minimize the memory consumption of ANALYZE operations on "big" tsvector columns.

Context: the PostgreSQL server is 9.1.19 (Ubuntu package 9.1.19-0ubuntu0.12.04). I have a database on which I noticed that autovacuum operations could consume up to 2 GB of resident memory (observed with top's RSS column). This is sometimes problematic because with 3 autovacuum processes (the default value on Ubuntu), it leads to a peak usage of 6 GB and sends the server into swap-in/swap-out madness. This typically happens during restoration of dumps or massive updates in the database, which trigger the autovacuum processes and slow down the server for the duration of these operations because of the swapping. Until now we have addressed this behavior by either disabling autovacuum or temporarily bumping the VM's memory limit for the duration of the operation.

Now, I think I have managed to replicate and isolate the problem.

My analysis: I have a table with ~20k tuples and 350 columns of type int, text and tsvector. I created a copy of this table and iteratively dropped columns to see whether a specific column was the cause of the spike in memory usage. I arrived at the simple case of a table with a single tsvector column that causes ANALYZE to consume up to 2 GB of memory.

So, this table has a single column of type tsvector, and this column is quite big because it is originally the concatenation of all the other tsvector columns of the table (this tsvector column also has a GiST index). Here are the top 10 lengths for this column:

--8<--
# SELECT length(fulltext) FROM test ORDER BY length DESC LIMIT 10;
 length
--------
  87449
  87449
  87401
  87272
  87261
  87259
  87259
  87259
  87257
  87257
(10 rows)
-->8--

I tried playing with default_statistics_target (which is set to 100): if I reduce it to 5, the ANALYZE is almost immediate and consumes less than ~200 MB. At 10, the process starts to consume up to ~440 MB. I see no difference in PostgreSQL's plan selection between default_statistics_target 1 and 100: EXPLAIN ANALYZE shows the same plan being executed, using the GiST index (for a simple "SELECT count(ctid) FROM test WHERE fulltext @@ 'Hello'").

So:

- Is there a way to estimate or reduce ANALYZE's peak memory usage on this kind of table?
- Is it "safe" to set STATISTICS = 1 on this particular "big" tsvector column? Or could it have an adverse effect on query plan selection?

I'm currently in the process of upgrading to PostgreSQL 9.5, so I'll see whether the behavior changes on that version.

Thanks,
Jérôme
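
P.S. For what it's worth, my back-of-envelope model (please correct me if it is wrong): if I read the source correctly, ANALYZE samples up to 300 * statistics_target rows per column, so with the default target of 100 that is 30,000 rows, i.e. more than the ~20k tuples in my table, and every sampled tsvector has to be detoasted in memory. If the average detoasted value is on the order of 100 kB, that alone would account for roughly 20,000 * 100 kB ~= 2 GB, which matches what I observe. As a lower bound on that footprint, I'm measuring the stored size of the column with pg_column_size(); since that reports the possibly TOAST-compressed size, the detoasted in-memory size can only be larger:

--8<--
-- Stored (possibly TOAST-compressed) footprint of the column; the
-- detoasted size ANALYZE works with in memory can be noticeably larger.
SELECT count(*) AS rows,
       pg_size_pretty(sum(pg_column_size(fulltext))) AS stored_size,
       pg_size_pretty(avg(pg_column_size(fulltext))::bigint) AS avg_per_row
FROM test;
-->8--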
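
And if lowering the target only for this column turns out to be safe, I would rather use the per-column knob than change default_statistics_target globally. Something like this (with 5 just a value to be tuned, and -1 reverting the column to the default):

--8<--
-- Lower the statistics target for the big tsvector column only;
-- SET STATISTICS -1 would revert it to default_statistics_target.
ALTER TABLE test ALTER COLUMN fulltext SET STATISTICS 5;
ANALYZE VERBOSE test;

-- Re-check that the plan still uses the GiST index afterwards.
EXPLAIN ANALYZE SELECT count(ctid) FROM test WHERE fulltext @@ 'Hello';
-->8--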