Re: Bad planning data resulting in OOM killing of postgres

Jeff Janes <jeff.janes@xxxxxxxxx> · Mon, 13 Feb 2017 11:26:17 -0800

On Mon, Feb 13, 2017 at 9:40 AM, David Hinkle <hinkle@xxxxxxxxxxxxxx> wrote:
I'm having trouble with purges related to a large table.   The delete

query consumes ram until postgres crashes due to OOM.   I have a very

large table called log_raw.  There are half a dozen related tables,

such as 'urls' and 'titles'.   log_raw.urlid = urls.urlid and urls

contains the text of the various urls, for example.

Each time I try to purge these side tables the unit OOM's.

psql:postgres@cipafilter = explain DELETE FROM titles WHERE NOT EXISTS

( SELECT 1 FROM log_raw WHERE log_raw.titleid = titles.titleid );

                                       QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────

 Delete on titles  (cost=22166473.44..24850954.67 rows=870382 width=12)

   ->  Hash Anti Join  (cost=22166473.44..24850954.67 rows=870382 width=12)

         Hash Cond: (titles.titleid = log_raw.titleid)

         ->  Seq Scan on titles  (cost=0.00..17871.64 rows=870664 width=10)

         ->  Hash  (cost=12744792.64..12744792.64 rows=542011264 width=10)

               ->  Seq Scan on log_raw  (cost=0.00..12744792.64

rows=542011264 width=10)

(6 rows)

psql:postgres@cipafilter = select count(*) from (select titleid from

log_raw group by titleid) as a;

 count

────────

 872210

(1 row)

cipafilter=# select n_distinct from pg_stats where tablename =

'log_raw' and attname = 'titleid';

 n_distinct

------------

        282

(1 row)

The planning data is wildly low for each of these fields, and I wonder

if because of that error the planner thinks it can keep all these id's

in ram while it works. Analyze doesn't fix it.   Increasing the

statistics target improves the data in n_distinct but not

considerably, as increasing it 3 or 4 fold leads to it still being

wildly off.  ALTER TABLE set n_distinct doesn't seem to be used by the

planner as it doesn't change any of the plans I've generated or seem

to be taken into account in the row estimates. I'm out of ideas.

Anybody have any ideas?

Your data on log_raw.titleid is probably clustered, so that any given page of the table all has the same value for titleid. This really messes up the sampling algorithm used by ANALYZE.  To overcome that, you would have to increase the statistics target by 3 or 4 orders of magnitude, not a factor of 3 or 4.

However, that doesn't seem to be the actual problem.  Surprisingly enough, a hash anti-join doesn't automatically de-duplicate the hash table as it is being built.  So n_distinct correctly does not have an influence on the estimated RAM usage, because it doesn't influence the actual ram usage either.

It sounds like your work_mem is set way too high.  What is it set to?  And what version of PostgreSQL are you using?

Cheers,

Jeff