Thanks guys, here's the information you requested:

psql:postgres@cipafilter = show work_mem;
 work_mem
──────────
 10MB
(1 row)

psql:postgres@cipafilter = select version();
                                                version
───────────────────────────────────────────────────────────────────────────────────────────────────────
 PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)

On Mon, Feb 13, 2017 at 1:26 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
> On Mon, Feb 13, 2017 at 9:40 AM, David Hinkle <hinkle@xxxxxxxxxxxxxx> wrote:
>>
>> I'm having trouble with purges related to a large table. The delete
>> query consumes RAM until Postgres crashes due to OOM. I have a very
>> large table called log_raw. There are half a dozen related tables,
>> such as 'urls' and 'titles'. log_raw.urlid = urls.urlid, and urls
>> contains the text of the various URLs, for example.
>>
>> Each time I try to purge these side tables, the unit OOMs.
>>
>> psql:postgres@cipafilter = explain DELETE FROM titles WHERE NOT EXISTS
>> ( SELECT 1 FROM log_raw WHERE log_raw.titleid = titles.titleid );
>>                                        QUERY PLAN
>> ─────────────────────────────────────────────────────────────────────────────────────────
>>  Delete on titles  (cost=22166473.44..24850954.67 rows=870382 width=12)
>>    ->  Hash Anti Join  (cost=22166473.44..24850954.67 rows=870382 width=12)
>>          Hash Cond: (titles.titleid = log_raw.titleid)
>>          ->  Seq Scan on titles  (cost=0.00..17871.64 rows=870664 width=10)
>>          ->  Hash  (cost=12744792.64..12744792.64 rows=542011264 width=10)
>>                ->  Seq Scan on log_raw  (cost=0.00..12744792.64 rows=542011264 width=10)
>> (6 rows)
>>
>> psql:postgres@cipafilter = select count(*) from (select titleid from
>> log_raw group by titleid) as a;
>>  count
>> ────────
>>  872210
>> (1 row)
>>
>> cipafilter=# select n_distinct from pg_stats where tablename =
>> 'log_raw' and attname = 'titleid';
>>  n_distinct
>> ------------
>>         282
>> (1 row)
>>
>> The planning data is wildly low for each of these fields, and I wonder
>> if because of that error the planner thinks it can keep all these IDs
>> in RAM while it works. ANALYZE doesn't fix it. Increasing the
>> statistics target improves the data in n_distinct, but not
>> considerably: increasing it 3- or 4-fold still leaves it wildly off.
>> ALTER TABLE ... SET (n_distinct = ...) doesn't seem to be used by the
>> planner, as it doesn't change any of the plans I've generated or seem
>> to be taken into account in the row estimates. I'm out of ideas.
>> Anybody have any ideas?
>
> Your data on log_raw.titleid is probably clustered, so that any given page
> of the table all has the same value for titleid. This really messes up the
> sampling algorithm used by ANALYZE. To overcome that, you would have to
> increase the statistics target by 3 or 4 orders of magnitude, not a factor
> of 3 or 4.
>
> However, that doesn't seem to be the actual problem. Surprisingly enough,
> a hash anti-join doesn't automatically de-duplicate the hash table as it
> is being built. So n_distinct correctly does not have an influence on the
> estimated RAM usage, because it doesn't influence the actual RAM usage
> either.
>
> It sounds like your work_mem is set way too high. What is it set to? And
> what version of PostgreSQL are you using?
>
> Cheers,
>
> Jeff

--
David Hinkle
Senior Software Developer
Phone: 800.243.3729x3000
Email: hinkle@xxxxxxxxxxxxxx
Hours: Mon-Fri 8:00AM-5:00PM (CT)
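
For readers following the statistics discussion above, a minimal sketch of what raising the per-column statistics target and overriding n_distinct could look like for the column in the plan. This is not something the thread settled on: 10000 is simply the largest value ALTER TABLE ... SET STATISTICS accepts, the n_distinct value is just the count observed in the GROUP BY query above, and the override only takes effect at the next ANALYZE.

ALTER TABLE log_raw ALTER COLUMN titleid SET STATISTICS 10000;
-- or pin the estimate directly to the observed distinct count
ALTER TABLE log_raw ALTER COLUMN titleid SET (n_distinct = 872210);
ANALYZE log_raw (titleid);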
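
Likewise, a possible workaround for the hash anti-join behaviour Jeff describes, sketched under the assumption that titleid is the only join column; the temp-table name is invented and this was not tested against the schema in the thread. The idea is to de-duplicate log_raw.titleid once, so whatever hash gets built only has to hold the roughly 870k distinct IDs rather than all 542 million log_raw rows.

CREATE TEMP TABLE live_titleids AS
    SELECT DISTINCT titleid FROM log_raw;   -- ~870k rows instead of 542M

ANALYZE live_titleids;

DELETE FROM titles t
WHERE NOT EXISTS (
    SELECT 1 FROM live_titleids l WHERE l.titleid = t.titleid
);

DROP TABLE live_titleids;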