I guess this doesn't work, latest test run crashed. It still uses the bad plan for the hostid column even after n_distinct is updated. cipafilter=# select attname, n_distinct from pg_stats where tablename cipafilter-# = 'log_raw' and (attname = 'urlid' or attname = 'titleid' or attname = cipafilter(# 'hostid'); attname | n_distinct ---------+------------- urlid | 1.51625e+08 hostid | 304710 titleid | 886499 (3 rows) cipafilter=# explain DELETE FROM hosts WHERE NOT EXISTS ( SELECT log_raw.hostid FROM log_raw WHERE log_raw.hostid = hosts.hostid ); QUERY PLAN ----------------------------------------------------------------------------------------- Delete on hosts (cost=22249475.67..74844813.47 rows=1 width=12) -> Hash Anti Join (cost=22249475.67..74844813.47 rows=1 width=12) Hash Cond: (hosts.hostid = log_raw.hostid) -> Seq Scan on hosts (cost=0.00..5017.10 rows=304710 width=10) -> Hash (cost=12799395.52..12799395.52 rows=543645052 width=10) -> Seq Scan on log_raw (cost=0.00..12799395.52 rows=543645052 width=10) (6 rows) I guess I will also try throwing in 'set enable_hashjoin = false;' and see if that gets these purges to go. On Thu, Feb 16, 2017 at 2:22 PM, David Hinkle <hinkle@xxxxxxxxxxxxxx> wrote: > Yep, 420ish million records out of 540 million records have a titleid > of 1. There are about 880,000 other unique values, but most of the > records are 1. Of course, n_distinct is only 292. I'm surprised > it's not eliminating the duplicates while it builds that hash table. > > This is what I'm doing for a work around right now. Getting > n_distinct right seems to be preventing the system from breaking. > It's going to be executed once a week during the weekly maintenance. > It's setting the n_distinct of each column to the number of rows in > the associated table. > > CREATE OR REPLACE FUNCTION patch_ndistinct(_table varchar, _column > varchar, _string_table varchar) > RETURNS real AS > $$ > DECLARE _cnt REAL; > BEGIN > SELECT reltuples INTO _cnt from pg_class where relname = _string_table; > EXECUTE 'ALTER TABLE ' || _table || ' ALTER COLUMN ' || _column > || ' SET (n_distinct=' || _cnt || ')'; > RETURN _cnt; > END > $$ LANGUAGE plpgsql; > select patch_ndistinct('log_raw', 'titleid', 'titles'); > select patch_ndistinct('log_raw', 'urlid', 'urls'); > select patch_ndistinct('log_raw', 'hostid', 'hosts'); > ANALYZE log_raw; > > On Thu, Feb 16, 2017 at 10:54 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote: >> David Hinkle <hinkle@xxxxxxxxxxxxxx> writes: >>> Tom, there are three columns in this table that exhibit the problem, >>> here is the statistics data after an analyze, and the real data to >>> compare it to. >> >>> attname | n_distinct | most_common_freqs >> >>> titleid | 292 | {0.767167} >> >> Ouch. That's saying there's some single value of titleid that accounts >> for more than three-quarters of the entries ... does that square with >> reality? That'd certainly explain why a hash join goes nuts. >> >> regards, tom lane > > > > -- > David Hinkle > > Senior Software Developer > > Phone: 800.243.3729x3000 > > Email: hinkle@xxxxxxxxxxxxxx > > Hours: Mon-Fri 8:00AM-5:00PM (CT) -- David Hinkle Senior Software Developer Phone: 800.243.3729x3000 Email: hinkle@xxxxxxxxxxxxxx Hours: Mon-Fri 8:00AM-5:00PM (CT) -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general