Different plan for very similar queries

"Peter J. Holzer" <hjp-pgsql@xxxxxx> · Fri, 29 May 2015 10:55:44 +0200

wdsah=> select version();
                                            version                                            
-----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.15 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
(1 row)

I plan to upgrade to Debian 8 (with Postgres 9.4) soon, so the problem
may go away, but I would still like to understand what is happening
here.

IRL the queries are a bit more complicated (they involve two additional
tables), but I can demonstrate it with just two:

wdsah=> \d facttable_stat_fta4
              Table "public.facttable_stat_fta4"
       Column        |            Type             | Modifiers 
---------------------+-----------------------------+-----------
 macrobondtimeseries | character varying(255)      | not null
 date                | date                        | not null
 value               | double precision            | 
 berechnungsart      | character varying           | 
 einheit             | character varying           | 
 kurzbezeichnung     | character varying           | 
 partnerregion       | character varying           | 
 og                  | character varying           | 
 sitcr4              | character varying           | 
 warenstrom          | character varying           | 
 valid_from          | timestamp without time zone | 
 from_job_queue_id   | integer                     | 
 kommentar           | character varying           | 
Indexes:
    "facttable_stat_fta4_pkey" PRIMARY KEY, btree (macrobondtimeseries, date)
    "facttable_stat_fta4_berechnungsart_idx" btree (berechnungsart)
    "facttable_stat_fta4_einheit_idx" btree (einheit)
    "facttable_stat_fta4_og_idx" btree (og)
    "facttable_stat_fta4_partnerregion_idx" btree (partnerregion)
    "facttable_stat_fta4_sitcr4_idx" btree (sitcr4)
    "facttable_stat_fta4_warenstrom_idx" btree (warenstrom)

wdsah=> select count(*) from facttable_stat_fta4;
  count   
----------
 43577941
(1 row)

wdsah=> \d term
                              Table "public.term"
         Column         |            Type             |       Modifiers        
------------------------+-----------------------------+------------------------
 facttablename          | character varying           | 
 columnname             | character varying           | 
 term                   | character varying           | 
 concept_id             | integer                     | not null
 language               | character varying           | 
 register               | character varying           | 
 hidden                 | boolean                     | 
 cleansing_job_queue_id | integer                     | not null default (-1)
 meta_insert_dt         | timestamp without time zone | not null default now()
 meta_update_dt         | timestamp without time zone | 
 valid_from             | timestamp without time zone | 
 from_job_queue_id      | integer                     | 
Indexes:
    "term_concept_id_idx" btree (concept_id)
    "term_facttablename_columnname_idx" btree (facttablename, columnname)
    "term_facttablename_idx" btree (facttablename)
    "term_facttablename_idx1" btree (facttablename) WHERE facttablename IS NOT NULL AND columnname::text = 'macrobondtimeseries'::text
    "term_language_idx" btree (language)
    "term_register_idx" btree (register)
    "term_term_ftidx" gin (to_tsvector('simple'::regconfig, term::text))
    "term_term_idx" btree (term)
Check constraints:
    "term_facttablename_needs_columnname_chk" CHECK (facttablename IS NULL OR columnname IS NOT NULL)
Foreign-key constraints:
    "term_concept_id_fkey" FOREIGN KEY (concept_id) REFERENCES concept(id) DEFERRABLE

wdsah=> select count(*) from term;
  count  
---------
 6109087
(1 row)

The purpose of the query is to find all terms which occur is a given
column of the facttable (again, IRL this is a bit more complicated),
basically an optimized version of select distinct.

Some of my columns have very few distinct members:

wdsah=> select * from pg_stats where tablename='facttable_stat_fta4' and attname in ('einheit', 'berechnungsart', 'warenstrom');
 schemaname |      tablename      |    attname     | inherited | null_frac | avg_width | n_distinct | most_common_vals |  most_common_freqs  | histogram_bounds | correlation 
------------+---------------------+----------------+-----------+-----------+-----------+------------+------------------+---------------------+------------------+-------------
 public     | facttable_stat_fta4 | berechnungsart | f         |         0 |         2 |          2 | {n,m}            | {0.515167,0.484833} |                  |    0.509567
 public     | facttable_stat_fta4 | einheit        | f         |         0 |         3 |          2 | {EUR,kg}         | {0.515167,0.484833} |                  |    0.491197
 public     | facttable_stat_fta4 | warenstrom     | f         |         0 |         2 |          2 | {X,M}            | {0.580267,0.419733} |                  |   -0.461344
(3 rows)

And for some of them my query is indeed very fast:

wdsah=> explain analyze select facttablename, columnname, term, concept_id, t.hidden, language, register 
        from term t where facttablename='facttable_stat_fta4' and columnname='einheit' and exists (select 1 from facttable_stat_fta4 f where f.einheit=t.term );
                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..384860.48 rows=1 width=81) (actual time=0.061..0.119 rows=2 loops=1)
   ->  Index Scan using term_facttablename_columnname_idx on term t  (cost=0.00..391.46 rows=636 width=81) (actual time=0.028..0.030 rows=3 loops=1)
         Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'einheit'::text))
   ->  Index Scan using facttable_stat_fta4_einheit_idx on facttable_stat_fta4 f  (cost=0.00..384457.80 rows=21788970 width=3) (actual time=0.027..0.027 rows=1 loops=3)
         Index Cond: ((einheit)::text = (t.term)::text)
 Total runtime: 0.173 ms
(6 rows)

0.17 ms. Much faster than a plain select distinct over a table with 43
million rows could ever hope to be. 

warenstrom is very similar and the columns with more distinct values
aren't that bad either. 

But for column berechnungsart the result is bad:

wdsah=> explain analyze select facttablename, columnname, term, concept_id, t.hidden, language, register 
        from term t where facttablename='facttable_stat_fta4' and columnname='berechnungsart' and exists (select 1 from facttable_stat_fta4 f where f.berechnungsart=t.term );
                                                                                         QUERY PLAN                                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Semi Join  (cost=316864.57..319975.79 rows=1 width=81) (actual time=7703.917..30948.271 rows=2 loops=1)
   Merge Cond: ((t.term)::text = (f.berechnungsart)::text)
   ->  Index Scan using term_term_idx on term t  (cost=0.00..319880.73 rows=636 width=81) (actual time=7703.809..7703.938 rows=3 loops=1)
         Filter: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text))
   ->  Index Scan using facttable_stat_fta4_berechnungsart_idx on facttable_stat_fta4 f  (cost=0.00..2545748.85 rows=43577940 width=2) (actual time=0.089..16263.582 rows=21336180 loops=1)
 Total runtime: 30948.648 ms
(6 rows)

Over 30 seconds! That's almost 200'000 times slower. 

The weird thing is that for this particular table einheit and
berechnungsart actually have a 1:1 correspondence. Not only is the
frequency the same, every row where einheit='kg' has berechnungsart='m'
and every row where einheit='EUR' has berechnungsart='n'. So I don't see
why two different execution plans are chosen.

	hp

-- 
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants 
| |   | hjp@xxxxxx         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/
Attachment:
signature.asc

Description: Digital signature