wdsah=> select version(); version ----------------------------------------------------------------------------------------------- PostgreSQL 9.1.15 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit (1 row) I plan to upgrade to Debian 8 (with Postgres 9.4) soon, so the problem may go away, but I would still like to understand what is happening here. IRL the queries are a bit more complicated (they involve two additional tables), but I can demonstrate it with just two: wdsah=> \d facttable_stat_fta4 Table "public.facttable_stat_fta4" Column | Type | Modifiers ---------------------+-----------------------------+----------- macrobondtimeseries | character varying(255) | not null date | date | not null value | double precision | berechnungsart | character varying | einheit | character varying | kurzbezeichnung | character varying | partnerregion | character varying | og | character varying | sitcr4 | character varying | warenstrom | character varying | valid_from | timestamp without time zone | from_job_queue_id | integer | kommentar | character varying | Indexes: "facttable_stat_fta4_pkey" PRIMARY KEY, btree (macrobondtimeseries, date) "facttable_stat_fta4_berechnungsart_idx" btree (berechnungsart) "facttable_stat_fta4_einheit_idx" btree (einheit) "facttable_stat_fta4_og_idx" btree (og) "facttable_stat_fta4_partnerregion_idx" btree (partnerregion) "facttable_stat_fta4_sitcr4_idx" btree (sitcr4) "facttable_stat_fta4_warenstrom_idx" btree (warenstrom) wdsah=> select count(*) from facttable_stat_fta4; count ---------- 43577941 (1 row) wdsah=> \d term Table "public.term" Column | Type | Modifiers ------------------------+-----------------------------+------------------------ facttablename | character varying | columnname | character varying | term | character varying | concept_id | integer | not null language | character varying | register | character varying | hidden | boolean | cleansing_job_queue_id | integer | not null default (-1) meta_insert_dt | timestamp without time zone | not null default now() meta_update_dt | timestamp without time zone | valid_from | timestamp without time zone | from_job_queue_id | integer | Indexes: "term_concept_id_idx" btree (concept_id) "term_facttablename_columnname_idx" btree (facttablename, columnname) "term_facttablename_idx" btree (facttablename) "term_facttablename_idx1" btree (facttablename) WHERE facttablename IS NOT NULL AND columnname::text = 'macrobondtimeseries'::text "term_language_idx" btree (language) "term_register_idx" btree (register) "term_term_ftidx" gin (to_tsvector('simple'::regconfig, term::text)) "term_term_idx" btree (term) Check constraints: "term_facttablename_needs_columnname_chk" CHECK (facttablename IS NULL OR columnname IS NOT NULL) Foreign-key constraints: "term_concept_id_fkey" FOREIGN KEY (concept_id) REFERENCES concept(id) DEFERRABLE wdsah=> select count(*) from term; count --------- 6109087 (1 row) The purpose of the query is to find all terms which occur is a given column of the facttable (again, IRL this is a bit more complicated), basically an optimized version of select distinct. Some of my columns have very few distinct members: wdsah=> select * from pg_stats where tablename='facttable_stat_fta4' and attname in ('einheit', 'berechnungsart', 'warenstrom'); schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+---------------------+----------------+-----------+-----------+-----------+------------+------------------+---------------------+------------------+------------- public | facttable_stat_fta4 | berechnungsart | f | 0 | 2 | 2 | {n,m} | {0.515167,0.484833} | | 0.509567 public | facttable_stat_fta4 | einheit | f | 0 | 3 | 2 | {EUR,kg} | {0.515167,0.484833} | | 0.491197 public | facttable_stat_fta4 | warenstrom | f | 0 | 2 | 2 | {X,M} | {0.580267,0.419733} | | -0.461344 (3 rows) And for some of them my query is indeed very fast: wdsah=> explain analyze select facttablename, columnname, term, concept_id, t.hidden, language, register from term t where facttablename='facttable_stat_fta4' and columnname='einheit' and exists (select 1 from facttable_stat_fta4 f where f.einheit=t.term ); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=0.00..384860.48 rows=1 width=81) (actual time=0.061..0.119 rows=2 loops=1) -> Index Scan using term_facttablename_columnname_idx on term t (cost=0.00..391.46 rows=636 width=81) (actual time=0.028..0.030 rows=3 loops=1) Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'einheit'::text)) -> Index Scan using facttable_stat_fta4_einheit_idx on facttable_stat_fta4 f (cost=0.00..384457.80 rows=21788970 width=3) (actual time=0.027..0.027 rows=1 loops=3) Index Cond: ((einheit)::text = (t.term)::text) Total runtime: 0.173 ms (6 rows) 0.17 ms. Much faster than a plain select distinct over a table with 43 million rows could ever hope to be. warenstrom is very similar and the columns with more distinct values aren't that bad either. But for column berechnungsart the result is bad: wdsah=> explain analyze select facttablename, columnname, term, concept_id, t.hidden, language, register from term t where facttablename='facttable_stat_fta4' and columnname='berechnungsart' and exists (select 1 from facttable_stat_fta4 f where f.berechnungsart=t.term ); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge Semi Join (cost=316864.57..319975.79 rows=1 width=81) (actual time=7703.917..30948.271 rows=2 loops=1) Merge Cond: ((t.term)::text = (f.berechnungsart)::text) -> Index Scan using term_term_idx on term t (cost=0.00..319880.73 rows=636 width=81) (actual time=7703.809..7703.938 rows=3 loops=1) Filter: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text)) -> Index Scan using facttable_stat_fta4_berechnungsart_idx on facttable_stat_fta4 f (cost=0.00..2545748.85 rows=43577940 width=2) (actual time=0.089..16263.582 rows=21336180 loops=1) Total runtime: 30948.648 ms (6 rows) Over 30 seconds! That's almost 200'000 times slower. The weird thing is that for this particular table einheit and berechnungsart actually have a 1:1 correspondence. Not only is the frequency the same, every row where einheit='kg' has berechnungsart='m' and every row where einheit='EUR' has berechnungsart='n'. So I don't see why two different execution plans are chosen. hp -- _ | Peter J. Holzer | I want to forget all about both belts and |_|_) | | suspenders; instead, I want to buy pants | | | hjp@xxxxxx | that actually fit. __/ | http://www.hjp.at/ | -- http://noncombatant.org/
Attachment:
signature.asc
Description: Digital signature