Here's a weird one I can't figure out: the definitions of several columns of a view, which are not used in a query at all, have a massive effect on the query planner, causing it to choose a seqscan over the largest table in our database when it should be using the primary key for the join.
Background: We've redesigned the tables that hold our primary data, but need to create views that mimic the old design so that our applications will continue working. The largest table ("chemaxon.sdf") holds the bulk of the data, and we've changed it from raw text to gzipped bytea. To mimic the old schema, I created a short Perl function that does a simple gunzip operation, and used that in the definition of the view "str_conntab". (This gzip reduces our total database size to about a third of the original -- it's very effective).
Here are two query plans. The first is horrible. For the second, I removed the gunzip functions and replaced them with constant values. But notice that these pseudo columns are not used anywhere in the query. (Even if they were, I don't understand why this should affect the planner.)
The tables VERSION and VERSION_PROPERTIES are also views; I've included their definitions and the underlying actual tables below.
Postgres 9.6.7 running on Ubuntu 16.04.
emolecules=> drop view str_conntab;
DROP VIEW
emolecules=> create view str_conntab as
emolecules-> (select
emolecules(> id,
emolecules(> length(gunzip(sdf_gzip)) as contab_len,
emolecules(> gunzip(sdf_gzip) as contab_data,
emolecules(> ''::text as normalized
emolecules(> from chemaxon.sdf);
CREATE VIEW
emolecules=> explain analyze
select VERSION.VERSION_ID, VERSION.ISOSMILES, VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA
from VERSION
join VERSION_PROPERTIES on (VERSION.VERSION_ID = VERSION_PROPERTIES.VERSION_ID)
join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)
where VERSION.VERSION_ID in (1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=62.99..162425.77 rows=5 width=60) (actual time=34.718..152828.351 rows=10 loops=1)
-> Nested Loop (cost=62.56..162422.84 rows=6 width=55) (actual time=34.701..152828.289 rows=10 loops=1)
-> Nested Loop (cost=62.14..162419.48 rows=7 width=51) (actual time=34.694..152828.250 rows=10 loops=1)
-> Hash Join (cost=61.72..162415.16 rows=9 width=47) (actual time=34.663..152828.110 rows=10 loops=1)
-> Seq Scan on sdf (cost=0.00..158488.50 rows=281080 width=72) (actual time=33.623..152630.514 rows=281080 loops=1)
-> Hash (cost=61.59..61.59 rows=10 width=43) (actual time=0.028..0.028 rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using smiles_pkey on smiles s (cost=0.42..61.59 rows=10 width=43) (actual time=0.010..0.022 rows=10 loops=1)
Index Cond: (id = ANY ('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))
-> Index Only Scan using parent_pkey on parent p (cost=0.42..0.47 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Only Scan using parent_pkey on parent (cost=0.42..0.47 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Scan using properties_pkey on properties p_1 (cost=0.42..0.48 rows=1 width=21) (actual time=0.003..0.004 rows=1 loops=10)
Index Cond: (id = sdf.id)
Planning time: 1.330 ms
Execution time: 152828.506 ms
(23 rows)
emolecules=> drop view str_conntab;
DROP VIEW
emolecules=> create view str_conntab as
emolecules-> (select
emolecules(> id,
emolecules(> 0::integer contab_len,
emolecules(> null::text as contab_data,
emolecules(> ''::text as normalized
emolecules(> from chemaxon.sdf);
CREATE VIEW
emolecules=> explain analyze
select VERSION.VERSION_ID, VERSION.ISOSMILES, VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA
from VERSION
join VERSION_PROPERTIES on (VERSION.VERSION_ID = VERSION_PROPERTIES.VERSION_ID)
join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)
where VERSION.VERSION_ID in (1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=2.11..156.71 rows=5 width=60) (actual time=0.018..0.096 rows=10 loops=1)
-> Nested Loop (cost=1.69..153.77 rows=6 width=55) (actual time=0.015..0.076 rows=10 loops=1)
-> Nested Loop (cost=1.27..150.41 rows=7 width=51) (actual time=0.012..0.059 rows=10 loops=1)
-> Nested Loop (cost=0.84..146.09 rows=9 width=47) (actual time=0.008..0.037 rows=10 loops=1)
-> Index Scan using smiles_pkey on smiles s (cost=0.42..61.59 rows=10 width=43) (actual time=0.003..0.016 rows=10 loops=1)
Index Cond: (id = ANY ('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))
-> Index Only Scan using sdf_pkey on sdf (cost=0.42..8.44 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = s.id)
Heap Fetches: 10
-> Index Only Scan using parent_pkey on parent p (cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Only Scan using parent_pkey on parent (cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Scan using properties_pkey on properties p_1 (cost=0.42..0.48 rows=1 width=21) (actual time=0.001..0.002 rows=1 loops=10)
Index Cond: (id = sdf.id)
Planning time: 1.251 ms
Execution time: 0.147 ms
(22 rows)
The timing of the second query is excellent, and is what I expected. I don't understand why including a function-defined column in the view would have such a dramatic effect on the planner's ability to choose the sdf_pkey index for the join.
Here are the view and table definitions:
emolecules=> \d+ version
View "registry.version"
Column | Type | Collation | Nullable | Default | Storage | Description
------------+---------+-----------+----------+---------+----------+-------------
version_id | integer | | | | plain |
parent_id | integer | | | | plain |
isosmiles | text | | | | extended |
created | abstime | | | | plain |
View definition:
SELECT s.id AS version_id,
p.parent_id,
s.smiles AS isosmiles,
timenow() AS created
FROM chemaxon.smiles s
emolecules=> \d+ version_properties
View "registry.version_properties"
Column | Type | Collation | Nullable | Default | Storage | Description
-------------------+--------------+-----------+----------+---------+----------+-------------
version_id | integer | | | | plain |
molecular_weight | numeric(8,3) | | | | main |
molecular_formula | text | | | | extended |
mfcd | text | | | | extended |
cas_number | text | | | | extended |
View definition:
SELECT p.id AS version_id,
p.molecular_weight,
p.molecular_formula,
m.mfcd,
c.cas_number
FROM chemaxon.properties p
LEFT JOIN chemaxon.mfcd m USING (id)
LEFT JOIN chemaxon.cas_number c USING (id)
JOIN chemaxon.parent USING (id);
Table "chemaxon.smiles"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | not null | | plain | |
smiles | text | | not null | | extended | |
Indexes:
"smiles_pkey" PRIMARY KEY, btree (id)
"i_unique_smiles" UNIQUE, btree (smiles)
Table "chemaxon.parent"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------+---------+-----------+----------+---------+---------+--------------+-------------
id | integer | | not null | | plain | |
parent_id | integer | | not null | | plain | |
Indexes:
"parent_pkey" PRIMARY KEY, btree (id)
"i_parent_parent_id" btree (parent_id)
emolecules=> \d chemaxon.cas_number
Table "chemaxon.cas_number"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id | integer | | not null |
cas_number | text | | |
Indexes:
"cas_number_pkey" PRIMARY KEY, btree (id)
"i_cas_number_cas_number" btree (cas_number)
emolecules=> \d chemaxon.cas_number
Table "chemaxon.cas_number"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id | integer | | not null |
cas_number | text | | |
Indexes:
"cas_number_pkey" PRIMARY KEY, btree (id)
"i_cas_number_cas_number" btree (cas_number)
And the function "gunzip" is defined in perl (unsafe Perl) as:
create or replace function gunzip(bytea) returns text as
$gunzip$
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
my $compressed = decode_bytea($_[0]);
my $uncompressed;
if (!gunzip(\$compressed, \$uncompressed)) {
return $GunzipError;
}
return $uncompressed;
$gunzip$
language plperlu;
Thanks!
Craig
---------------------------------
Craig A. James
Chief Technology OfficerCraig A. James