Can't get two index scans

Craig James <cjames@xxxxxxxxxxxxxx> · Wed, 22 Jun 2016 09:03:35 -0700

I'm working with a third-party plugin that does chemistry. It's very fast. However, I'm trying to do a sampling query, such as the first 1% of the database, and I just can't get the planner to create a good plan.  Here is the full query (the |>| operator does a subgraph match of a molecular substructure, in this case benzene, to find all molecules that have a benzene ring in the database):
explain analyze select * from version where smiles |>| 'c1ccccc1';
 ...
 Index Scan using i_version_smiles on version  (cost=3445.75..147094.03 rows=180283 width=36) (actual time=336.493..10015.753
 rows=180973 loops=1)
   Index Cond: (smiles |>| 'c1ccccc1'::molecule)
 Planning time: 1.228 ms
 Execution time: 10371.903 ms

Ten seconds over 263,000 molecules, which is actually good. Now let's limit it to the first 1% of the rows:

explain analyze select * from version where smiles |>| 'c1ccccc1' and version_id < 897630;
...
 Index Scan using pk_version on version  (cost=0.42..131940.05 rows=1643 width=36) (actual time=6.122..2816.298 rows=2039 loops=1)
   Index Cond: (version_id < 897630)
   Filter: (smiles |>| 'c1ccccc1'::molecule)
   Rows Removed by Filter: 590
 Planning time: 1.217 ms
 Execution time: 2822.117 ms

Notice that it doesn't use the i_version_smiles index at all, but instead applies the very expensive filter |>| to all 1% of the database. So instead of getting a 100x speedup, we only get a 3x speedup, about 30x worse that what is theoretically possible.

The production database is about 50x larger than this test database.

Maybe I misunderstand what's possible with indexes, but it seems to me that it could first do the pk_version index scan, and then use the results of that to do a limited index-scan search using the i_version_smiles index. Is that not possible? Is each index scan "self contained", that is, it doesn't take into account the results of another index scan?

Thanks,
Craig