> More specifically, I have character data representing molecular
> structures. I've written (rather slow) search functions. I can create
> any number of columns that "fingerprint" each structure, e.g. # carbon
> atoms, # N atoms, # single bonds, etc. I expect my fingerprints will
> not be unique (fingerprint may be a poor analogy), but rather will
> classify similar structures together. I create a multi-column index on
> these counts and get about a 2-3 times speedup using 13 columns right
> now. For example:
>
>   select count(smiles) from structure
>    where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
>
> takes about 15 sec, while
>
>   select count(smiles) from structure
>    where (_c, _n, _o, _s, _p, _halo,
>           _arom_c, _arom_n, _arom_o, _arom_s,
>           _atoms, _single_bonds, _other_bonds)
>          >= (3,1,1,0,0,0, 6,0,0,0, 11,4,7)
>      and oe_matches(smiles, 'c1ccccc1CC(=O)NC')
>
> takes about 6 seconds when (_c, etc.) is a multi-column index.

Is 3 really only a lower bound for the number of carbon atoms (which, I
guess, is what _c represents), or will there be exactly 3 carbon atoms
in the molecules found? If not, can you estimate an upper bound as well
as a lower bound? If you can, then you could further optimize this by
using a range query (i.e. select ... from ... where (...) >= (...) and
(...) <= (...) and oe_matches(...)).
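For illustration, such a range query might look like the sketch below.
The upper bounds (9, 2, 3) are made-up values - the real ones would have
to come from your analysis of the search pattern - and only the first
three fingerprint columns are shown:

  select count(smiles)
    from structure
   where (_c, _n, _o) >= (3, 1, 1)   -- lower bounds, as in your query
     and (_c, _n, _o) <= (9, 2, 3)   -- hypothetical upper bounds
     and oe_matches(smiles, 'c1ccccc1CC(=O)NC');

With both bounds given, the scan no longer has to run from the lower
bound all the way to the end of the index.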
If you can only determine bounds (upper _and_ lower) for some of the
"fingerprint" columns, then I would suggest you choose those for the
index, and use them as the "first" columns (because otherwise a
range-scan won't work) - see the sketch below.
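A sketch, reusing the column names from your example and assuming that
_c and _atoms are the columns with known upper and lower bounds:

  -- columns with both bounds known go first, so that a
  -- B-tree range scan can actually make use of them
  create index structure_fp_idx on structure
      (_c, _atoms,
       _n, _o, _s, _p, _halo,
       _arom_c, _arom_n, _arom_o, _arom_s,
       _single_bonds, _other_bonds);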
"analyze" gathers some statistics about the distribution of values in a table - I guess you could extract at least an approximation of "index coverage" from there.
Since there cost of index inserts/updates/deletes is afaik largly independent from the number of columns the index consists of, I'd suggest that you add columns until the stats show that it's nearly an unique index - but prefer columns for which you can compute strong upper & lower bounds, and put those "in front".
I believe it depends on the costs of executing one oe_matches call - according to you, those costs are quite heavy, so I every call you can omit by adding more columns to the index is worth the performance penalty.The data isn't inherently structured in any way that invites some particular number of columns for indexing. I don't want to use too many, nor too few columns. I also want to optimize the nature(which atom types, bond types, etc.) of the count columns. While I could do this and use the speedup as the measure of success, I think that if my B-tree were "covering" the data well, I would get the best results. Covering means finding that optimal situation where there is not one index for all rows and also not a unique index for every row - something inbetween would be ideal, or is that basically a wrong idea?
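One way to peek at those statistics is the pg_stats view (a sketch; the
column names '_c', '_n', '_atoms' are just taken from your example):

  analyze structure;

  -- n_distinct near -1 means nearly every row has a distinct value
  -- for that column; a small positive number means many rows share
  -- the same value
  select attname, n_distinct
    from pg_stats
   where tablename = 'structure'
     and attname in ('_c', '_n', '_atoms');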
If you need fast adding of data, as well as fast searches, you could add
the data into a "pending" table that is not indexed, and union the
searches over both tables. At low-traffic times (say, at night), you
could move the data from the "pending" table to the "structure" table.
This of course only works if the number of tuples inserted between two
move-runs is much smaller than the number of tuples in "structure".
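A minimal sketch of that scheme (the table name "pending" and the
single-transaction move-run are my assumptions):

  -- unindexed staging table with the same columns as "structure"
  create table pending (like structure);

  -- searches consult both tables; scanning "pending" sequentially
  -- stays cheap as long as it holds few rows
  select count(*) from (
      select smiles from structure
       where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
      union all
      select smiles from pending
       where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
  ) as both_tables;

  -- move-run, done in one transaction so a concurrent search never
  -- sees a tuple in both tables at once, or in neither
  begin;
  insert into structure select * from pending;
  delete from pending;
  commit;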
greetings, Florian Pflug