Re: visualizing B-tree index coverage

TJ O'Donnell wrote:
More specifically, I have character data representing molecular structures.
I've written (rather slow) search functions.  I can create any number of
columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
# single bonds, etc.  I expect my fingerprints will not be unique (fingerprint may
be a poor analogy), but rather will classify similar structures together.
I create a multi-column index on these counts and
get about 2-3 times speedup using 13 columns right now.
For example:
select count(smiles) from structure where  oe_matches(smiles,'c1ccccc1CC(=O)NC')  about 15 sec.

select count(smiles) from structure where
 (_c, _n, _o, _s, _p, _halo,
  _arom_c, _arom_n, _arom_o, _arom_s,
  _atoms, _single_bonds, _other_bonds)  >=
 ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
 and oe_matches(smiles,'c1ccccc1CC(=O)NC')   about 6 seconds
when the (_c, etc.) is a multi-column index.
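For reference, the multi-column index described here would be created along these lines (table and column names taken from the queries above; the actual DDL was not posted, so this is only a sketch):

```sql
-- Hypothetical index definition matching the row comparison
-- in the query above; the real DDL was not included in the post.
CREATE INDEX structure_fingerprint_idx
    ON structure (_c, _n, _o, _s, _p, _halo,
                  _arom_c, _arom_n, _arom_o, _arom_s,
                  _atoms, _single_bonds, _other_bonds);
```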
Is 3 really only a lower bound for the number of carbon atoms (which, I guess, is what _c represents), or will there be exactly 3 carbon atoms in
the molecules found? If it is only a lower bound, can you estimate an upper bound as well as
a lower bound? If you can, then you could further optimize this by using a range query (i.e. select ... from ... where (...) >= (...) and (...) <= (...) and oe_matches(...)).
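Such a range query might look like the following (the upper-bound row is invented purely for illustration; only the lower-bound row appears in the original query):

```sql
-- Sketch of a range query; the upper bounds here are hypothetical.
SELECT count(smiles)
FROM structure
WHERE (_c, _n, _o, _s, _p, _halo,
       _arom_c, _arom_n, _arom_o, _arom_s,
       _atoms, _single_bonds, _other_bonds)
      >= (3, 1, 1, 0, 0, 0,  6, 0, 0, 0, 11,  4,  7)
  AND (_c, _n, _o, _s, _p, _halo,
       _arom_c, _arom_n, _arom_o, _arom_s,
       _atoms, _single_bonds, _other_bonds)
      <= (9, 3, 3, 0, 0, 0,  6, 0, 0, 0, 25, 12, 14)  -- invented bounds
  AND oe_matches(smiles, 'c1ccccc1CC(=O)NC');
```

Note that row-constructor comparison in PostgreSQL is lexicographic, which is why the "first" index columns matter most for narrowing the range.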


If you can only determine bounds (upper _and_ lower) for some of the "fingerprint" columns, then I would suggest you choose those for the index, and use them as the "first" columns (because otherwise a range scan won't work).

"analyze" gathers some statistics about the distribution of values in a table - I guess you could extract at least an approximation of "index coverage" from there.
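One place to look at those statistics is the pg_stats view; for instance, n_distinct gives an estimate of how many distinct values each fingerprint column holds (a sketch; table and column names taken from the queries above):

```sql
-- Inspect per-column statistics gathered by ANALYZE.
-- n_distinct > 0 is an estimated count of distinct values;
-- a negative value is minus a fraction of the row count.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'structure'
  AND attname IN ('_c', '_n', '_atoms', '_single_bonds');
```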

Since the cost of index inserts/updates/deletes is AFAIK largely independent of the number of columns the index consists of, I'd suggest that you add columns until the stats show that it's nearly a unique index - but prefer columns for which you can compute tight upper and lower bounds, and put those "in front".

The data isn't inherently structured in any way that invites some particular number of columns
for indexing.  I don't want to use too many, nor too few columns.  I also
want to optimize the nature (which atom types, bond types, etc.)
of the count columns.  While I could do this
and use the speedup as the measure of success, I think
that if my B-tree were "covering" the data well, I would get the best results.
Covering means finding that optimal situation where there is not one index for all rows
and also not a unique index for every row - something in between would be ideal,
or is that basically a wrong idea?
I believe it depends on the cost of executing one oe_matches call - according to you, that cost is quite heavy, so every call you can omit by adding more columns to the index is worth the performance penalty.

If you need fast insertion of data as well as fast searches, you could add
new data to a "pending" table that is not indexed, and union the searches over both tables. At low-traffic times (say, at night), you could move the data from the "pending" table into the "structure" table. This of course only works if the number of tuples inserted between two move-runs is much smaller than the number of tuples in "structure".
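Sketched out, that scheme could look like this (the "pending" table name is assumed; oe_matches and "structure" are as in the original queries):

```sql
-- Unindexed staging table with the same columns as "structure".
CREATE TABLE pending (LIKE structure);

-- Search both tables; UNION ALL avoids a needless duplicate-removal step.
SELECT count(*) FROM (
    SELECT smiles FROM structure
     WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC')
    UNION ALL
    SELECT smiles FROM pending
     WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC')
) AS both_tables;

-- Nightly move-run: drain "pending" into the indexed table.
BEGIN;
INSERT INTO structure SELECT * FROM pending;
DELETE FROM pending;
COMMIT;
```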


greetings, Florian Pflug


