> More specifically, I have character data representing molecular
> structures. I've written (rather slow) search functions. I can create
> any number of columns that "fingerprint" each structure, e.g. # carbon
> atoms, # N atoms, # single bonds, etc. I expect my fingerprints will
> not be unique (fingerprint may be a poor analogy), but rather will
> classify similar structures together. I create a multi-column index on
> these counts and get about a 2-3 times speedup using 13 columns right
> now. For example:
>
>   select count(smiles) from structure
>    where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
>
> takes about 15 sec, while
>
>   select count(smiles) from structure
>    where (_c, _n, _o, _s, _p, _halo,
>           _arom_c, _arom_n, _arom_o, _arom_s,
>           _atoms, _single_bonds, _other_bonds)
>          >= (3,1,1,0,0,0, 6,0,0,0, 11,4,7)
>      and oe_matches(smiles, 'c1ccccc1CC(=O)NC')
>
> takes about 6 seconds when (_c, etc.) is a multi-column index.

Is 3 really only a lower bound for the number of carbon atoms (which, I
guess, is what _c represents), or will there be exactly 3 carbon atoms
in the molecules found? If not, can you estimate an upper bound as well
as a lower bound? If you can, then you could further optimize this by
using a range query (i.e. select ... from ... where (...) >= (...) and
(...) <= (...) and oe_matches(...)).
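For illustration, such a range query might look like the sketch below.
The upper bounds (9, 2, 3) are made-up values - the real ones would have
to come from your analysis of the search pattern - and only the first
three fingerprint columns are shown:

  select count(smiles)
    from structure
   where (_c, _n, _o) >= (3, 1, 1)   -- lower bounds, as in your query
     and (_c, _n, _o) <= (9, 2, 3)   -- hypothetical upper bounds
     and oe_matches(smiles, 'c1ccccc1CC(=O)NC');

With both bounds given, the scan no longer has to run from the lower
bound all the way to the end of the index.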
If you can only determine bounds (upper _and_ lower) for some of the
"fingerprint" columns, then I would suggest you choose those for the
index, and use them as the "first" columns (because otherwise a
range-scan won't work) - see the sketch below.
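A sketch, reusing the column names from your example and assuming that
_c and _atoms are the columns with known upper and lower bounds:

  -- columns with both bounds known go first, so that a
  -- B-tree range scan can actually make use of them
  create index structure_fp_idx on structure
      (_c, _atoms,
       _n, _o, _s, _p, _halo,
       _arom_c, _arom_n, _arom_o, _arom_s,
       _single_bonds, _other_bonds);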
"analyze" gathers some statistics about the distribution of values in a table - I guess you could extract at least an approximation of "index coverage" from there.
Since there cost of index inserts/updates/deletes is afaik largly independent from the number of columns the index consists of, I'd suggest that you add columns until the stats show that it's nearly an unique index - but prefer columns for which you can compute strong upper & lower bounds, and put those "in front".
I believe it depends on the costs of executing one oe_matches call - according to you, those costs are quite heavy, so I every call you can omit by adding more columns to the index is worth the performance penalty.The data isn't inherently structured in any way that invites some particular number of columns for indexing. I don't want to use too many, nor too few columns. I also want to optimize the nature(which atom types, bond types, etc.) of the count columns. While I could do this and use the speedup as the measure of success, I think that if my B-tree were "covering" the data well, I would get the best results. Covering means finding that optimal situation where there is not one index for all rows and also not a unique index for every row - something inbetween would be ideal, or is that basically a wrong idea?
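One way to peek at those statistics is the pg_stats view (a sketch; the
column names '_c', '_n', '_atoms' are just taken from your example):

  analyze structure;

  -- n_distinct near -1 means nearly every row has a distinct value
  -- for that column; a small positive number means many rows share
  -- the same value
  select attname, n_distinct
    from pg_stats
   where tablename = 'structure'
     and attname in ('_c', '_n', '_atoms');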
If you need fast adding of data, as well as fast searches, you could add
the data into a "pending" table that is not indexed, and union the
searches over both tables. At low-traffic times (say, at night), you
could move the data from the "pending" table to the "structure" table.
This of course only works if the number of tuples inserted between two
move-runs is much smaller than the number of tuples in "structure".
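A minimal sketch of that scheme (the table name "pending" and the
single-transaction move-run are my assumptions):

  -- unindexed staging table with the same columns as "structure"
  create table pending (like structure);

  -- searches consult both tables; scanning "pending" sequentially
  -- stays cheap as long as it holds few rows
  select count(*) from (
      select smiles from structure
       where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
      union all
      select smiles from pending
       where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
  ) as both_tables;

  -- move-run, done in one transaction so a concurrent search never
  -- sees a tuple in both tables at once, or in neither
  begin;
  insert into structure select * from pending;
  delete from pending;
  commit;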
greetings, Florian Pflug