Re: visualizing B-tree index coverage

"Dann Corbit" <DCorbit@xxxxxxxxx> · Tue, 25 Jan 2005 16:01:42 -0800

Normally, a unique, clustered index is the silver bullet to the best
performance (but my experience with unique clustered indexes is largely
in non-PostgreSQL database systems -- so take it with a grain of salt).

I do not see any extra expense for a unique index verses one that is
mostly unique.  Further, if an index is unique, that should be an
excellent optimizer hint for query acceleration.

If you know what queries you run most frequently, I would tailor the
index for optimal query execution via the join columns and columns often
involved in where clause filtering.

If it is easily possible to make a unique index I would definitely time
the queries with a unique index as well.   I suspect that the unique
index will fare better unless there is something odd about your data.

-----Original Message-----
From: TJ O'Donnell [mailto:tjo@xxxxxxx] 
Sent: Tuesday, January 25, 2005 3:50 PM
To: pgsql-general@xxxxxxxxxxxxxx
Cc: Dann Corbit
Subject: RE:  visualizing B-tree index coverage

Since I'm using a multi-column index, I can greatly influence
the nature of the index created, depending on which columns I use
and how many.  I'm searching for an optimal set
of columns that creates an index that, for sure does not have
every value the same, nor only two values.  Instead, I want to see
how well I've spread the index out over the data (if that phrasing makes
sense).

More specifically, I have character data representing molecular
structures.
I've written (rather slow) search functions.  I can create any number of
columns that "fingerprint" each structure, e.g. # Carbon atoms, # N
atoms,
# single bonds, etc.  I expect my fingerprints will not be unique
(fingerprint may
be a poor analogy), but rather will classify similar structures
together.
I create a multi-column index on these counts and
get about 2-3 times speedup using 13 columns right now.
For example:

select count(smiles) from structure where
oe_matches(smiles,'c1ccccc1CC(=O)NC')  about 15 sec.

select count(smiles) from structure where
 (_c, _n, _o, _s, _p, _halo,
  _arom_c, _arom_n, _arom_o, _arom_s,
  _atoms, _single_bonds, _other_bonds)  >=
 ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
 and oe_matches(smiles,'c1ccccc1CC(=O)NC')   about 6 seconds
when the (_c, etc.) is a multi-column index.

The data isn't inherently structured in any way that invites some
particular number of columns
for indexing.  I don't want to use too many, nor too few columns.  I
also
want to optimize the nature(which atom types, bond types, etc.)
of the count columns.  While I could do this
and use the speedup as the measure of success, I think
that if my B-tree were "covering" the data well, I would get the best
results.
Covering means finding that optimal situation where there is not one
index for all rows
and also not a unique index for every row - something inbetween would be
ideal,
or is that basically a wrong idea?

TJ

> Useful explanation of PostgreSQL index format:
> http://www.faqs.org/docs/ppbook/c13329.htm
>
> I think you are aiming for the wrong thing.
> The worst possible index is one with every value the same.
> The second worst (still basically useless) is one with only two
values. The greater the
> differentiation of the data, the more workload is
> reduced on a search.
>
> Since it isn't a straight binary tree, I don't think that having
highly dissimilar data in the
> index should be a problem.
>
> Do you have data or experience that shows otherwise?
>
> -----Original Message-----
> From: pgsql-general-owner@xxxxxxxxxxxxxx
> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of TJ O'Donnell
Sent: Tuesday, January 25,
> 2005 2:19 PM
> To: pgsql-general@xxxxxxxxxxxxxx
> Cc: tjo@xxxxxxx
> Subject:  visualizing B-tree index coverage
>
> Does anyone know of a tools that allows one to visualize
> the tree created by a multi-column B-tree index?
> A picture of a tree with branches, showing how "branchy" the
> tree is would be great.
> I'm wondering how well I've "clustered" the data in my table
> using the multi-column index.  In other words, do my
> multi-columns sufficiently but not overly discriminate rows from each
other?
> Do I have too many with the same index? (not enough branches)
> Do I have a unique index for each row? (way too many branches)
>
> Thanks,
> TJ
>
>
>
> ---------------------------(end of
broadcast)--------------------------- TIP 7: don't forget to
> increase your free space map settings

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org