Re: visualizing B-tree index coverage

"Dann Corbit" <DCorbit@xxxxxxxxx> · Thu, 27 Jan 2005 12:57:07 -0800

>>
Rather than having the most selective item first, put the most
frequently used item first.

So (for instance) if your database is an organic chemistry database and
the columns were elements, then carbon might be a logical first choice,
hydrogen second, etc. for your index.

What you want in creating the index is to choose an index where the
items requested for filtering (in either WHERE clause or join columns)
are very sure to exist somewhere in the early part of the index.

A big problem will arise if (for all indexes) there is no instance where
the first column of the index is referenced in the where clause.

That means you will do a complete table scan.

So create an index so for queries the most probable things to ask for
are earliest in the index.  More important than selectivity is the
probability of the column being included in the filter part of the
request.
<<
-----Original Message-----
From: pgsql-general-owner@xxxxxxxxxxxxxx
[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of TJ O'Donnell
Sent: Thursday, January 27, 2005 7:38 AM
To: PFC
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  visualizing B-tree index coverage

I realize that using OR will not result in an index scan.
I will never be interested in a OR condition for the kinds
of searches I use.  In my Select statements, I always name
every column of the multi-column index in same order that
they were named when creating the index.  I always use
the >= condition, and very rarely, the = condition.

However, I am concerned that I must place
the most selective column first in my index.  I cannot tell,
a priori, which column will be most selective.  That depends on the
nature of search, which can vary widely each time.
Are you saying that if my first column is not selective, even though the
remaining
columns are, the planner may choose not to use the index after
seeing that the first column is not very selective?
That seems like an oversight, IMHO.  Shouldn't the overall effect of
using all the columns be considered before choosing not to use an
index scan?

Since I'm using every column of my multi-column index for every search,
and I always use >=, Explain Analyze always shows that every column
is considered in the index scan.  However, that is only when the
index scan is used.  Sometimes, Explain Analyze shows it is not used.
That appears to happen when my search condition is very general.
This it to be expected, so I am not worried.  Most of my searches will
be intermediate, namely not VERY selective, but also not VERY general.
So the idea of the multi-column index is to "characterize" each row
sufficiently, even when it is a perfectly ordinary row with no ONE
feature being distinctive, but rather several features together giving
it it's distinctive character.  That is my interpretation of the
multi-column index.

TJ

PFC wrote:
> 
>     I think you missed an important "feature" of multicolumn indexes, 
> that  you better not use 'OR' in your expressions. You seem to want
only 
> to use  '>=' so this should be OK.
> 
>     Suppose you have 3 columns a,z,e containing values linearly 
> distributed  between ...
> 
> select min(a),max(a),min(z),max(z),min(e),max(e) from test;
>  min | max | min | max | min | max
> -----+-----+-----+-----+-----+-----
>    0 |  13 |   0 |  99 |   0 |  99
> 
> 
> For instance the following query is indexed :
> explain analyze select * from test where a>=0 and z>=90 and e>=0;
>                                                        QUERY PLAN
>
------------------------------------------------------------------------
------------------------------------------------- 
> 
>  Index Scan using testa on test  (cost=0.00..1637.56 rows=11345 
> width=16)  (actual time=0.085..51.441 rows=13000 loops=1)
>    Index Cond: ((a >= 0) AND (z >= 90) AND (e >= 0))
>  Total runtime: 56.307 ms
> 
> The following is only partially indexed :
> 
> explain analyze select * from test where (a=1 or a=2) and (z=1 or z=8)

> and  e>=0;
>                                                          QUERY PLAN
>
------------------------------------------------------------------------
---------------------------------------------------- 
> 
>  Index Scan using testa, testa on test  (cost=0.00..3269.06 rows=346  
> width=16) (actual time=0.328..52.961 rows=400 loops=1)
>    Index Cond: ((a = 1) OR (a = 2))
>    Filter: (((z = 1) OR (z = 8)) AND (e >= 0))
>  Total runtime: 53.297 ms
> 
>     You see the 'index cond' field which is what determines the
fetched 
> rows,  which are then fetched and filtered with the 'filter'
expression. 
> Having  the most selective index cond is important because it will 
> diminish the  number of rows to be fetched. However, in your case the 
> filter expression  is also beneficial because any row eliminated by
the 
> filter will not need  to go through your expensive matching function.
> 
> In this case :
> 
> SELECT count(*) FROM test;
>     => 131072
> 
> SELECT count(*) FROM test WHERE ((a = 1) OR (a = 2));
>     => 20000
> 
> SELECT count(*) FROM test WHERE (a=1 or a=2) and (z=1 or z=8) and
e>=0;
>     => 400
> 
> In this case the index fetches 20k rows out of 131072 but only 400 are

> used...
> 
> If you don't use OR, index use is more likely :
> 
> explain analyze select * from test where (a,z,e) >= (0,50,80);
>                                                        QUERY PLAN
>
------------------------------------------------------------------------
------------------------------------------------- 
> 
>  Index Scan using testa on test  (cost=0.00..1669.78 rows=12627 
> width=16)  (actual time=0.087..58.316 rows=13000 loops=1)
>    Index Cond: ((a >= 0) AND (z >= 50) AND (e >= 80))
>  Total runtime: 63.049 ms
> 
> Here you have a full index scan.
> 
> To determine the efficiency of your indexes, you can thus use this 
> method,  and look at the 'index cond' and 'filter' expressions, and 
> counting the  rows matched by each.
> 

> 
>> particular number of columns
>> for indexing.  I don't want to use too many, nor too few columns.  I
also
>> want to optimize the nature(which atom types, bond types, etc.)
>> of the count columns.  While I could do this
>> and use the speedup as the measure of success, I think
>> that if my B-tree were "covering" the data well, I would get the best

>> results.
>> Covering means finding that optimal situation where there is not one

>> index for all rows
>> and also not a unique index for every row - something inbetween would

>> be  ideal,
>> or is that basically a wrong idea?

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
      message can get through to the mailing list cleanly