Re: usage of indexes for inner joins

"Scott Marlowe" <scott.marlowe@xxxxxxxxx> · Mon, 1 Oct 2007 10:16:52 -0500

On 9/27/07, Jan Theodore Galkowski <bayesianlogic@xxxxxxx> wrote:
> I fear this has been asked many times about PostgreSQL, and I have read
> the docs about how indexes are supposed to be defined and used, but I
> don't understand why the engine and optimizer is doing what it does in
> the simplest of situations.  Is it that its tuning is heavily data
> dependent?

This has in fact been discussed many times on the mailing list.
Searching the archives will probably return lots of discussions.

> Like *how* *come*?  There are indexes on both columns of the join.  Is
> it the NUMERIC datatype messing things up?  Unlikely, as I've seen the
> same with INTEGERs.

Postgresql doesn't store "visibility" information in indexes.  this
means that once you find the entry in an index, you then have to see
if it's visible to the current transaction, and that information is
only stored in the tables.  And there are lots of discussions of why
that is in the archives as well.  Basically race conditions make it
impossible to update the table and index concurrently without ugly
locking issues popping up.

So, in pgsql, whether there's an index or not, the db has to hit the
table in the end.

> If it is data dependent (these tables are presently empty), any
> suggestions as to how to tune a database for unknown mixes of data?

No it isn't.  It is range dependent.  If you had a selective enough
where clause then postgresql would choose an index over a sequential
scan.

Your biggest mistake here is thinking the simple solution (use
indexes) is always best.  PostgreSQL uses a cost based planner that
tries to decide ahead of time what plan is going to be fastest.  The
real answer is to give it good information (i.e. run analyze
frequently enough, and have a high enough stats target for the
column(s) you're using)

That means pgsql is paying attention to how big your tables are as
well as what values are in there and what % you're going to get back.

Use explain analyze to see the differences between what the planner
expects and what it gets.  Like this part of your explain analyze
output:

       Seq Scan on foo a  (cost=0.00..11.80 rows=180 width=407) (actual
       time=0.003..0.003 rows=0 loops=1)

note that the planner expected 180 rows but got 0.  that's a sign of
poor stats.  Run analyze and you should see something closer to a
match between expected and actual rows.

Also, try putting some real data in your db, and using a where clause
(unless you really are gonna grab every single row every time...)

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings