Re: Digesting explain analyze

On Thu, 7 Jan 2010, Jesper Krogh wrote:
If disk seeks are killing you, a kinda crazy idea would be to
duplicate the table - clustering one by (id1) and
the other one by an index on (id2) - and unioning the
results of each.

That's doubling the disk space needs for the table. What are the odds
that this would be a benefit when the initial table significantly
exceeds available memory by itself?

If the table already greatly exceeds the available RAM, then doubling the amount of data won't make a massive difference to performance. You're going to disc for everything anyway.

Since each of these duplicates of the table will be clustered
by the column you're querying it on, it should just take one
seek in each table.

Then your query could be something like

  select * from (
    (select * from t1 where id1=2067 order by evalue limit 100)
    union
    (select * from t2 where id2=2067 order by evalue limit 100)
  ) as foo order by evalue limit 100;

This is actually what I ended up with as the best-performing query, just
still on a single table, because without duplication I can add indexes and
optimize this one with (id1,evalue) and (id2,evalue). It is still getting
killed quite a lot by disk IO. So I guess I'm left with:

You're kind of missing the point. The crucial step in the above suggestion is to cluster the table on the index. This will mean that all the rows that are fetched together are located together on disc, and you will no longer be killed by disc IO.
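
As a minimal sketch (the index name here is made up), clustering on the
composite index you mentioned would look like:

  CREATE INDEX t1_id1_evalue_idx ON t1 (id1, evalue);
  -- Rewrites the table in index order; takes an exclusive lock while it runs.
  CLUSTER t1 USING t1_id1_evalue_idx;
  -- Refresh planner statistics, since the physical row order has changed.
  ANALYZE t1;

After that, rows with the same id1 sit next to each other on disc, so
fetching the top 100 by evalue touches a few contiguous pages rather than
a hundred scattered ones.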

1) Buy better disks (I need to get an estimate of how large the table is
actually going to get).

Unless you cluster, you are still going to be limited by the rate at which the discs can seek. Postgres 8.4 has some improvements here for bitmap index scans if you have a RAID array and set effective_io_concurrency correctly.
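
For example (the value is just a placeholder; a common starting point is
the number of drives in the array):

  -- Enables read-ahead prefetching during bitmap heap scans (8.4+).
  SET effective_io_concurrency = 4;

It can also be set in postgresql.conf to apply to all sessions.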

2) Stick with one table, but make sure to have enough activity to get a
large part of the index into the OS cache anyway (and add more memory if
necessary).

In order to win here, you will need to make memory at least as big as the commonly-accessed parts of the database. This could get expensive.
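
You can get a feel for the numbers involved by asking Postgres how big
the table and its indexes actually are (substitute your own table name):

  -- Total size including indexes and TOAST data, and the heap alone.
  SELECT pg_size_pretty(pg_total_relation_size('table1')) AS total,
         pg_size_pretty(pg_relation_size('table1')) AS heap_only;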

I didn't cluster it, since clustering "locks everything".

You can also test out the hypothesis by copying the table instead:

CREATE TABLE test1 AS SELECT * FROM table1 ORDER BY id1;

Then create an index on id1, and test against that table. The copy will become out of date quickly, but it will allow you to see whether the performance benefit is worth it. It will also tell you how long a cluster would actually take, without locking anything.
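
The test might look something like this (index name made up, reusing the
composite columns and the id from your earlier query):

  CREATE INDEX test1_id1_evalue_idx ON test1 (id1, evalue);
  ANALYZE test1;

  -- Compare this plan and timing against the same query on the original table.
  EXPLAIN ANALYZE
  SELECT * FROM test1 WHERE id1 = 2067 ORDER BY evalue LIMIT 100;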

Matthew

--
In the beginning was the word, and the word was unsigned,
and the main() {} was without form and void...

