Ron Mayer wrote:
>> ...The inner sets are on average 3,000 for both id1 and id2, and a
>> typical limit would be 100, so if I could convince postgresql not to
>> fetch all of them, I would reduce the set retrieved by around 60x.
>> The dataset is quite large, so a random query is not very likely to
>> be hitting the same part of the dataset again; there is going to be
>> a fair amount of going to disk.
>
> If disk seeks are killing you, a kinda crazy idea would be to
> duplicate the table - clustering one by (id1) and the other one by
> an index on (id2) - and unioning the results of each.

That's doubling the disk space needed for the table. Is there any
chance this would still help when the initial table by itself already
significantly exceeds available memory?

> Since each of these duplicates of the table will be clustered by the
> column you're querying it on, it should just take one seek in each
> table.
>
> Then your query could be something like
>
> select * from (
>   select * from t1 where id1=2067 order by evalue limit 100
>   union
>   select * from t2 where id2=2067 order by evalue limit 100
> ) as foo order by evalue limit 100;

This is actually what I ended up with as the best-performing query,
just still on a single table, because without duplication I can add
indexes and optimize this one via (id1,evalue) and (id2,evalue). It is
still getting killed quite a lot by disk IO. So I guess I'm up to:

1) Buy a better disk subsystem (I need to get an estimate of how large
   the data set is actually going to get).
2) Stick with one table, but make sure to have enough activity to keep
   a large part of the index in the OS cache anyway (and add more
   memory if necessary).

The data is seeing a fair amount of growth (it doubles in a couple of
years), so I would suspect it is fairly hard to maintain clustering on
it.

Is it possible to get PG to tell me how many rows fit in a disk page?
All columns are sitting in "plain" storage according to \d+ on the
table.

> Hmm.. and I wonder if putting evalue into the criteria to cluster
> the tables too (i.e. cluster on id1,evalue) if you could make it
> so the limit finds the right 100 evalues first for each table....

I didn't cluster it, since clustering "locks everything".

-- 
Jesper

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
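On the rows-per-page question in the thread: the planner statistics in the pg_class catalog expose roughly this. A sketch, assuming a hypothetical table name t and that VACUUM or ANALYZE has run recently (relpages and reltuples are estimates maintained by those commands, not live counts):

```sql
-- relpages  = size of the heap in pages (8 kB each by default)
-- reltuples = estimated row count
SELECT relname,
       relpages,
       reltuples::bigint AS approx_rows,
       (reltuples / NULLIF(relpages, 0))::numeric(10,1) AS rows_per_page
FROM pg_class
WHERE relname = 't';   -- 't' is a placeholder table name
```

For an exact page count rather than an estimate, `pg_relation_size('t') / current_setting('block_size')::int` reads the current file size directly.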