Re: SELECT DISTINCT very slow

Greg Stark <gsstark@xxxxxxx> · Fri, 10 Jul 2009 13:58:05 +0100

On Fri, Jul 10, 2009 at 1:41 PM, Ben Harper<rogojin@xxxxxxxxx> wrote:
>
> Unfortunately I can't use GROUP BY, because what I'm really doing is
> SELECT DISTINCT ON(unique_field) id FROM table;

You could do that using GROUP BY if you define a first() aggregate. In
this case that would just be SELECT first(id) AS id from (select *
from table ORDER BY unique_field, ...) GROUP BY unique_field. In cases
with more fields it gets tiresome fast.

In this case 8.4 won't actually help you. It only uses hash aggregates
for DISTINCT not DISTINCT ON.

> I'm not familiar with the Postgres internals, but in my own DB system
> that I have written, I do the skip-scanning thing, and for my system
> it was a really trivial optimization to code.

Well things get tricky quickly when you have to deal with concurrent
inserts and potentially page splits from other transactions. Also
consider how hard it is to prove that the query falls into this
category of queries.

> Inside a GIS application, the user wants to categorize the display of
> some information based on, in this case, the suburb name.
> He clicks a button that says "Add All Unique Categories". This is a
> very common operation in this domain.

That doesn't look like what this query is doing to me. It's taking one
exemplar from each suburb based on some other constraint (the minimum
of whatever your order by key is) and taking the id of that data
point. If the order by key doesn't specify a specific data point then
it's a non-deterministic record.

-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general