Re: Querying distinct values from a large table

Brian Herlihy <btherl@xxxxxxxxxxxx> · Tue, 30 Jan 2007 06:38:11 -0800 (PST)

As I understand it, DISTINCT never uses hashing, but GROUP BY can.  For GROUP BY the planner will choose between a hash and a sort (or maybe other options?) depending on the circumstances.  So you can write

SELECT a, b FROM tbl GROUP BY a,b

and the de-duplication step (the Sort/Unique part of the plan) may run faster.
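
A quick way to check is to run both forms through EXPLAIN ANALYZE and compare the plans. This is only a sketch against the tbl(a, b) table from the original post. Whether the planner actually picks a hash depends on its row estimates and on work_mem, so the second plan may still show a sort.

```sql
-- Baseline: the plan posted in this thread was Unique -> Sort -> Seq Scan.
EXPLAIN ANALYZE SELECT DISTINCT a, b FROM tbl;

-- Same result set written as GROUP BY. If the planner chooses hashing,
-- the top node is HashAggregate and there is no Sort node beneath it.
EXPLAIN ANALYZE SELECT a, b FROM tbl GROUP BY a, b;
```

If the second plan shows HashAggregate, compare its "Total runtime" line with the first.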

Brian

----- Original Message ----
From: Chad Wagner <chad.wagner@xxxxxxxxx>
To: Simon Riggs <simon@xxxxxxxxxxxxxxx>
Cc: Igor Lobanov <ilobanov@xxxxxxxxxx>; Richard Huxton <dev@xxxxxxxxxxxx>; pgsql-performance@xxxxxxxxxxxxxx
Sent: Tuesday, 30 January, 2007 10:13:27 PM
Subject: Re: [PERFORM] Querying distinct values from a large table

On 1/30/07, Simon Riggs <simon@xxxxxxxxxxxxxxx> wrote:
> explain analyze select distinct a, b from tbl
>
> EXPLAIN ANALYZE output is:
>
>   Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual time=52719.868..56126.356 rows=5390 loops=1)
>     ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual time=52719.865..54919.989 rows=3378864 loops=1)
>           Sort Key: a, b
>           ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941 width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
>   Total runtime: 57307.394 ms

All your time is in the sort, not in the SeqScan.

Increase your work_mem.
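
For reference, work_mem can be raised for the current session only and the query re-run to see whether the sort gets faster. The 256MB figure below is an arbitrary assumption for illustration, not a recommendation. The unit-suffixed form needs PostgreSQL 8.2 or later; older releases take a plain integer in kilobytes.

```sql
-- Assumed value: size it to the memory you can actually spare per sort.
SET work_mem = '256MB';

-- Re-run the original query and compare the Sort node's actual time.
EXPLAIN ANALYZE SELECT DISTINCT a, b FROM tbl;

-- Return to the server default for the rest of the session.
RESET work_mem;
```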

Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash, I guess).  There is no need to push 3M rows through a sort algorithm only to shave them down to 5390 unique records (the planner estimated 1848).

I am assuming this optimization just isn't implemented in PostgreSQL?

-- 
Chad
http://www.postgresqlforums.com/