Re: Querying distinct values from a large table

"Luke Lonergan" <llonergan@xxxxxxxxxxxxx> · Tue, 30 Jan 2007 08:34:03 -0800

Chad,

On 1/30/07 7:03 AM, "Chad Wagner" <chad.wagner@xxxxxxxxx> wrote:

> On 1/30/07, Luke Lonergan <llonergan@xxxxxxxxxxxxx> wrote:
>> Not that it helps Igor, but we've implemented single pass sort/unique,
>> grouping and limit optimizations and it speeds things up to a single seqscan
>> over the data, from 2-5 times faster than a typical external sort.
> 
> Was that integrated back into PostgreSQL, or is that part of Greenplum's
> offering? 

Not yet.  We will submit it to PostgreSQL along with other executor node
enhancements like hybrid hash agg (fixes the memory overflow problem with
hash agg) and some other great sort work.  These are all "cooked" and in the
Greenplum DBMS, and have already proven themselves on customer workloads
with tens of terabytes of data.

For now, it seems that the "Group By" trick Brian suggested in this thread,
combined with lots of work_mem, may speed things up for this case if HashAgg
is chosen.  Watch out for misestimated stats, though: if the planner
underestimates the number of distinct values, hash agg may allocate more RAM
than expected.
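
To make the difference concrete, here is a toy Python sketch of the two
strategies.  It is only an illustration, not PostgreSQL or Greenplum code;
the function names and the sample data are made up for the example.  The
sort-based version has to order every row before it can drop duplicates,
while the hash-based version makes one pass and keeps one entry per distinct
value, which is why its memory use depends on how many distinct values
there really are.

```python
from itertools import groupby


def distinct_via_sort(rows):
    """Sort/unique style: sort all rows, then collapse adjacent duplicates."""
    return [key for key, _ in groupby(sorted(rows))]


def distinct_via_hash(rows):
    """Hash-aggregate style: a single pass over the input.

    Memory grows with the number of distinct values, not the number of rows,
    so a bad estimate of the distinct count means a bad memory estimate.
    """
    seen = set()
    for value in rows:
        seen.add(value)
    return seen


# Hypothetical sample data standing in for one column of a large table.
rows = [3, 1, 3, 2, 1, 3, 2, 2]

sorted_result = distinct_via_sort(rows)
hashed_result = distinct_via_hash(rows)
```

Both return the same set of values; only the work done to get there differs.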

>> I can't think of a way that indexing would help this situation given the
>> required visibility check of each tuple.
> 
> I agree, using indexes as a "skinny" table is a whole other feature that would
> be nice. 

Yah - I like Hannu's ideas to make visibility less of a problem.  We're
thinking about this too.

- Luke