Title: Re: [PERFORM] Querying distinct values from a large table
Tom,
On 1/30/07 9:55 PM, "Tom Lane" <tgl@xxxxxxxxxxxxx> wrote:
> Alvaro Herrera <alvherre@xxxxxxxxxxxxxxxxx> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>>> your data distribution. It's not hard to come up with distributions where
>>> it's
>>> 1000x as fast and others where there's no speed difference.)
>
>> So the figure is really "1-1000x"? I bet this one is more impressive in
>> PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in. I wish he'd stop doing that...
In this case I was referring to results obtained using grouping and distinct optimizations within sort where the benefit is from the use of a single pass instead of the multiple merge passes for external sort followed by a UNIQUE operator. In this case, the benefit ranges from 2-5x in many examples as I mentioned: "from 2-5 times faster than a typical external sort". This is also the same range of benefits we see for this optimization with a popular commercial database. With the limit/sort optimization we have seen more dramatic results, but I think those are less typical cases.
Here are some results for a 1GB table and a simple COUNT(DISTINCT) on a column with 7 unique values from my dual CPU laptop running Greenplum DB (PG 8.2.1 compatible) on both CPUs. Note that my laptop has 2GB of RAM so I have the 1GB table loaded into OS I/O cache. The unmodified external sort spills the sorted attribute to disk, but that takes little time. Note that the COUNT(DISTINCT) plan embeds a sort as the transition function in the aggregation node.
=================================================================
================= No Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)
Time: 37832.308 ms
lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem; QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 40899 ms to end, start offset by 3.189 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1 width=8)
recv: Total 2 rows with 39387 ms to first row, 40899 ms to end, start offset by 3.191 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 39367 ms to end, start offset by 22 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00 rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492 rows (seg1) with 0.362 ms to first row, 8643 ms to end, start offset by 23 ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00 rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300 rows (seg0) with 0.049 ms to first row, 2813 ms to end, start offset by 12.998 ms.
Total runtime: 40903.321 ms
(12 rows)
Time: 40904.013 ms
=================================================================
================= With Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# set mpp_sort_flags=1;
SET
Time: 1.425 ms
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)
Time: 12846.466 ms
lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 13754 ms to end, start offset by 2.998 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1 width=8)
recv: Total 2 rows with 13754 ms to end, start offset by 3.000 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 13734 ms to end, start offset by 23 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00 rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492 rows (seg1) with 0.352 ms to first row, 10145 ms to end, start offset by 26 ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00 rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300 rows (seg0) with 0.032 ms to first row, 4048 ms to end, start offset by 13.037 ms.
Total runtime: 13757.524 ms
(12 rows)
Time: 13758.182 ms
================= Background Information ==============
lukelonergan=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 1661.337 ms
lukelonergan=# \d lineitem
Table "public.lineitem"
Column | Type | Modifiers
-----------------+-----------------------+-----------
l_orderkey | integer | not null
l_partkey | integer | not null
l_suppkey | integer | not null
l_linenumber | integer | not null
l_quantity | double precision | not null
l_extendedprice | double precision | not null
l_discount | double precision | not null
l_tax | double precision | not null
l_returnflag | text | not null
l_linestatus | text | not null
l_shipdate | date | not null
l_commitdate | date | not null
l_receiptdate | date | not null
l_shipinstruct | text | not null
l_shipmode | text | not null
l_comment | character varying(44) | not null
Distributed by: (l_orderkey)
lukelonergan=# select pg_relation_size(oid)/(1000.*1000.) as MB, relname as Table from pg_class order by MB desc limit 10;
mb | table
------------------------+--------------------------------
1009.4755840000000000 | lineitem
230.0559360000000000 | orders
146.7678720000000000 | partsupp
35.8072320000000000 | part
32.8908800000000000 | customer
1.9333120000000000 | supplier
1.1304960000000000 | pg_proc
0.88473600000000000000 | pg_proc_proname_args_nsp_index
0.81100800000000000000 | pg_attribute
0.81100800000000000000 | pg_depend
- Luke