Re: Querying distinct values from a large table

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Title: Re: [PERFORM] Querying distinct values from a large table
Tom,

On 1/30/07 9:55 PM, "Tom Lane" <tgl@xxxxxxxxxxxxx> wrote:

> Alvaro Herrera <alvherre@xxxxxxxxxxxxxxxxx> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>>> your data distribution. It's not hard to come up with distributions where
>>> it's
>>> 1000x as fast and others where there's no speed difference.)
>
>> So the figure is really "1-1000x"?  I bet this one is more impressive in
>> PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in.  I wish he'd stop doing that...

In this case I was referring to results obtained using grouping and distinct optimizations within sort where the benefit is from the use of a single pass instead of the multiple merge passes for external sort followed by a UNIQUE operator.  In this case, the benefit ranges from 2-5x in many examples as I mentioned: "from 2-5 times faster than a typical external sort".  This is also the same range of benefits we see for this optimization with a popular commercial database. With the limit/sort optimization we have seen more dramatic results, but I think those are less typical cases.

Here are some results for a 1GB table and a simple COUNT(DISTINCT) on a column with 7 unique values from my dual CPU laptop running Greenplum DB (PG 8.2.1 compatible) on both CPUs. Note that my laptop has 2GB of RAM so I have the 1GB table loaded into OS I/O cache.  The unmodified external sort spills the sorted attribute to disk, but that takes little time.  Note that the COUNT(DISTINCT) plan embeds a sort as the transition function in the aggregation node.

=================================================================
================= No Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# select count(distinct l_shipmode) from lineitem;
 count
-------
     7
(1 row)

Time: 37832.308 ms

lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem;                                                                                                            QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=159175.30..159175.31 rows=1 width=8)
   Total 1 rows with 40899 ms to end, start offset by 3.189 ms.
   ->  Gather Motion 2:1  (slice2)  (cost=159175.25..159175.28 rows=1 width=8)
         recv:  Total 2 rows with 39387 ms to first row, 40899 ms to end, start offset by 3.191 ms.
         ->  Aggregate  (cost=159175.25..159175.26 rows=1 width=8)
               Avg 1.00 rows x 2 workers.  Max 1 rows (seg0) with 39367 ms to end, start offset by 22 ms.
               ->  Redistribute Motion 2:2  (slice1)  (cost=0.00..151672.00 rows=3001300 width=8)
                     recv:  Avg 3000607.50 rows x 2 workers.  Max 3429492 rows (seg1) with 0.362 ms to first row, 8643 ms to end, start offset by 23 ms.
                     Hash Key: lineitem.l_shipmode
                     ->  Seq Scan on lineitem  (cost=0.00..91646.00 rows=3001300 width=8)
                           Avg 3000607.50 rows x 2 workers.  Max 3001300 rows (seg0) with 0.049 ms to first row, 2813 ms to end, start offset by 12.998 ms.
 Total runtime: 40903.321 ms
(12 rows)

Time: 40904.013 ms

=================================================================
================= With Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# set mpp_sort_flags=1;
SET
Time: 1.425 ms
lukelonergan=# select count(distinct l_shipmode) from lineitem;
 count
-------
     7
(1 row)

Time: 12846.466 ms

lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=159175.30..159175.31 rows=1 width=8)
   Total 1 rows with 13754 ms to end, start offset by 2.998 ms.
   ->  Gather Motion 2:1  (slice2)  (cost=159175.25..159175.28 rows=1 width=8)
         recv:  Total 2 rows with 13754 ms to end, start offset by 3.000 ms.
         ->  Aggregate  (cost=159175.25..159175.26 rows=1 width=8)
               Avg 1.00 rows x 2 workers.  Max 1 rows (seg0) with 13734 ms to end, start offset by 23 ms.
               ->  Redistribute Motion 2:2  (slice1)  (cost=0.00..151672.00 rows=3001300 width=8)
                     recv:  Avg 3000607.50 rows x 2 workers.  Max 3429492 rows (seg1) with 0.352 ms to first row, 10145 ms to end, start offset by 26 ms.
                     Hash Key: lineitem.l_shipmode
                     ->  Seq Scan on lineitem  (cost=0.00..91646.00 rows=3001300 width=8)
                           Avg 3000607.50 rows x 2 workers.  Max 3001300 rows (seg0) with 0.032 ms to first row, 4048 ms to end, start offset by 13.037 ms.
 Total runtime: 13757.524 ms
(12 rows)

Time: 13758.182 ms

================= Background Information ==============
lukelonergan=# select count(*) from lineitem;
  count  
---------
 6001215
(1 row)

Time: 1661.337 ms
lukelonergan=# \d lineitem
               Table "public.lineitem"
     Column      |         Type          | Modifiers
-----------------+-----------------------+-----------
 l_orderkey      | integer               | not null
 l_partkey       | integer               | not null
 l_suppkey       | integer               | not null
 l_linenumber    | integer               | not null
 l_quantity      | double precision      | not null
 l_extendedprice | double precision      | not null
 l_discount      | double precision      | not null
 l_tax           | double precision      | not null
 l_returnflag    | text                  | not null
 l_linestatus    | text                  | not null
 l_shipdate      | date                  | not null
 l_commitdate    | date                  | not null
 l_receiptdate   | date                  | not null
 l_shipinstruct  | text                  | not null
 l_shipmode      | text                  | not null
 l_comment       | character varying(44) | not null
Distributed by: (l_orderkey)

lukelonergan=# select pg_relation_size(oid)/(1000.*1000.) as MB, relname as Table from pg_class order by MB desc limit 10;
           mb           |             table              
------------------------+--------------------------------
  1009.4755840000000000 | lineitem
   230.0559360000000000 | orders
   146.7678720000000000 | partsupp
    35.8072320000000000 | part
    32.8908800000000000 | customer
     1.9333120000000000 | supplier
     1.1304960000000000 | pg_proc
 0.88473600000000000000 | pg_proc_proname_args_nsp_index
 0.81100800000000000000 | pg_attribute
 0.81100800000000000000 | pg_depend


- Luke  

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux