HashAggregate slower than sort?

"Jatinder Sangha" <js@xxxxxxxxxxxxx> · Thu, 17 Jun 2010 17:57:15 +0100

Hi,

I've noticed something that I find strange with the hash-aggregate
feature of Postgres. I'm currently running Postgres v8.4.1 on Debian
Linux 64-bit.

I have a simple query that when planned either uses hash-aggregates or a
sort depending on the amount of working memory available. The problem is
that when it uses the hash-aggregates, the query runs 25% slower than
when using the sort method.

The table in question contains about 60 columns, many of which are
boolean, 32-bit integers and some are 64-bit integers. Many fields are
text - and some of these can be quite long (eg 32Kb).

The SQL is as follows:

explain analyse
select distinct T1.*
  from role T1
 where T1.endDate is null and T1.latest=true and T1.active=true and
       T1.deceased=false and T1.desk in (BIG LIST OF INTEGERS);

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu,
compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"
show enable_hashagg --> "on"
set work_mem='8MB'
show work_mem --> "8MB"

Explain analyse of the SQL above:
Unique  (cost=47033.71..48410.27 rows=8881 width=1057) (actual
time=18.803..38.969 rows=6449 loops=1)
  ->  Sort  (cost=47033.71..47055.91 rows=8881 width=1057) (actual
time=18.801..20.560 rows=6449 loops=1)
        Sort Key: id, version, latest, active, deceased, person,
formalnotes, informalnotes, description, desk, rolelevel, roletype,
promotiondate, primaryrole, headofplace, careergrading, startdate,
enddate, percentsalary, deskf, rolelevelf, roletypef, promotiondatef,
primaryrolef, headofplacef, careergradingf, startdatef, enddatef,
percentsalaryf, descriptionf, deskmv, rolelevelmv, roletypemv,
promotiondatemv, primaryrolemv, headofplacemv, careergradingmv,
startdatemv, enddatemv, percentsalarymv, descriptionmv, hasattachments,
hasrelationships, hasprojects, audwho, audwhen, audcreated, costcentre,
reportsto, manages, startdateest, enddateest, hasstarperformers,
projectnames, sourcefrom, sourceto, checkedwho, checkedwhen,
checkednotes, hasqueries, querytitles
        Sort Method:  quicksort  Memory: 2001kB
        ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27
rows=8881 width=1057) (actual time=7.041..12.504 rows=6449 loops=1)
              Recheck Cond: (desk = ANY ('BIG LIST OF
INTEGERS'::bigint[]))
              Filter: ((enddate IS NULL) AND latest AND active AND (NOT
deceased))
              ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37
rows=10984 width=0) (actual time=6.948..6.948 rows=9296 loops=1)
                    Index Cond: ((latest = true) AND (active = true) AND
(deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))
Total runtime: 40.777 ms

This execution of the query used a sort to perform the "distinct".

Now for the second run:

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu,
compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"
show enable_hashagg --> "on"
set work_mem='64MB'
show work_mem --> "64MB"

Explain analyse of the SQL above:
HashAggregate  (cost=43675.63..43764.44 rows=8881 width=1057) (actual
time=46.556..55.694 rows=6449 loops=1)
  ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881
width=1057) (actual time=7.179..13.023 rows=6449 loops=1)
        Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
        Filter: ((enddate IS NULL) AND latest AND active AND (NOT
deceased))
        ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37
rows=10984 width=0) (actual time=7.086..7.086 rows=9296 loops=1)
              Index Cond: ((latest = true) AND (active = true) AND
(deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))
Total runtime: 57.536 ms

I've tested this with v8.4.4 as well with the same results. I also
tested the same query with our previous production version of Postgres
(v8.3.8) and that version only appears to use sorting not
hash-aggregates.

Obviously, I can re-write the query to use a "distinct on (...)" clause
to improve performance - which is what I've done, but my question is:
Why is the hash-aggregate slower than the sort?

Is it something to do with the number of columns? ie. When sorting, the
first few columns defined on the table (id, version) make the row unique
- but when using the hash-aggregate feature, presumably every column
needs to be hashed which takes longer especially for long text fields?

Thanks,
--Jatinder

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance