Postgres vs other Postgres based MPP implementations

Ondrej Ivanič <ondrej.ivanic@xxxxxxxxx> · Tue, 8 Nov 2011 12:28:47 +1100

Hi,

I have a simple question (which I suspect is not easy to answer): why
is Postgres so slow compared to other Postgres-based MPP products
(even on the same box in a single-node configuration)?

I'm not talking about a multi-node setup; all benchmarks were done on
a single box (CentOS 5.5, 16 cores, 80GB RAM, fast local SAS disks), and
Postgres is slower by one to two orders of magnitude (10 or sometimes
100 times slower). Also, I ran only simple selects and aggregations on
a vertically partitioned star schema, and I did not use features like
columnar tables or replicated dimension tables. I believe that my
Postgres configuration is close to optimal.

Another interesting experiment was to parallelise the query by hand:
select sum(count) from T where org_id = ... and date_in_tz >=
'2011-08-01' and date_in_tz < '2011-11-01'

The query above was rewritten as a series of queries like this:
create table t00 as select sum(count) as count from T where (... previous
where ...) and date_in_tz >= hist[1] and date_in_tz < hist[2]
create table t01 as select sum(count) as count from T where (... previous
where ...) and date_in_tz >= hist[2] and date_in_tz < hist[3]
...
create table t08 as select sum(count) as count from T where (... previous
where ...) and date_in_tz >= hist[8] and date_in_tz < hist[9]
create table t09 as select sum(count) as count from T where (... previous
where ...) and date_in_tz >= hist[9] and date_in_tz < hist[10]

"hist" is similar to pg_stats.histogram_bounds (a list of values that
divide the column's values into groups of approximately equal
population). The idea is to slice table T by its primary key (date_in_tz)
into 10 groups (or N groups) with similar row counts and execute those
CTAS queries in parallel through multiple connections.
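
The slicing step could be scripted roughly as follows. This is a minimal
sketch, not the script used for the benchmarks: slice_statements and
run_parallel are hypothetical helper names, the bounds are assumed to come
from something like pg_stats.histogram_bounds for T.date_in_tz, and the
execute callable is assumed to open its own connection per statement (for
example through a driver such as psycopg2). The elided part of the where
clause is passed in unchanged.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_statements(bounds, base_where, table="T", column="date_in_tz"):
    """Build one CTAS statement per half-open slice [bounds[i], bounds[i+1]).

    Values are interpolated into the SQL text for illustration only;
    the bounds are assumed to be trusted, not user input.
    """
    statements = []
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        statements.append(
            f"create table t{i:02d} as select sum(count) as count from {table} "
            f"where ({base_where}) and {column} >= '{lo}' and {column} < '{hi}'"
        )
    return statements


def run_parallel(statements, execute, workers=10):
    """Run every statement through `execute`, at most `workers` at a time.

    `execute` is supplied by the caller and is assumed to use a separate
    connection per call, since one connection runs one query at a time.
    Results come back in the same order as `statements`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(execute, statements))
```

N + 1 bounds give N slices, so ten tables t00 to t09 need eleven bounds.
The half-open ranges make sure that a row sitting exactly on a bound is
counted in one slice only.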

The final query just sums the partial results:
select sum(count) from (
select count from t00
union all
...
union all
select count from t09
) as x
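
As a sanity check that the sliced sum matches the single query, here is a
small self-contained sketch. It uses an in-memory SQLite database purely
as a stand-in so that it runs without a Postgres server; the rows are made
up, the org_id filter is left out, and total_from_slices and demo are
hypothetical names.

```python
import sqlite3


def total_from_slices(conn, bounds, table="T", column="date_in_tz"):
    """Create one table per half-open slice, then sum the partial sums."""
    names = []
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        name = f"t{i:02d}"
        conn.execute(
            f"create table {name} as select sum(count) as count from {table} "
            f"where {column} >= '{lo}' and {column} < '{hi}'"
        )
        names.append(name)
    union = " union all ".join(f"select count from {n}" for n in names)
    # An empty slice stores NULL, which the outer sum() ignores.
    return conn.execute(f"select sum(count) from ({union}) as x").fetchone()[0]


def demo():
    """Return (single-query total, sliced total) for a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table T (date_in_tz text, count integer)")
    conn.executemany(
        "insert into T values (?, ?)",
        [
            ("2011-08-01", 1),
            ("2011-08-15", 2),
            ("2011-09-01", 4),
            ("2011-10-31", 8),
            ("2011-11-01", 16),  # outside the queried range
        ],
    )
    single = conn.execute(
        "select sum(count) from T "
        "where date_in_tz >= '2011-08-01' and date_in_tz < '2011-11-01'"
    ).fetchone()[0]
    sliced = total_from_slices(
        conn, ["2011-08-01", "2011-09-01", "2011-10-01", "2011-11-01"]
    )
    return single, sliced
```

Both totals agree because the slices are disjoint and together cover the
whole queried range.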

Taken together, all of the above was faster than the single query at the
beginning.

For the original query, disk activity was lower and spread over a longer
time span than in the sliced scenario. Maybe there is no need for a
fairly complex query rewrite, and queries could be sped up by
asynchronous parallel reads...

-- 
Ondrej Ivanic
(ondrej.ivanic@xxxxxxxxx)

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)