I'm currently benchmarking several RDBMSs with respect to
analytical query performance on medium-sized multidimensional data sets. The
data set contains 30,000,000 fact rows evenly distributed in a multidimensional
space of 9 hierarchical dimensions. Each dimension has 8000 members.

The test query selects about one half of the members from
each dimension, and calculates fact sums grouped by 5 high-level members from
each dimensional hierarchy. (There are actually some additional complications
that make the query end up listing 20 table aliases in the FROM clause, 18 of
which are aliases for 2 physical tables.)

On Oracle the query runs in less than 3 seconds. All steps
have been taken to ensure that Oracle will apply star schema optimization to
the query (e.g. having loads of single-column bitmap indexes). The query plan
reveals that a bitmap merge takes place before fact lookup.

There's a lot of RAM available, and a series of queries has been run in advance
to make sure the required data resides in the cache. This
is confirmed by a very high CPU utilization and virtually no I/O during the
query execution.

I have established similar conditions for the query in
PostgreSQL, and it runs in about 30 seconds. Again the CPU utilization is high,
with no noticeable I/O. The query plan is of course very different from that of
Oracle, since PostgreSQL lacks the bitmap index merge operation. It narrows
down the result one dimension at a time, using the single-column indexes
provided. It is not an option for us to provide multi-column indexes tailored
to the specific query, since we want full freedom as to which dimensions each
query will use.

Are these the results we should expect when comparing
PostgreSQL to Oracle for such queries, or are there special optimization options
for PostgreSQL that we may have overlooked? (I wouldn't be surprised if there
are, since I spent at least 2 full days trying to trigger the star optimization
magic in my Oracle installation.)
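
For reference, the query shape is roughly like the sketch below. All table and
column names here are invented for illustration (the real query uses 9
dimension joins plus the extra aliases mentioned above, 20 aliases in total):

```sql
-- Hypothetical sketch of the query shape; names are made up.
-- fact has 30,000,000 rows; each dimN table has 8000 members.
SELECT d1.high_level_member,
       d2.high_level_member,
       d3.high_level_member,
       SUM(f.amount) AS fact_sum
FROM   fact f
JOIN   dim1 d1 ON d1.member_id = f.dim1_id
JOIN   dim2 d2 ON d2.member_id = f.dim2_id
JOIN   dim3 d3 ON d3.member_id = f.dim3_id
-- ... dim4 through dim9 joined the same way ...
WHERE  d1.member_id IN (/* about half of the 8000 members */)
  AND  d2.member_id IN (/* about half of the 8000 members */)
  AND  d3.member_id IN (/* about half of the 8000 members */)
  -- ... same pattern for the remaining dimensions ...
GROUP  BY d1.high_level_member,
          d2.high_level_member,
          d3.high_level_member;
```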