Doing it at low scales is not attractive.
Commercial databases are publishing at scale factor of 1000(about 1TB)
to 10000(10TB) with one in 30TB space. So ideally right now tuning
should start at 1000 scale factor.
Unfortunately I have tried that before with PostgreSQL the few of the
problems are as follows:
Single stream loader of PostgreSQL takes hours to load data. (Single
stream load... wasting all the extra cores out there)
Multiple table loads ( 1 per table) spawned via script is bit better
but hits wal problems.
To avoid wal problems, I had created tables and load statements within
the same transaction, faster but cannot create index before load or it
starts writing to wal... AND if indexes are created after load, it takes
about a day or so to create all the indices required. (Its single
threaded and creating multiple indexes/indices at the same time could
result in running out of temporary "DISK" space since the tables are so
big. Which means 1 thread at a time is the answer for creating tables
that are really big. It is slow.
Boy, by this time most people doing TPC-H in high end give up on
PostgreSQL.
I have not even started Partitioning of tables yet since with the
current framework, you have to load the tables separately into each
tables which means for the TPC-H data you need "extra-logic" to take
that table data and split it into each partition child table. Not stuff
that many people want to do by hand.
Then for the power run that is essentially running one query at a time
should essentially be able to utilize the full system (specially
multi-core systems), unfortunately PostgreSQL can use only one core.
(Plus since this is read only and there is no separate disk reader all
other processes are idle) and system is running at 1/Nth capacity (where
N is the number of cores/threads)
(I am not sure here with Partitioned tables, do you get N processes
running in the system when you scan the partitioned table?)
Even off-loading work like "fetching the data into bufferpool" into
separate processes will go big time with this type of workloads.
I would be happy to help out if folks here want to do work related to
it. Infact if you have time, I can request a project in one of the Sun
Benchmarking center to see what we can learn with community members
interested in understanding where PostgreSQL performs and fails.
Regards,
Jignesh
Greg Smith wrote:
On Mon, 4 Feb 2008, Simon Riggs wrote:
Would anybody like to repeat these tests with the latest production
versions of these databases (i.e. with PGSQL 8.3)
Do you have any suggestions on how people should run TPC-H? It looked
like a bit of work to sort through how to even start this exercise.
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match