Setup for large database

"mlartz@xxxxxxxxx" <mlartz@xxxxxxxxx> · 20 Apr 2006 18:02:17 -0700

Couple of questions.  I have a project that promises to generate a very
large database of network-style data (think banners, flows, etc).  I
was fortunate enough to fall into some kick ass hardware (quad Opteron,
16GB RAM, 3+ TB of fibre channel HDs).

As I'm still in the design phase, I was wondering if I could get any
decent recommendations on hardware/partition setup, and perhaps some
database sanity checks.  My use case is mostly datawarehouse-style
stuff:  scheduled bulk batch inserts and lots of queries.  Like I said
before, my rows are based on network data and are all keyed by an IP
address, and I'm hoping to keep each row under 1K.  I'm estimating
ending up with about 3TB of total data after a year of operation.

1) How anal should I be about my hardware setup?  I have about 15 300GB
10K RPM SCSI drives, 4 of which I can directly attach to the server and
the rest one the FC array.  Should I just put the OS and transaction
logs on the direct attached storage and and then RAID10 the rest of
them and be done, or would I significantly benefit from separating out
the indexes and partitioning across tablespaces across drives?  Would
RAID5 across 10+ drives yield acceptable performance numbers?

3) I've currently installed RHEL4 AS for my OS, which I am very
comfortable with.  I was going to go with EXT3 on everything (noatime)
... sound good?

2) Assuming that my data is roughly evenly distributed among IP
addresses, I figured that a naive partitioning based on the first octet
of the IP (i.e. ~255 partitions) would suffice for such a table, making
each partition ~12GB and keeping the IPs clustered to easily to quickly
query network blocks.  Would it be wise to go to even more partitions?
How does Pg do under a *lot* of partitions (655356)?  Would it be wise
to put each partition in a separate tablespace?

3) I guess I don't quite understand Bizgres.  At the moment, it seems
to be just a development beta of Postgres ... is this true?  I realize
that the focus is on BI/ETL stuff, but the current improvements seem to
benefit Postgres as a whole.  Is there currently or can you imagine a
case where a feature in Bizgres won't get integrated into Postgres?
How significant is the fork between Bizgres and Postgres?  I've also
considered taking a look at Bizgres MPP.  I know that its the wrong
forum, but any comments?

4) Not to start any sort of flame war, but my company has an Oracle
license and there are a bunch of people wanting me to go that way.
I've been doing just fine with Postgres at the moment and am quite
comfortable with it, but am being pressured to go with our Oracle
license.  Cost (and prejudices) aside, do you think it would be wise to
go with Oracle to begin with, considering the size of the database that
I'm planning?

Thanks for any comments,
-Mike