On Sun, Oct 5, 2008 at 7:48 PM, Sean Davis <sdavis2@xxxxxxxxxxxx> wrote:
> I am looking at the prospect of building a data warehouse of genomic
> sequence data. The machine that produces the data adds about
> 300 million rows per month in a central fact table and we will
> generally want the data to be "online". We don't need instantaneous
> queries, but we would be using the data for data mining purposes and
> running some "real-time" queries for reporting and research purposes.
> I have had the pleasure of working on a Netezza box where this type
> of thing is quite standard, but we don't have that access anymore, so
> I'm looking for hints on using postgres in a data warehousing/mining
> environment. Any suggestions on DDL, loading, backup, indexing,
> or (to a certain extent) hardware?

I assume you're familiar with stuff like star schemas.

For loading, look at pg_bulkload and plain COPY, which is far faster
than row-by-row INSERTs (quick sketch at the end of this message).

For indexing, remember that you have partial and functional indexes in
PostgreSQL, and they can come in quite handy (examples below).

For backup of large, changing databases, look into PITR (point-in-time
recovery): archive the WAL and take periodic base backups (sketch
below).

As for hardware, you need enough CPU horsepower and memory to handle
however many users will be running simultaneous queries, but the drive
subsystem is usually more important. Throwing drives, a battery-backed
cache, and a good RAID controller at the problem can make a big
difference. RAID-10 is usually preferred, as writes are much faster.
If you're really squeezed for space and money you can use RAID-5, but
it has seriously negative performance implications for parallel load
handling and write speed.
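
Here's a minimal sketch of a bulk load with COPY; the table, columns,
and file path are all made up for illustration, not from your schema:

    -- Hypothetical staging table for one sequencer run
    CREATE TABLE reads_staging (
        run_id     integer,
        read_name  text,
        sequence   text,
        quality    text
    );

    -- COPY pulls in a whole tab-delimited file in one statement,
    -- much faster than individual INSERTs. The path is read on the
    -- server; use \copy from psql for a client-side file.
    COPY reads_staging FROM '/data/incoming/run_1234.tsv';

Loading into an unindexed staging table and then moving rows into the
fact table keeps index maintenance out of the hot path.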
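
Examples of partial and functional indexes, again with hypothetical
table and column names:

    -- Partial index: only indexes the rows you actually query, so it
    -- stays small even on a fact table growing 300M rows a month
    CREATE INDEX idx_reads_flagged ON reads (run_id)
        WHERE status = 'flagged';

    -- Functional (expression) index: lets queries that filter on
    -- lower(read_name) use an index instead of scanning the table
    CREATE INDEX idx_reads_lower_name ON reads (lower(read_name));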
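
And the rough shape of a PITR setup: set archive_mode = on and an
archive_command in postgresql.conf (e.g. copying completed WAL
segments to an archive directory of your choosing), then bracket each
base backup like this:

    -- Mark the start of a base backup (label is arbitrary)
    SELECT pg_start_backup('weekly_base');
    -- ... tar or rsync the data directory to backup storage ...
    SELECT pg_stop_backup();

With the WAL archive plus a base backup you can recover to any point
in time since that backup, which beats dumping a multi-billion-row
database every night.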