On Sun, Sep 11, 2011 at 6:01 PM, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> * Igor Chudov (ichudov@xxxxxxxxx) wrote:
> > Right now I have a personal (one user) project to create a 5-10
> > Terabyte data warehouse. The largest table will consume the most space
> > and will take, perhaps, 200,000,000 rows.
>
> I run data-warehouse databases on that order (current largest single
> instance is ~4TB running under 9.0.4). If the largest table is only
> 200M rows, PG should handle that quite well. Our data is partitioned by
> month and each month is about 200M records and simple queries can run in
> 15-20 minutes (with a single thread), with complex windowing queries
> (split up and run in parallel) finishing in a couple of hours.

Which brings up a question: can I partition data by month (or quarter)
without that month being part of the PRIMARY KEY?

If this question sounds weird, I am asking because MySQL enforces this,
which does not fit my data. If I can keep my primary key as the ID that I
want (which comes with the data), but still partition the table by month,
I will be EXTREMELY happy.
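For concreteness, this is the kind of setup I have in mind (table and
column names invented). As I understand it, PG's inheritance-based
partitioning (the mechanism available in the 8.x/9.x line) puts no such
requirement on the key:

```sql
-- Hypothetical schema: parent table with a PK on id alone, and a
-- monthly child defined by a CHECK constraint (9.0-era inheritance
-- partitioning; the partition column need not appear in the PK).
CREATE TABLE facts (
    id      bigint PRIMARY KEY,
    ts      timestamp NOT NULL,
    payload text
);

CREATE TABLE facts_2011_09 (
    CHECK (ts >= DATE '2011-09-01' AND ts < DATE '2011-10-01')
) INHERITS (facts);

-- Caveat: PRIMARY KEY constraints are not inherited by child tables,
-- so uniqueness of id is enforced per partition, not globally.
```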
> > However, while an hour is fine, two weeks per query is NOT fine.
>
> What's really, really, really useful are two things: EXPLAIN, and this
> mailing list. :) Seriously, run EXPLAIN on your queries before you run
> them and see if how the query is going to be executed makes sense.
> Here's a real easy hint: if it says "External Sort" and has big numbers,
> come talk to us here- that's about one of the worst things you can
> possibly do. Of course, PG's going to avoid doing that, but you may
> have written a query (unintentionally) which forces PG to do a sort, or
> something else.

Very good, thanks.
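To make sure I follow: the query and numbers below are made up, but is
this the kind of thing to watch for in the plan output?

```sql
-- Hypothetical batch query; inspect the plan before running for real.
EXPLAIN ANALYZE
SELECT customer_id, sum(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC;

-- If the sort does not fit in work_mem, the ANALYZE output will
-- contain a line such as:
--   Sort Method: external merge  Disk: 1048576kB
-- i.e. the on-disk ("external") sort to avoid.
```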
> > I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> > CPU cores.
>
> If you partition up your data and don't mind things running in different
> transactions, you can definitely get a speed boost with PG by running
> things in parallel. PG will handle that very well, in fact, if two
> queries are running against the same table, PG will actually combine
> them and only actually read the data from disk once.
>
> > I cannot shell out $47,000 per CPU for Oracle for this project.
>
> The above data warehouse was migrated from an Oracle-based system. :)

I am wondering, why?
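Just to check my understanding of the parallel pattern: something like
this, with each script in its own session and transaction? (Database
name and script names are invented.)

```shell
# Hypothetical: run two independent batch queries in separate
# sessions, then wait for both to finish.
psql -d dw -f monthly_rollup.sql &
psql -d dw -f customer_totals.sql &
wait
```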
> > To be more specific, the batch queries that I would do, I hope,
> > would either use small JOINS of a small dataset to a large dataset, or
> > just SELECTS from one big table.
>
> Make sure that you set your 'work_mem' correctly- PG will use that to
> figure out if it can hash the small table (you want that to happen,
> trust me..). If you do end up having sorts, it'll also use the work_mem
> value to figure out how much memory to use for sorting.

Could I, say, set work_mem to 30 GB? (64-bit Linux)
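(From what I've read in the docs, work_mem is allocated per sort/hash
operation per backend, so 30 GB would multiply fast on a 48 GB box. I
assume the saner pattern is a modest default raised per session for
batch work; the values here are invented:)

```sql
-- postgresql.conf: modest default for all sessions
-- work_mem = '64MB'

-- In the one batch session that needs it (hypothetical value;
-- applies to each sort/hash node in a query, not the whole query):
SET work_mem = '2GB';
```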
> > So... Can Postgres support a 5-10 TB database with the use pattern
> > stated above?
>
> Yes, certainly.

That's great to know.
> Thanks,
> Stephen