On Sun, Sep 11, 2011 at 6:01 PM, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> * Igor Chudov (ichudov@xxxxxxxxx) wrote:
> > Right now I have a personal (one user) project to create a 5-10
> > Terabyte data warehouse. The largest table will consume the most space
> > and will take, perhaps, 200,000,000 rows.
>
> I run data-warehouse databases on that order (current largest single
> instance is ~4TB running under 9.0.4). If the largest table is only
> 200M rows, PG should handle that quite well. Our data is partitioned by
> month and each month is about 200M records and simple queries can run in
> 15-20 minutes (with a single thread), with complex windowing queries
> (split up and run in parallel) finishing in a couple of hours.

Which brings up a question: can I partition data by month (or quarter)
without that month being part of the PRIMARY KEY?

If this question sounds weird, I am asking because MySQL enforces this,
which does not fit my data. If I can keep my primary key as the ID that I
want (which comes with the data), but still partition the table by month,
I will be EXTREMELY happy.
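For concreteness, this is the kind of setup I have in mind (table and
column names invented). As I understand it, PG's inheritance-based
partitioning (the mechanism available in the 8.x/9.x line) puts no such
requirement on the key:

```sql
-- Hypothetical schema: parent table with a PK on id alone, and a
-- monthly child defined by a CHECK constraint (9.0-era inheritance
-- partitioning; the partition column need not appear in the PK).
CREATE TABLE facts (
    id      bigint PRIMARY KEY,
    ts      timestamp NOT NULL,
    payload text
);

CREATE TABLE facts_2011_09 (
    CHECK (ts >= DATE '2011-09-01' AND ts < DATE '2011-10-01')
) INHERITS (facts);

-- Caveat: PRIMARY KEY constraints are not inherited by child tables,
-- so uniqueness of id is enforced per partition, not globally.
```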
> > However, while an hour is fine, two weeks per query is NOT fine.
>
> What's really, really, really useful are two things: EXPLAIN, and this
> mailing list. :) Seriously, run EXPLAIN on your queries before you run
> them and see if how the query is going to be executed makes sense.
> Here's a real easy hint: if it says "External Sort" and has big numbers,
> come talk to us here- that's about one of the worst things you can
> possibly do. Of course, PG's going to avoid doing that, but you may
> have written a query (unintentionally) which forces PG to do a sort, or
> something else.

Very good, thanks.
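To make sure I follow: the query and numbers below are made up, but is
this the kind of thing to watch for in the plan output?

```sql
-- Hypothetical batch query; inspect the plan before running for real.
EXPLAIN ANALYZE
SELECT customer_id, sum(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC;

-- If the sort does not fit in work_mem, the ANALYZE output will
-- contain a line such as:
--   Sort Method: external merge  Disk: 1048576kB
-- i.e. the on-disk ("external") sort to avoid.
```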
> > I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> > CPU cores.
>
> If you partition up your data and don't mind things running in different
> transactions, you can definitely get a speed boost with PG by running
> things in parallel. PG will handle that very well, in fact, if two
> queries are running against the same table, PG will actually combine
> them and only actually read the data from disk once.
>
> > I cannot shell out $47,000 per CPU for Oracle for this project.
>
> The above data warehouse was migrated from an Oracle-based system. :)

I am wondering, why?
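Just to check my understanding of the parallel pattern: something like
this, with each script in its own session and transaction? (Database
name and script names are invented.)

```shell
# Hypothetical: run two independent batch queries in separate
# sessions, then wait for both to finish.
psql -d dw -f monthly_rollup.sql &
psql -d dw -f customer_totals.sql &
wait
```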
> > To be more specific, the batch queries that I would do, I hope,
> > would either use small JOINS of a small dataset to a large dataset, or
> > just SELECTS from one big table.
>
> Make sure that you set your 'work_mem' correctly- PG will use that to
> figure out if it can hash the small table (you want that to happen,
> trust me..). If you do end up having sorts, it'll also use the work_mem
> value to figure out how much memory to use for sorting.

Could I, say, set work_mem to 30 GB? (64-bit Linux)
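(From what I've read in the docs, work_mem is allocated per sort/hash
operation per backend, so 30 GB would multiply fast on a 48 GB box. I
assume the saner pattern is a modest default raised per session for
batch work; the values here are invented:)

```sql
-- postgresql.conf: modest default for all sessions
-- work_mem = '64MB'

-- In the one batch session that needs it (hypothetical value;
-- applies to each sort/hash node in a query, not the whole query):
SET work_mem = '2GB';
```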
> > So... Can Postgres support a 5-10 TB database with the use pattern
> > stated above?
>
> Yes, certainly.

That's great to know.
> Thanks,
> Stephen