Re: Postgresql 9.6 and Big Data

Stephen Frost <sfrost@xxxxxxxxxxx> · Fri, 2 Dec 2016 08:15:09 -0500

Job,

* Job (Job@xxxxxxxxxxxxxxxxxxxx) wrote:
> we are planning to store historically data into a Postgresql 9.6 table.

The question is less about what you're storing in PG and more about what
you're going to be doing with that data.

> We see on Postgresql limit that it could handle "big data".
> In fact, limit for a single table is 32 Tb.

Very few people would advocate storing a single, non-partitioned, 32TB
table in any database system.  As with most other database technologies,
you'll want to partition up your data.  The general rule of thumb is to
partition at the 100s-of-millions of tuples level which tends to lead to
tables which are somewhere between 5G and 100G.  At 100G per table, you
would end up with a few hundred tables to get up to 32TB, which is quite
managable in PG.  Indeed, PG can support many more tables than that, but
putting more than a few hundred into a single inheritance structure
using constraint exclusion for the partitioning can lead to long
planning times.

> We need to archive this data to generate report and analysis views.

Is the historical data changing?  Are the queries changing?  Will you
primairly be performing queries which look at all of the data?

> Anyone has experienced Postgresql 9.6 with Database and table bigger than four or five Terabytes?

Yes.

> Which hardware requirements has been used?

This depends a great deal on what you're actually doing.

> There were some problems or bottleneck with so big data?

If you're looking to run queries against 4 or 5 TBs at a time where all
of the tuples have to be considered (no index-based lookups, etc), then
you'll want to be doing parallel work.  With 9.6, you could try using
the parallel query capability, but even with 9.5 and earlier you could
pretty trivially write your own parallel query system by using multiple
connections to the database and it works very well.  The only downside
to using multiple queries is if you have ongoing changes happening and
you need a consistent view of the data- that's still possible to do with
multiple processes, but you have to do some prep work ahead of time to
extract out the keys for all of the records you want to process in a
given run and store them in a static side-table that the parallel
processes then use.  You have to avoid removing records that are
included in that set, of course, and depending on your requirments you
might wish to avoid updating them also. 

Thanks!

Stephen
Attachment:
signature.asc

Description: Digital signature