By the way, we ran self-joins of that table fairly regularly:
FROM bigtable lhs, bigtable rhs
    AND '' > lhs.timestamp AND lhs.timestamp >= ''
    AND '' > rhs.timestamp AND rhs.timestamp >= ''
    AND lhs.timestamp = rhs.timestamp
    AND lhs.foo = rhs.foo
    AND lhs.bar = rhs.bar
This query leaned heavily on the timestamp index, and we had to be careful to restrict it to only a few days at a time. It took a few minutes each run, but it was definitely doable.
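For anyone trying something similar: before running one of these, it's worth a quick EXPLAIN to confirm the range conditions actually use the timestamp index. A minimal sketch of that check; the date bounds and select list below are placeholders, not the real values:

    -- placeholder bounds and columns, purely for illustration
    EXPLAIN
    SELECT lhs.foo, lhs.bar, lhs.timestamp
    FROM bigtable lhs, bigtable rhs
    WHERE lhs.timestamp >= '2010-05-01' AND '2010-05-04' > lhs.timestamp
      AND rhs.timestamp >= '2010-05-01' AND '2010-05-04' > rhs.timestamp
      AND lhs.timestamp = rhs.timestamp
      AND lhs.foo = rhs.foo
      AND lhs.bar = rhs.bar;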
Once you get this large you do have to be careful with a few things though:
*It's fairly easy to write extremely long-running queries or updates. This can leave lots of dead rows in your tables. Limit your longest-running statements to a day or so; plain queries are unlikely to take that long, but updates over massive date ranges could (one way to chunk such an update is sketched after this list). For scale, SELECT COUNT(*) FROM bigtable took about 30 minutes when the server wasn't under heavy load.
*You sometimes get bad plans because:
**You don't or can't get enough statistics about a column (raising the statistics target, sketched after this list, can help here).
**PostgreSQL doesn't capture statistics about two columns together. PostgreSQL has no way of knowing that columnA = 'foo' implies columnB = 'bar' about 30% of the time.
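On the dead-rows point, one way to keep a big update from running for hours is to chunk it by date range and vacuum between chunks. A rough sketch; the SET clause and bounds are made up:

    -- hypothetical backfill, one day per statement so no single
    -- transaction runs for hours or leaves a huge pile of dead rows
    UPDATE bigtable
       SET bar = 'corrected'
     WHERE bigtable.timestamp >= '2010-05-01'
       AND bigtable.timestamp <  '2010-05-02';
    VACUUM bigtable;
    -- repeat with the next day's bounds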
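And on the per-column statistics, raising the statistics target before ANALYZE can give the planner a better picture of a skewed column. A sketch, with an example column and target value:

    -- hypothetical column and target; more histogram buckets than the default
    ALTER TABLE bigtable ALTER COLUMN foo SET STATISTICS 1000;
    ANALYZE bigtable;

This doesn't fix the cross-column case, but it helps when a single column's distribution is just being sampled too thinly.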
Nik
On Thu, May 27, 2010 at 5:58 AM, Massa, Harald Armin <chef@xxxxxxx> wrote:
Dann,

There really are domains that big, for which no amount of normalization or other processing can mitigate the problem.
Examples:
Microsoft's registered customers database (all MS products bought by any customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project

Please also think of our most risk-exposed users, the ones using Poker / Roulette simulation and analysis software with a PostgreSQL database underneath. There are so many rounds of Poker to play ... :)

Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
Using PostgreSQL is mostly about sleeping well at night.