On Jun 10, 2007, at 2:14 PM, Joe Conway wrote:
Bill Moran wrote:
Tom Allison <tom@xxxxxxxxxxx> wrote:
If the user base gets to 100 or more, I'll be hitting a billion
rows before too long. I add about 70,000 rows per user per day.
At 100 users this is 7 million rows per day. I'll hit a billion
in 142 days, call it six months for simplicity.
The table itself is small (two columns: bigint, int) but I'm
wondering when I'll start to hit a knee in performance and how I
can monitor that. I know where I work (day job) they have Oracle
tables with a billion rows that just plain suck. I don't know
if a billion is bad or if the DBAs were not given the right
opportunity to make their tables work.
But if they are any indication, I'll be feeling some hurt when I
exceed a billion rows. Am I going to just fold up and die in six
months?
A lot depends on your specific use case.
- Will you be just storing the data for archival purposes, or
frequently querying the data?
- If you need to run queries, are they well bounded to certain
subsets of the data (e.g. a particular range of time for a
particular user) or are they aggregates across the entire billion
rows?
- Is the data temporal in nature, and if so do you need to purge it
after some period of time?
As an example, I have an application with temporal data that needs
periodic purging and is typically queried for small time ranges
(tens of minutes). We have set up partitioned tables (partitioned
by date range and data source -- akin to your users) using
constraint exclusion that contain 3 or 4 billion rows (total of all
partitions), and we have no problem at all with performance. But I
suspect that if we needed to do an aggregate across the entire
thing it would not be particularly fast ;-)
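The general shape of that, heavily simplified (table and column names
invented for illustration here, not our actual schema), looks like:
-- parent table; holds no rows itself
create table readings (
  source_id integer not null,   -- akin to your users
  logged_at timestamp not null,
  value numeric
);
-- one child per (month, source), with CHECK constraints that let
-- constraint exclusion skip partitions that cannot match a query
create table readings_2007_06_s1 (
  check (source_id = 1),
  check (logged_at >= '2007-06-01' and logged_at < '2007-07-01')
) inherits (readings);
create index readings_2007_06_s1_ts on readings_2007_06_s1 (logged_at);
-- with constraint_exclusion enabled, a bounded query against the
-- parent only scans the children whose constraints can satisfy it
set constraint_exclusion = on;
select *
from readings
where source_id = 1
and logged_at >= '2007-06-10 14:00'
and logged_at < '2007-06-10 14:30';
-- purging an expired range is then just a DROP TABLE on that child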
Why not just create a simulation of 100 users and run it as hard
as you can until it starts to degrade? Then you'll have some
real-world experience to tell you how much you can handle.
This is good advice. Without much more detail, folks on the list
won't be able to help much, but with a simulation such as this you
can answer your own question...
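For the bulk-loading side of a test like that, generate_series makes
it cheap to fake the volume; something along these lines (the table
name is just a placeholder, and you'd want your real indexes in place
before timing anything):
-- roughly one simulated day: 100 users * 70,000 rows = 7 million rows
create table test_second (user_idx integer not null, item_idx bigint not null);
insert into test_second (user_idx, item_idx)
select u, (u::bigint * 1000000) + i
from generate_series(1, 100) as u,      -- 100 users
     generate_series(1, 70000) as i;    -- 70,000 rows per user per day
-- repeat with shifted item_idx values for more "days", then
analyze test_second;
explain analyze select * from test_second where user_idx = 42;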
Joe
Good questions. I guess there are two answers. There are times when
I will want aggregate data and I'm not as concerned about the
execution time.
But there are other queries that are part of the application design.
These are always going to be of a type where I know a single specific
primary key value and I want to find all the rows that are related.
The first table has a column
idx serial primary key
The third table has a column
idx bigserial primary key
and the second table (the billion-row table) consists of two columns:
first_idx integer not null references first(idx) on delete cascade,
third_idx bigint not null references third(idx) on delete cascade,
constraint pkey_first_third primary key (first_idx, third_idx)
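Spelled out as DDL, that's roughly the following (other columns
omitted; third also carries at least the string column that the query
below pulls out):
create table first (
  idx serial primary key
  -- plus the per-user columns
);
create table third (
  idx bigserial primary key,
  string text
  -- plus whatever else
);
create table second (
  first_idx integer not null references first(idx) on delete cascade,
  third_idx bigint not null references third(idx) on delete cascade,
  constraint pkey_first_third primary key (first_idx, third_idx)
);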
The common query will be:
select t.string
from first f, second s, third t
where f.idx = s.first_idx
and s.third_idx = t.idx
and f.idx = 4 (or whatever...).
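I assume the main thing to watch is whether the planner actually uses
the (first_idx, third_idx) primary key for that lookup, i.e. checking
plans along the lines of:
explain analyze
select t.string
from first f, second s, third t
where f.idx = s.first_idx
and s.third_idx = t.idx
and f.idx = 4;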
So, I think the answer is that the data isn't going to be temporal or
otherwise segregated into subsets.
I'll assume this is a lead in for partitions?
The data will be queried very frequently. Probably plan on a query
every 10 seconds and I don't know what idx ranges will be involved.
Would it be possible to partition this by the first_idx value? An
improvement?
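In other words, something like this (ranges picked arbitrarily, just
to illustrate what I mean):
-- second becomes an empty parent; each child holds a range of first_idx
create table second_001_025 (
  check (first_idx >= 1 and first_idx < 26)
) inherits (second);
create table second_026_050 (
  check (first_idx >= 26 and first_idx < 51)
) inherits (second);
-- ...and so on.  Each child would need its own primary key/index
-- declared, and with constraint_exclusion on a query that pins
-- f.idx to a single value should only scan the matching child.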