On Jun 10, 2007, at 2:14 PM, Joe Conway wrote:
Bill Moran wrote:
Tom Allison <tom@xxxxxxxxxxx> wrote:
If the user base gets to 100 or more, I'll be hitting a billion
rows before too long. I add about 70,000 rows per user per day.
At 100 users this is 7 million rows per day. I'll hit a billion
in 142 days, call it six months for simplicity.
The table itself is small (two columns: bigint, int) but I'm
wondering when I'll start to hit a knee in performance and how I
can monitor that. I know where I work (day job) they have Oracle
tables with a billion rows that just plain suck. I don't know
if a billion is bad or if the DBAs were not given the right
opportunity to make their tables work.
But if they are any indication, I'll be feeling some hurt when I
exceed a billion rows. Am I going to just fold up and die in six
months?
A lot depends on your specific use case.
- Will you be just storing the data for archival purposes, or
frequently querying the data?
- If you need to run queries, are they well bounded to certain
subsets of the data (e.g. a particular range of time for a
particular user) or are they aggregates across the entire billion
rows?
- Is the data temporal in nature, and if so do you need to purge it
after some period of time?
As an example, I have an application with temporal data that needs
periodic purging and is typically queried for small time ranges
(tens of minutes). We have set up partitioned tables (partitioned
by date range and data source -- akin to your users) using
constraint exclusion that contain 3 or 4 billion rows (total of all
partitions), and we have no problem at all with performance. But I
suspect that if we needed to do an aggregate across the entire
thing it would not be particularly fast ;-)
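The general shape of that, heavily simplified (table and column names
invented for illustration here, not our actual schema), looks like:
-- parent table; holds no rows itself
create table readings (
  source_id integer not null,   -- akin to your users
  logged_at timestamp not null,
  value numeric
);
-- one child per (month, source), with CHECK constraints that let
-- constraint exclusion skip partitions that cannot match a query
create table readings_2007_06_s1 (
  check (source_id = 1),
  check (logged_at >= '2007-06-01' and logged_at < '2007-07-01')
) inherits (readings);
create index readings_2007_06_s1_ts on readings_2007_06_s1 (logged_at);
-- with constraint_exclusion enabled, a bounded query against the
-- parent only scans the children whose constraints can satisfy it
set constraint_exclusion = on;
select *
from readings
where source_id = 1
and logged_at >= '2007-06-10 14:00'
and logged_at < '2007-06-10 14:30';
-- purging an expired range is then just a DROP TABLE on that child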
Why not just create a simulation of 100 users and run it as hard
as you can until it starts to degrade? Then you'll have some
real-world experience to tell you how much you can handle.
This is good advice. Without much more detail, folks on the list
won't be able to help much, but with a simulation such as this you
can answer your own question...
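For the bulk-loading side of a test like that, generate_series makes
it cheap to fake the volume; something along these lines (the table
name is just a placeholder, and you'd want your real indexes in place
before timing anything):
-- roughly one simulated day: 100 users * 70,000 rows = 7 million rows
create table test_second (user_idx integer not null, item_idx bigint not null);
insert into test_second (user_idx, item_idx)
select u, (u::bigint * 1000000) + i
from generate_series(1, 100) as u,      -- 100 users
     generate_series(1, 70000) as i;    -- 70,000 rows per user per day
-- repeat with shifted item_idx values for more "days", then
analyze test_second;
explain analyze select * from test_second where user_idx = 42;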
Joe
Good questions. I guess there are two answers. There are times when
I will want aggregate data and I'm not as concerned about the
execution time.
But there are other queries that are part of the application design.
These are always going to be of a type where I know a single specific
primary key value and I want to find all the rows that are related.
The first table has a column
idx serial primary key
The third table has a column
idx bigserial primary key
and the second table (the billion-row table) consists of two columns:
first_idx integer not null references first(idx) on delete cascade,
third_idx bigint not null references third(idx) on delete cascade,
constraint pkey_first_third primary key (first_idx, third_idx)
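Spelled out as DDL, that's roughly the following (other columns
omitted; third also carries at least the string column that the query
below pulls out):
create table first (
  idx serial primary key
  -- plus the per-user columns
);
create table third (
  idx bigserial primary key,
  string text
  -- plus whatever else
);
create table second (
  first_idx integer not null references first(idx) on delete cascade,
  third_idx bigint not null references third(idx) on delete cascade,
  constraint pkey_first_third primary key (first_idx, third_idx)
);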
The common query will be:
select t.string
from first f, second s, third t
where f.idx = s.first_idx
and s.third_idx = t.idx
and f.idx = 4 (or whatever...).
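I assume the main thing to watch is whether the planner actually uses
the (first_idx, third_idx) primary key for that lookup, i.e. checking
plans along the lines of:
explain analyze
select t.string
from first f, second s, third t
where f.idx = s.first_idx
and s.third_idx = t.idx
and f.idx = 4;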
So, I think the answer is that the data isn't going to be temporal or
otherwise segregated into subsets.
I'll assume this is a lead in for partitions?
The data will be queried very frequently. Probably plan on a query
every 10 seconds and I don't know what idx ranges will be involved.
Would it be possible to partition this by the first_idx value? An
improvement?
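In other words, something like this (ranges picked arbitrarily, just
to illustrate what I mean):
-- second becomes an empty parent; each child holds a range of first_idx
create table second_001_025 (
  check (first_idx >= 1 and first_idx < 26)
) inherits (second);
create table second_026_050 (
  check (first_idx >= 26 and first_idx < 51)
) inherits (second);
-- ...and so on.  Each child would need its own primary key/index
-- declared, and with constraint_exclusion on a query that pins
-- f.idx to a single value should only scan the matching child.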