Re: At what point does a big table start becoming too big?

Bill Moran <wmoran@xxxxxxxxxxxxxxxxx> · Thu, 23 Aug 2012 22:22:22 -0400

On Thu, 23 Aug 2012 17:56:37 -0700 Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

> On Thu, Aug 23, 2012 at 6:37 AM, Bill Moran <wmoran@xxxxxxxxxxxxxxxxx> wrote:
> > In response to "Martin French" <Martin.French@xxxxxxxxxxxxx>:
> >> >
> >> > I have a table with 40 million rows and haven't had any performance
> >> > issues yet.
> >> >
> >> > Are there any rules of thumb as to when a table starts getting too big?
> >> >
> >> > For example, maybe if the index size is 6x the amount of ram, if the
> >> > table is 10% of total disk space, etc?
> >>
> >> My rule here is that a table is too big when performance starts degrading beyond an acceptable level.
> >
> > The challenge there is that if you wait until performance degrades
> > beyond an acceptable level, you've allowed yourself to get into a
> > situation where clients are upset and frustrated, and fixing the
> > problem is difficult because there's so much data to manipulate to
> > rearrange things.
> 
> Yes, I agree with that.
> 
> > And the advice I have along those lines is to establish now what
> > constitutes unacceptable performance, and put some sort of monitoring
> > and tracking in place to know what your performance degradation looks
> > like and predict when you'll have to react.  For example, a MRTG
> > graph that runs an experimental query once a day during off hours and
> > graphs the time it takes vs. the # of rows in the table will prove
> > a valuable tool that can sometimes predict exactly when you'll have
> > to change things before it becomes a problem.
> 
> This seems inconsistent with your previous advice.  By the time your
> experimental query shows a problem, you no longer have any maintenance
> windows left large enough to fix it.  Unless your experimental query
> was a reindex or something non-production like that, in which case
> running it on a production server, even off-hours, doesn't seem like a
> good idea.

Perhaps I didn't explain the approach sufficiently.

If you can establish something like, "This specific SELECT has to run
in under 5 minutes to meet the client's expectations" you can then
time how long that query takes each time it's run (by capturing that
information in the application, for example ... or by running it in
some automated fashion ... possibly other methods as well).

If you capture that runtime on a regular basis and put the results
on a graph in concert with other relevant data, such as the number
of rows in the related tables, size of the data, etc, you quickly
get a good picture of how fast things are growing, and frequently
you can project the line out into the future and say things like
"if we don't come up with a better way to do this by Sept of next
year, we're going to exceed our allowed run time."  You can then
take that very detailed information to business planners and point
out that they need to schedule developer time _before_ then if they
don't want the application to slow down below the allowable level.

Unless you work for somewhere that has unlimited resources, your
time is always split between feature requests, day to day operations,
firefighting, etc.  In my experience, keeping things like this
under control is often a matter of having enough information to
justify why your optimization project is more important than
whizbang feature x that marketing wants so bad.

Of course, if you work somewhere with unlimited resources, you
should let me know so I can send in my resume.

And none of what I'm suggesting is intended to belittle the other
suggestions either -- if you know of a way to optimize the data
better, why not do it now?  If you can be purging old data, why
wait until performance is a problem to start purging, etc.

It's just another trick to have in your bag.

-- 
Bill Moran <wmoran@xxxxxxxxxxxxxxxxx>

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general