Greg Smith wrote:
On Mon, 11 Jun 2007, Steve Crawford wrote:
In my experience the more common situation is to "go off a cliff."
Yeah, I think the idea that you'll notice performance degrading and be
able to extrapolate future trends using statistical techniques is a
bit...optimistic.
Anyway, back to the original question here. If you're worried about
catching when performance starts becoming an issue, you need to do some
sort of logging of how long statements are taking to execute. The main
choice is whether to log everything, at which point the logging and
sorting through all the data generated may become its own performance
concern, or whether to just log statements that take a long time and
then count how many of them show up. Either way will give you some sort
of early warning once you get a baseline; it may take a bit of tweaking
to figure out where to draw the line for what constitutes a "long"
statement if you only want to see how many of those you get.
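As a rough sketch of the second approach (the 200ms threshold and the exact
prefix format are only starting guesses you'd tune yourself), the relevant
postgresql.conf settings look something like:

log_min_duration_statement = 200   # log any statement that runs longer than 200ms
log_line_prefix = '%t [%p]: '      # timestamp + PID so the log analyzers can group entries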
There are two tools you should look at initially to help process the
logging information you get back: pgFouine and PQA. Here are intros to
each that also mention how to configure the postgresql.conf file:
http://pgfouine.projects.postgresql.org/tutorial.html
http://www.databasejournal.com/features/postgresql/article.php/3323561
As they're similar programs, which would work better for you is hard to
say; check out both and see which seems more practical or easier to get
running. For example, if you only have one of PHP or Ruby installed, that
may make one tool or the other the easier choice.
If you can get yourself to the point where you can confidently say
something like "yesterday we had 346 statements that took more then
200ms to execute, which is 25% above this month's average", you'll be in
a positition to catch performance issues before they completely
blindside you; makes you look good in meetings, too.
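As a purely hypothetical illustration (the table and column names below are
made up), once the parsed log entries are sitting in a table, that daily count
is a one-liner:

-- slow_statements(logged_at, duration_ms) is an imaginary table of parsed log entries
select count(*)
  from slow_statements
 where duration_ms > 200
   and logged_at >= current_date - 1
   and logged_at <  current_date;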
Starting to sound like a sane idea.
I've been running a test job for almost 24 hours and have accumulated only 8
million rows. That's another 125 days to get to the big 'B'. I think by then
I'll have blown a hard drive or worse. I'm running this on some very old
hardware that I have available (more on this at the bottom).
However, at this point the machine is running all of the SQL at <0.2 seconds
each, which I consider just fine for 7,599,519 rows.
Here are some specifics about the tables:
count() from headers: 890300
count() from tokens: 890000
count() from header_token: 7599519
CREATE TABLE header_token (
    header_idx integer NOT NULL,
    token_idx integer NOT NULL
);
CREATE TABLE headers (
    idx serial NOT NULL,
    hash character varying(64) NOT NULL
);
CREATE TABLE tokens (
    idx bigserial NOT NULL,
    hash character varying(64) NOT NULL
);
ALTER TABLE ONLY headers
    ADD CONSTRAINT headers_hash_key UNIQUE (hash);
ALTER TABLE ONLY headers
    ADD CONSTRAINT headers_pkey PRIMARY KEY (idx);
ALTER TABLE ONLY header_token
    ADD CONSTRAINT pkey_header_token PRIMARY KEY (header_idx, token_idx);
ALTER TABLE ONLY tokens
    ADD CONSTRAINT tokens_hash_key UNIQUE (hash);
ALTER TABLE ONLY tokens
    ADD CONSTRAINT tokens_pkey PRIMARY KEY (idx);
ALTER TABLE ONLY header_token
    ADD CONSTRAINT header_token_header_idx_fkey FOREIGN KEY (header_idx)
    REFERENCES headers(idx) ON DELETE CASCADE;
ALTER TABLE ONLY header_token
    ADD CONSTRAINT header_token_token_idx_fkey FOREIGN KEY (token_idx)
    REFERENCES tokens(idx) ON DELETE CASCADE;
The SQL statements I was timing were:
select t.hash, h.hash
from headers h, header_token ht, tokens t
where h.idx = ht.header_idx
and ht.token_idx = t.idx
and h.idx = ?
insert into header_token
select $header, idx from tokens where idx in (...)
The SELECT was <0.2 seconds.
The INSERT was easily <0.7 seconds most of the time -- it ranged because the
number of values in the idx IN (...) list varied from 200 to 700. The minimum
was <0.2 and the maximum was >1.0 seconds over a few minutes of observation.
All of this was run on a Pentium II 450 MHz with 412MB RAM and a software
linear RAID 0 pair of UDMA 66 7200RPM 8MB cache drives (really old) on separate
IDE channels, formatted with ReiserFS. The actual script was running on a
separate machine across a 100Base-T full duplex network, through a firewall
machine between the two subnets.
I can't imagine how long it would take to run:
delete from tokens;
with the CASCADE option...
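One guess (untested here) about why that would hurt: each cascaded delete from
tokens has to find the matching header_token rows by token_idx, and the
(header_idx, token_idx) primary key doesn't help with a lookup on token_idx
alone, so without something like the index below it's a scan of all 7.5 million
rows per token:

create index header_token_token_idx on header_token (token_idx);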