On Sep 27, 2006, at 12:35 PM, Rafal Pietrak wrote:
Thanks Duncan for the analysis.
This happened again, so I'm able to peek at the details you've pointed
out.
On Wed, 2006-09-27 at 09:33 -0700, Casey Duncan wrote:
Sounds like it was blocked (unsure by what). You can use pg_locks to
check that.
That view currently contains 240 entries .... source of blocking. Now
I have to dig up the ultimate reason why those locks end up there (and
are not being cleaned). Any ideas?
You need to look at the procpid that's holding the lock and then use
pg_stat_activity to see what it is doing. I'm not sure what locks
that query needs to hold; it may be blocked by any concurrent query
from a user in that group, I dunno.
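Something like this (just a sketch, assuming the 8.1-era column names
procpid and current_query; current_query only shows real SQL once
stats_command_string is on, see below) lists each lock alongside what
that backend is doing, with the waiting (ungranted) locks first:

SELECT l.pid, l.locktype, l.relation::regclass AS relation,
       l.mode, l.granted, a.usename, a.current_query, a.query_start
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
ORDER BY l.granted, l.pid;

Look at the ungranted rows, then at whoever holds a granted lock on
the same relation.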
Would those locks go away by themselves .... in time?
Yeah, once the transactions holding the locks complete.
What was consuming all the CPU? I assume from what you wrote
previously that it is a postgres backend. If so, what was it doing?
----------------ps output (just one backend)------------
postgres 4759 0.0 2.0 19368 5292 ? S 21:02 0:00
postgres: czyn_738773 czyn 127.0.0.1(55431) UPDATE waiting
-------------------------------------------------------
"czyn" is the name of the database, "czyn_738773" is database
username.
IP address is localhost, since the database purpose is web serbvice
back-end.
If the update is waiting, it is blocked and should not be consuming
any noticeable CPU. Was this using lots of CPU at that moment?
Still, I'm not sure what that UPDATE could be.
You can use ps, the pg_stat_activity view and strace to figure
that out.
------------- excerpt from pg_stat_activity-----------------
17660 | czyn | 4759 | 19541 | czyn_738773 | <command string not
enabled> | | 2006-09-27 21:02:05.914728+02 | 127.0.0.1 |
55431
(18 rows)
------------------------------------------
But this is not very meaningful to me.
You need to add this to your postgresql.conf:
stats_command_string = on
Then send a HUP signal to the postmaster (you can use pg_ctl reload
for this, or 'kill -HUP').
Then you can see what the exact command is.
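For example (a sketch against the 8.1 catalogs; idle backends show
their current_query as '<IDLE>'):

SHOW stats_command_string;   -- should report 'on' after the reload
SELECT procpid, usename, query_start, current_query
FROM pg_stat_activity
WHERE current_query NOT LIKE '<IDLE>%';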
Another option that may help with diagnosis is:
log_min_duration_statement = 500
This will log the SQL for all statements taking longer than 500ms.
You can adjust this up and down depending on what counts as "slow" in
your app. You probably only want it on temporarily, but I find it
sometimes helpful to always leave it on with a suitably high value
(like 1000). That way you can always troll the logs for laggards.
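If I remember right, a superuser can also set this per session for a
quick test, without touching postgresql.conf (note that it then only
applies to statements run in that session):

SET log_min_duration_statement = 500;  -- log statements slower than 500 ms
SHOW log_min_duration_statement;       -- confirm the value in effect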
...another postgres-process restart, and another VACUUM ANALYSE; this
one completes in 3-5 secs. Now "ALTER GROUP ..." goes OK. And
everything seems to be working.
Was something else using the database previously? Was something
different this time compared to the last?
No. My impression was (naive, I know) that not everything was
correctly cleaned up on the first restart. Now I think that on the
second go I was just a little bit more lucky - the www service was
calm for a little longer. That was enough for VACUUM to complete.
Yeah, it sounds like things are fairly busy given the locks above.
Still, trouble started later.
How big is the data on disk? Is it possible that you have queries
close to nothing: ca. 200 MiB
Sounds like the problem is more about concurrency than data size. How
many concurrent connections do you have (select count(*) from
pg_stat_activity)? How many are in transaction and running queries
concurrently?
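A quick way to get both numbers at once (again just a sketch using the
8.1-era current_query column):

SELECT count(*) AS connections,
       sum(CASE WHEN current_query NOT LIKE '<IDLE>%'
                THEN 1 ELSE 0 END) AS active
FROM pg_stat_activity;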
If you see multiple instances of a given query (or the same one
appearing frequently), try running EXPLAIN ANALYZE on it. That'll
tell you what makes it slow.
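For instance, with a made-up update (the table and columns here are
placeholders - substitute the real statement once you can see it in
pg_stat_activity or the logs):

BEGIN;
EXPLAIN ANALYZE
  UPDATE members SET status = 'active' WHERE group_id = 42;
ROLLBACK;  -- EXPLAIN ANALYZE really executes the update, so roll it back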
problem (hardware, os, fs). When you saw this happen did you happen
to check if it was user or system cpu usage? I had a corrupt
It's the user CPU time.
Yeah, just sounds like a scaling issue. Hopefully easily solved ;^)
As it currently looks like some unknown UPDATE is causing the trouble,
how can I identify what it is?
See the config recommendations above.
1. the database is not very sophisticated; apart from what I've
mentioned earlier, it has just a few stored procedures in plpgsql -
nothing fancy. And there are no triggers, just specialized views
(although quite a lot of referential integrity constraints).
These can be a source of lock contention on updates (though this is
much better on 8.1 compared to 7.4). Once you know which update query
is getting blocked, you'll know for sure.
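If you want to see which foreign keys reference a given table (and so
whose checks could be taking locks on your updates), something like
this should list them; 'mytable' is just a placeholder:

SELECT c.conname, c.conrelid::regclass AS referencing_table
FROM pg_constraint c
WHERE c.contype = 'f'
  AND c.confrelid = 'mytable'::regclass;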
2. machine is 'from the old days': 450MHz CPU
What is the iowait like on this box? You can use iostat to determine
this. That will tell you if the disks are also struggling.
3. the system is currently available to the public, so there may be
'waves' (peaks) of activity.
4. when the system was tested in-lab (small number of users - no
risk / not tested for heavier-load behavior), I didn't notice this
'blocking'.
With any luck the above will help point out some suboptimal queries
that are causing the headache.
-Casey