On Fri, 12 Jun 2009, Alan McKay wrote:
> So, from the perspective of both Linux and PG, is there a canonical
> list of "here are the most important X things to track"?
Not really, which is why you haven't gotten such a list from anyone here.
Exactly what's important to track does vary a bit based on expected
workload, and most of the people who have been through this enough to give
you a good answer are too busy to write one (your message sat in my "I
should respond to that" queue for two weeks before I found time to write).
> Is there some good Linux performance monitoring and tuning reading
> that you can recommend?
The only good intro to this I've ever seen, from the perspective of
monitoring things that would be useful to a database administrator, is the
coverage of monitoring in "Performance Tuning for Linux Servers" by
Johnson/Huizenga/Pulavarty. Their tuning advice wasn't so useful, but
most OS tuning suggestions aren't either.
The more useful way to ask the question you'd like an answer to is "when
my server starts to perform badly, what does that correlate with?" Find
out what you need to investigate to figure that out, and you can determine
what you should have been monitoring all along. That is unfortunately
workload dependent; the stuff that tends to go wrong in a web app is very
different from what happens to a problematic data warehouse, for example.
The basic important OS level stuff to watch is:
-Total memory in use
-All the CPU% numbers
-Disk read/write MB/s at all levels of granularity you can collect (total
across the system, filesystem, array, individual disk). You'll only want
to track the total until there's a problem, at which point it's nice to
have more data to drill down into.
There are a bunch more disk and memory stats available, but I rarely find
them of any use. The one Linux-specific bit I do like to monitor is the
line labeled "Writeback" in /proc/meminfo, because that's the best
indicator of how much write caching is being done at the OS level. That's
an early warning sign for many problems in an area where Linux often has
trouble.
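For reference, the line in question looks like this (the value here is
made up):

Writeback:         24760 kB

A value that stays large instead of spiking briefly and draining back down
is the thing to worry about.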
On the database side, you want to periodically check the important
pg_stat_* views to get an idea how much activity is happening (and where
it's happening), as well as to look for excessive dead tuples and bad
index utilization (which manifests as things like too many sequential
scans); there's a sample query for spotting both after this list:
-pg_stat_user_indexes
-pg_stat_user_tables
-pg_statio_user_indexes
-pg_statio_user_tables
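A minimal sketch of what to look for in pg_stat_user_tables (assumes
PostgreSQL 8.3 or later, where n_dead_tup is available; the limit of 10 is
arbitrary):

select relname, seq_scan, idx_scan, n_dead_tup
  from pg_stat_user_tables
 order by seq_scan desc
 limit 10;

Tables with a high seq_scan count but low idx_scan are candidates for
missing indexes, while a large n_dead_tup suggests vacuum isn't keeping
up.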
If your system is write-intensive at all, you should watch
pg_stat_bgwriter too, to keep an eye on when checkpoint and background
writer activity goes badly.
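A snapshot of that view looks like this (these columns exist as of 8.3;
the numbers are counters since the last stats reset, so it's the deltas
over time that are interesting):

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
  from pg_stat_bgwriter;

If checkpoints_req regularly outpaces checkpoints_timed, or
buffers_backend dwarfs buffers_clean, writes are likely outrunning your
checkpoint and background writer settings.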
At a higher level, it's a good idea to graph the size of the tables and
indexes most important to your application over time.
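A sketch of how to grab those numbers using the built-in size functions
(pg_total_relation_size includes indexes and TOAST; log the raw bytes
rather than the pretty form if you want to graph them):

select relname,
       pg_total_relation_size(relid) as total_bytes,
       pg_size_pretty(pg_total_relation_size(relid)) as total_size
  from pg_stat_user_tables
 order by total_bytes desc
 limit 10;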
It can be handy to track things derived from pg_stat_activity too, like
the total connection count and how old the oldest transaction is.
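A minimal sketch covering both (assumes 8.3 or later, where
pg_stat_activity exposes xact_start; idle sessions have a null xact_start,
so they count toward connections but not the transaction age):

select count(*) as connections,
       max(now() - xact_start) as oldest_xact_age
  from pg_stat_activity;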
pg_locks can be handy to track stats on too, something like these two
counts over time:
select (select count(*) from pg_locks where granted) as granted,
       (select count(*) from pg_locks where not granted) as ungranted;
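If the ungranted count stays above zero for any length of time, sessions
are waiting on locks, and it's worth joining pg_locks against
pg_stat_activity to see who is blocking whom.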
That's the basic set I find myself looking at regularly enough that I wish
I always had a historical record of them from the system. I've never
bothered to work this into a more formal article because a) the
workload-specific stuff makes it complicated to explain for everyone, b)
there's a wide variety of monitoring tools out there, and c) covering the
material right takes a while to do on a topic this big.
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD