mljv@xxxxxxxxxxxx wrote:
We looked at our CPU monitoring and saw that we have huge I/O wait while VACUUM
is running, which is not unusual. But just after VACUUM finished, the
user CPU load rose to 200% (dual CPU). At this time in the early morning
there were only very few users using our app. We analyzed the situation
further and saw that most queries had a very long duration (minutes
instead of milliseconds). BTW: the vacuum process was not running at this
moment; it had already been logged as done.
But WHAT was using all of your CPU? Was it PostgreSQL, and if so, just
one backend? If something else was using all your CPU, then it might
just be that the PostgreSQL server wasn't getting a chance to run your
queries.
* Maybe the VACUUM ANALYZE process did not have enough memory and therefore could
not ANALYZE the tables correctly. It then writes wrong statistics to the
database, which results in wrong execution plans using sequential scans instead
of index scans. This only happens if the VACUUM ANALYZE process runs
concurrently with user requests. If it runs on its own, the vacuum process
has enough memory and writes correct statistics to the database.
Doesn't sound likely to me. You've got 8GB of RAM, and if you were going
into swap you'd have noticed the disk I/O.
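Still, if you want to rule stale statistics out, you could EXPLAIN one of the
slow queries and then re-ANALYZE the tables it touches. A minimal sketch -
"orders" and the WHERE clause are just placeholders for one of your real queries:

  -- See whether the planner picks a sequential scan where you expect an index scan
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

  -- Refresh the statistics for that table, then check the plan again
  ANALYZE orders;
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

If the plan switches to an index scan after the ANALYZE, the statistics really
were off; otherwise look elsewhere.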
Here are some of our configuration parameters. We never really tweaked them, as
everything ran fine; we just raised some parameters. The following list should show
all parameters changed from the default:
max_connections = 300
shared_buffers = 30000
work_mem = 10240
OK, so that's 30,000 * 8KB = 240MB of shared_buffers
You have 10MB of work_mem, and if all 300 connections were using that
much you'd have committed 3GB of your RAM just for that. Of course each
backend will want more memory than that.
Do you really have 300 concurrent connections?
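One quick way to check (a sketch using the standard statistics views, which are
available in 8.1):

  -- Total number of backends, plus a breakdown per database and user
  SELECT count(*) FROM pg_stat_activity;

  SELECT datname, usename, count(*)
    FROM pg_stat_activity
   GROUP BY datname, usename
   ORDER BY count(*) DESC;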
maintenance_work_mem = 163840
160MB for vacuums - should be OK given how much memory you have and the
fact that it's quiet when you vacuum.
vacuum_mem = 32000
This is what maintenance_work_mem used to be called. You can delete this
entry.
max_fsm_pages = 500000
You can track at most 500,000 pages with free space on them. In 8.2+
versions, VACUUM VERBOSE will show you how many are currently being used;
I'm not sure about 8.1.
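On 8.2 you can compare free space map usage against the limit roughly like
this (the numbers appear in the INFO/DETAIL summary at the end of a
database-wide vacuum):

  -- Run as a superuser; the summary at the end reports how many page
  -- slots are in use and how many would be needed to track all free space.
  VACUUM VERBOSE;
  SHOW max_fsm_pages;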
bgwriter_lru_percent = 10.0
bgwriter_lru_maxpages = 100
bgwriter_all_percent = 5
bgwriter_all_maxpages = 200
wal_buffers = 16
checkpoint_segments = 10
If you have bursts of write activity you might want to increase this.
checkpoint_warning = 3600
effective_cache_size = 180000
That's 180,000 * 8KB = 1,440,000KB, i.e. roughly 1.4GB.
If that's really all you're using as cache, I'd reduce the number of
concurrent connections. Check free/top and see how much RAM is really
being used as disk cache.
random_page_cost = 3
Might be too high - you don't mention what disks you have.
stats_command_string = off
If you turn this one on, you'll be able to see the queries each backend
is executing as they happen. Might be useful, but does have some cost.
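For example, after setting stats_command_string = on in postgresql.conf and
reloading, something like this should show what each backend is running
(column names as of 8.1; later releases renamed some of them):

  -- procpid matches the PID you see in top/ps, so a busy process can be
  -- tied back to the query it is currently executing
  SELECT procpid, usename, current_query, query_start
    FROM pg_stat_activity
   ORDER BY query_start;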
The crucial thing is to find out exactly what is happening when things
get very slow. Check vmstat and top, look in the pg_locks system table,
and if need be we can see what strace says a particular backend is doing.
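For the lock check, something along these lines should show any backend stuck
waiting (a minimal sketch against the 8.1-era pg_locks columns):

  -- Ungranted locks are the ones a backend is waiting on; pid matches the
  -- backend's process id in top/ps and pg_stat_activity.procpid
  SELECT pid, locktype, relation::regclass AS relation, mode, granted
    FROM pg_locks
   WHERE NOT granted;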
--
Richard Huxton
Archonet Ltd