mljv@xxxxxxxxxxxx wrote:
We looked at our CPU monitoring and saw that we have huge I/O wait while VACUUM
is running, which is not unusual. But just after VACUUM finished, the
user CPU load rose to 200% (dual CPU). At this time in the early morning
there were only very few users using our app. We analyzed the situation
further and saw that most queries had a very long duration (minutes
instead of milliseconds). BTW: the vacuum process was not running at this
moment; it had already been logged as done.
But WHAT was using all of your CPU? Was it PostgreSQL, and if so, just
one backend? If something else was using all your CPU, then it might
just be that the PostgreSQL server wasn't getting a chance to run your
queries.
* Maybe the VACUUM ANALYZE process did not have enough memory and therefore could
not ANALYZE the tables correctly. It then writes wrong statistics to the
database, which results in wrong execution plans using sequential scans instead
of index scans. This only happens if the VACUUM ANALYZE process runs
concurrently with user requests. If it runs on its own, the vacuum process
has enough memory and writes correct statistics to the database.
Doesn't sound likely to me. You've got 8GB of RAM, and if you were going
into swap you'd have noticed the disk I/O.
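Still, if you want to rule stale statistics out, you could EXPLAIN one of the
slow queries and then re-ANALYZE the tables it touches. A minimal sketch -
"orders" and the WHERE clause are just placeholders for one of your real queries:

  -- See whether the planner picks a sequential scan where you expect an index scan
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

  -- Refresh the statistics for that table, then check the plan again
  ANALYZE orders;
  EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

If the plan switches to an index scan after the ANALYZE, the statistics really
were off; otherwise look elsewhere.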
Here are some of our configuration parameters. We never really tweaked them, as
everything ran fine; we just raised some parameters. The following list should show
all parameters changed from the default:
max_connections = 300
shared_buffers = 30000
work_mem = 10240
OK, so that's 30,000 * 8KB = 240MB of shared_buffers
You have 10MB of work_mem, and if all 300 connections were using that
much you'd have committed 3GB of your RAM just for that. Of course each
backend will want more memory than that.
Do you really have 300 concurrent connections?
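One quick way to check (a sketch using the standard statistics views, which are
available in 8.1):

  -- Total number of backends, plus a breakdown per database and user
  SELECT count(*) FROM pg_stat_activity;

  SELECT datname, usename, count(*)
    FROM pg_stat_activity
   GROUP BY datname, usename
   ORDER BY count(*) DESC;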
maintenance_work_mem = 163840
160MB for vacuums - should be OK given how much memory you have and the
fact that it's quiet when you vacuum.
vacuum_mem = 32000
This is what maintenance_work_mem used to be called. You can delete this
entry.
max_fsm_pages = 500000
You can track at most 500,000 pages with free space on them. In 8.2+
versions, VACUUM VERBOSE will show you how many are currently being used;
I'm not sure about 8.1.
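On 8.2 you can compare free space map usage against the limit roughly like
this (the numbers appear in the INFO/DETAIL summary at the end of a
database-wide vacuum):

  -- Run as a superuser; the summary at the end reports how many page
  -- slots are in use and how many would be needed to track all free space.
  VACUUM VERBOSE;
  SHOW max_fsm_pages;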
bgwriter_lru_percent = 10.0
bgwriter_lru_maxpages = 100
bgwriter_all_percent = 5
bgwriter_all_maxpages = 200
wal_buffers = 16
checkpoint_segments = 10
If you have bursts of write activity you might want to increase this.
checkpoint_warning = 3600
effective_cache_size = 180000
That's 180,000 * 8KB = 1,440,000KB, i.e. roughly 1.4GB.
If that's really all you're using as cache, I'd reduce the number of
concurrent connections. Check free/top and see how much RAM is really
being used as disk cache.
random_page_cost = 3
Might be too high - you don't mention what disks you have.
stats_command_string = off
If you turn this one on, you'll be able to see the queries each backend
is executing as they happen. Might be useful, but does have some cost.
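For example, after setting stats_command_string = on in postgresql.conf and
reloading, something like this should show what each backend is running
(column names as of 8.1; later releases renamed some of them):

  -- procpid matches the PID you see in top/ps, so a busy process can be
  -- tied back to the query it is currently executing
  SELECT procpid, usename, current_query, query_start
    FROM pg_stat_activity
   ORDER BY query_start;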
The crucial thing is to find out exactly what is happening when things
get very slow. Check vmstat and top, look in the pg_locks system table,
and if need be we can see what strace says a particular backend is doing.
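For the lock check, something along these lines should show any backend stuck
waiting (a minimal sketch against the 8.1-era pg_locks columns):

  -- Ungranted locks are the ones a backend is waiting on; pid matches the
  -- backend's process id in top/ps and pg_stat_activity.procpid
  SELECT pid, locktype, relation::regclass AS relation, mode, granted
    FROM pg_locks
   WHERE NOT granted;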
--
Richard Huxton
Archonet Ltd