Re: Problem after VACUUM ANALYZE

mljv@xxxxxxxxxxxx · Tue, 8 Apr 2008 18:12:12 +0200

HI Richard,

thanks for your immediate response. I will answer your questions below:

Am Dienstag, 8. April 2008 17:40 schrieb Richard Huxton:
> mljv@xxxxxxxxxxxx wrote:
> > We looked in our cpu monitoring and saw that we have huge IOwait while
> > VACUUM is running, not unusual though. But just after VACUUM was
> > finished, the userCPU load raised to 200% (dual cpu). At this time in the
> > early morning there were only very few users using our app. We analyzed
> > the situation further and saw that most queries had a very long duration
> > time (minutes instead of milliseconds). BTW: the vacuum process was not
> > running at this moment. It was logged before as done.
>
> But WHAT was using all of your cpu? Was it PostgreSQL, and if so just
> one backend? If something else was using all your cpu, then it might
> just be that the PostgreSQL server wasn't getting a chance to run your
> queries.

The CPU was used by postgresql. As i saw many SELECTS by looking at "ps ax" i 
guess that many backends using all of the cpu.

> > * Maybe the Vacuum analyze process has not enough memory and therefore
> > could not ANALYZE the tables correctly. It then writes wrong statistics
> > to the database which results in wrong execution plans using sequence
> > scans instead of index scans. This only happens if the vacuum analyze
> > process runs concurrently with user requests. If it runs on its own, the
> > vacuum process has enough memory and writes correct statistics to the
> > database.
>
> Doesn't sound likely to me. You've got 8GB of RAM, and if you were going
> into swap you'd have noticed the disk I/O.

That makes sense. We had not any IOwait at this moment but 200% userCPU. 
But maybe vaccum can't use the ram because of a mis-configuration.

> > Here are some of our configuration parameters. We never really tweaked it
> > as it ran fine. We just raised some parameters. The following list should
> > show all parameters changed from the default:
> >
> > max_connections = 300
> > shared_buffers = 30000
> > work_mem = 10240
>
> OK, so that's 30,000 * 8KB = 240MB of shared_buffers
> You have 10MB of work_mem and if all 300 connections were using that
> much you'd have committed 3GB of your RAM for that. Of course they'll
> want more than just that.
>
> Do you really have 300 concurrent connections?

At peaktime, yes. 

> > maintenance_work_mem = 163840
>
> 160MB for vacuums - should be OK given how much memory you have and the
> fact that it's quiet when you vacuum.
>
> > vacuum_mem = 32000
>
> This is what maintenance_work_mem used to be called. You can delete this
>   entry.
>
> > max_fsm_pages = 500000
>
> You can track at most 500,000 pages with free space on them. In 8.2+
> versions VACUUM VERBOSE will show you how many are currently being used.
> Not sure about 8.1
>
> > bgwriter_lru_percent = 10.0
> > bgwriter_lru_maxpages = 100
> > bgwriter_all_percent = 5
> > bgwriter_all_maxpages = 200
> > wal_buffers = 16
> >
> > checkpoint_segments = 10
>
> If you have bursts of write activity you might want to increase this.
>
> > checkpoint_warning = 3600
> >
> > effective_cache_size = 180000
>
> That's 180,000 * 8KB = 180 * 8MB = 1.4GB
> If that's really all you're using as cache, I'd reduce the number of
> concurrent connections. Check free/top and see how much RAM is really
> being used as disk cache.
>
> > random_page_cost = 3
>
> Might be too high - you don't mention what disks you have.

3ware RAID-1 Controller with plain (cheap) SATA disks

> The crucial thing is to find out exactly what is happening when things
> get very slow. Check vmstat and top, look in the pg_locks system-table
> and if needs be we can see what strace says a particular backend is doing.

The problem is that we have peaktimes were everything is running fine. It has 
something to do with the vacuum process running. To simplify my problem:

- I run vaccum analyze concurrently with some few user queries: slows down to 
a crawl after vacuum is finished.

- if i run it while no user is connected, everything runs fine afterwards.

It has something to do with the vacuum analyze process. 

kind regards,
janning