Problem after VACUUM ANALYZE

mljv@xxxxxxxxxxxx · Tue, 8 Apr 2008 16:57:26 +0200

Hi all,

our postgresql DB was running fine for a long time, but suddenly we 
encountered a huge problem which we got fixed only temporarily.

We are running debian stable with postgresql 8.1.11. Our app is connecting via 
JDBC and uses Prepared Statements. 

We are not running autovacuum but a nightly VACUUM ANALYZE at 3:00 am 
(without 'full' option of course). 

Our largest table has 80,000,000 records and has a relation size of 4.4 GB. 
Indizes of this table are 5.5 GB (total_relation_size = 10GB). this table is 
queried a lot of times. The second largest table is not even 20% of the 
largest table. Other tables range from 10KB to 300MB (total_relation_size).

The server is dedicated DB server with dual cpu (AMD Athlon(tm) 64 X2), 8 GB 
of ram, 3ware RAID-1 with SATA harddisks.

Everything ran fine. The Vacuum process took quite a long time each night 
(about 60 minutes) but we didn't care as it was a very low traffic time.

Suddenly one morning the database was still running but it was VERY slow. Lots 
of SELECT queries were hanging around waiting for an answer.

The VACUUM process was already done at this moment.

First try was to restart the database. After the restart the problem occured 
again. The vaccum process was already done BEFORE the restart and was not run 
again. We just restarted the database. Next try was a REINDEX while 
disonnecting all clients, but it didn't helped either.

Next try was to disconnect all clients again (servlet containers), then we did 
a restart of the database and run "VACUUM ANALYZE" as the only query with no 
other concurrent queries. It took an hour. Afterwards we started the database 
and everything ran smoothly.

We thought it might have been only some kind of weired circumstances. So we 
left everything else as it was. 

Next night the Vacuum process did run again and after this the database slowed 
down immediately. The same behaviour as the night before. We stopped the 
database, disconnected all clients and ran VACUUM ANALYZE without concurrent 
queries. After restarting everything was fine again.

We looked in our cpu monitoring and saw that we have huge IOwait while VACUUM 
is running, not unusual though. But just after VACUUM was finished, the 
userCPU load raised to 200% (dual cpu). At this time in the early morning 
there were only very few users using our app. We analyzed the situation 
further and saw that most queries had a very long duration time (minutes 
instead of milliseconds). BTW: the vacuum process was not running at this 
moment. It was logged before as done.

As we cannot afford a downtime every morning, we disabled the VACUUM process 
at night. Next morning everything was fine and is till then (10 days ago).

Of course, not vacuuming at all is not a good idea. 

I don't have any clue what is happening. My thoughts about this as follows 
(might be totally wrong):

* Maybe the Vacuum analyze process has not enough memory and therefore could 
not ANALYZE the tables correctly. It then writes wrong statistics to the 
database which results in wrong execution plans using sequence scans instead 
of index scans. This only happens if the vacuum analyze process runs 
concurrently with user requests. If it runs on its own, the vacuum process 
has enough memory and writes correct statistics to the database.

Here are some of our configuration parameters. We never really tweaked it as 
it ran fine. We just raised some parameters. The following list should show 
all parameters changed from the default:

$ cat /etc/postgresql/8.1/main/postgresql.conf | grep -v '^[[:space:]]*#' | 
grep -v '^$'
listen_addresses = '*'
port = 5432
max_connections = 300
unix_socket_directory = '/var/run/postgresql'
shared_buffers = 30000
work_mem = 10240
maintenance_work_mem = 163840
vacuum_mem = 32000
max_fsm_pages = 500000
bgwriter_lru_percent = 10.0
bgwriter_lru_maxpages = 100
bgwriter_all_percent = 5
bgwriter_all_maxpages = 200
wal_buffers = 16 
checkpoint_segments = 10  
checkpoint_warning = 3600
effective_cache_size = 180000       
random_page_cost = 3
log_min_messages = info
log_min_error_statement = warning
log_min_duration_statement = 500
log_line_prefix = '%t [%p]'
stats_command_string = off
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = on
autovacuum = off                       
autovacuum_naptime = 3000    
lc_messages = 'en_US'                 
lc_monetary = 'en_US'           
lc_numeric = 'en_US'          
lc_time = 'en_US'            

I already tried to google my problem and came across some answers which seems 
to fit, but were all slightly different. Of course updating to 8.3 or restore 
the  whole database might help, but i really want to understand what i did 
wrong to improve my skills. 

If any further information is needed, please ask. I hope my information was 
not too veborse and somebody can help me with my problem. Your help is very 
appreciated. 

kind regards,
janning