[Paul Lathrop - Thu at 02:59:27PM -0800] > growing disk space usage. The DBA had come to the conclusion that the > VACUUM command did/does not work on these systems, because even after a > VACUUM FULL, the size of the database was continually increasing. So, as > things stand with the PG7.2 machines, vacuuming is run nightly, and > whenever the database size reaches 40Gb on disk (the point at which > performance has degraded below tolerance), the DBA exports the data, > deletes the database, and then imports the data, shrinking it to the > actual size of the dataset. We found one reason why vacuuming didn't always work for us - we had long running transactions - in addition to killing the vacuum, it did really nasty things to the performance in general. To check for those transactions, I think it's needed to turn on stats_command_string in the config. I use this query to check: select * from pg_stat_activity where current_query<>'<IDLE>' order by query_start ; If you spot any "<IDLE> in transaction" with an old query_start timestamp, then that's most probably the reason. Long running transactions doesn't have to be idle ... check the pg_locks view for the lowest transactionid and compare (through the pid) with the pg_stat_activity view to find the actual backend. > However, we still are suffering a gradual decrease in performance over > time - or so the application engineers claim. The DBA and I have been > banging our heads against this for a month. We're having the same issues, so we do the dumping and restoring every now and then to be sure everything is properly cleaned up. With 8.1. > 1) How does one define 'performance' anyway? Is it average time to > complete a query? If so, what kind of query? Is it some other metric? We have the same kind of problem, and the project leader (I sometimes refer him as the "bottleneck" ;-) is most concerned about iowait at our cpu graphs. Anyway, we do have other measures: - our applications does log the duration of each request towards the application as well as each query towards the database. If the request (this is web servers) is taking "too long" time, it's logged as error instead of debug. If a significant number of such errors is due to database calls taking too much time, then the performance is bad. Unfortunately, we have no way to automate such checking. - I've setting up two scripts pinging that pg_stat_activity view every now and then, logging how much "gruff" it finds there. Those two scripts are eventually to be merged. One is simply logging what it finds, the other is a plugin system to the Munin graphing package. I've thrown the scripts we use out here: http://oppetid.no/~tobixen/pg_activity_log.txt http://oppetid.no/~tobixen/pg_activity.munin.txt (I had to rename them to .txt to get the web server to play along). Those are very as-is, should certainly be modified a bit to fit to any other production environment. :-) The pg_activity_log dumps a single number indicating the "stress level" of the database to a file. I think this stress number, when taking out i.e. the 20% worst numbers from the file for each day, can indicate something about the performance of the database server. However, I haven't had the chance to discuss it with the bottleneck yet.