Our application is such that there is a great deal of activity at the beginning of the hour and minimal activity near the end of the hour. Those 3 vacuums were done at (approximately) 30 minutes past, 40 minutes past, and 50 minutes past the hour, during low activity. Vacuums of pg_class look like they're being done on the hour. So it's not surprising that the first vacuum found a lot of dead rows while the latter two found very few.
In fact, I just did another vacuum (about 30 minutes past the hour again) and got:
INFO: "pg_class": found 5490 removable, 3263 nonremovable row versions in 171 pages
DETAIL: 0 dead row versions cannot be removed yet.
DETAIL: 0 dead row versions cannot be removed yet.
... and clearly a vacuum was done under an hour ago.
The truncate and re-fill process is done once per hour, at the end of the high-load cycle, so I doubt that's even a big contributor to the number of removable rows in pg_class.
For this particular setup, we expect high load for 10-15 minutes at the beginning of the hour, which is the case when a new connection is initialized. After a day or so (as is happening right now), the high-load period spills into the second half of the hour. Within 3-4 days, we start spilling into the next hour and, as you can imagine, everything gets behind and we spiral down from there. For now, our workaround is to manually kill the connection every few days, but I would like a better solution than setting up a cron job to do this!
Thanks again,
Steve
On 12/15/06, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Hm, look at the numbers of rows removed:
> INFO: "pg_class": found 5680 removable, 3263 nonremovable row versions in
> 625 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> INFO: "pg_class": found 24 removable, 3263 nonremovable row versions in 625
> pages
> DETAIL: 0 dead row versions cannot be removed yet.
> INFO: "pg_class": found 0 removable, 3263 nonremovable row versions in 625
> pages
> DETAIL: 0 dead row versions cannot be removed yet.
The lack of unremovable dead rows is good, but why were there so many
dead rows the first time? You didn't say what the cycle time is on your
truncate-and-refill process, but the last two suggest that the average
rate of accumulation of dead pg_class rows is only a couple per minute,
in which case it's been a lot longer than an hour since the previous
VACUUM of pg_class. I'm back to suspecting that you don't vacuum
pg_class regularly. You mentioned having an hourly cron job to fire off
vacuums ... are you sure it's run as a database superuser?
regards, tom lane