Autovaccum failure with Postgres 8.3.14? (max-fsm_pages issue?)

Subhasis Bhattacharya <subhasis.bhattacharya@xxxxxxxxx> · Fri, 23 Nov 2012 17:02:37 +0530

Hi All,
Recently a postgres database embedded within our product faced a series of hiccups at a customer site, as follows:

For about 24 h ours the postgres log file had logged errors like:

2012-09-24 00:00:12 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:13 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:15 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:16 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:20 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:21 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:22 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:22 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:23 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error
2012-09-24 00:00:24 GMTLOG:  could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": Input/output error

....
Its an extremely volatile database with rows deleted very often....

After about 24 hours we saw this:
2012-10-01 00:19:21 GMTWARNING:  relation "pg_toast.pg_toast_16509" contains more than "max_fsm_pages" pages with useful free space
2012-10-01 00:19:21 GMTHINT:  Consider using VACUUM FULL on this relation or increasing the configuration parameter "max_fsm_pages".
2012-10-01 00:57:12 GMTWARNING:  relation "pbs.job_attr" contains more than "max_fsm_pages" pages with useful free space
2012-10-01 00:57:12 GMTHINT:  Consider using VACUUM FULL on this relation or increasing the configuration parameter "max_fsm_pages".
2012-10-01 01:09:30 GMTWARNING:  relation "pg_toast.pg_toast_16509" contains more than "max_fsm_pages" pages with useful free space
2012-10-01 01:09:30 GMTHINT:  Consider using VACUUM FULL on this relation or increasing the configuration parameter "max_fsm_pages".
2012-10-01 01:18:45 GMTERROR:  canceling autovacuum task

So looks like autovacuum encountered  a situation it could not handle and was asking for a manual vacuum full. Queries henceforth ran very very slow, since autovacuum kept popping up and cancelling itself...

My question to the postgres guru's or developer's:

- I read that the autovacuum daemon depends on the statistics collector to do its job properly. Could it be that the fact that the collector could not update the pgstat file earlier for over 24 hours or so, led autovacuum to NOT do its job leading to the situation where autovacuum could no longer handle it....

- In a normal situation, should we assume that autovacuum must be able to do it job without needing a manual vacuum full ever?

Thanks and Regards,
Subhasis Bhattacharya