Re: Template zero xid issue

"Keaton Adams" <kadams@xxxxxxxxxxx> · Mon, 6 Aug 2007 13:49:17 -0600

Auto vacuum is not enabled:

#-----------------------------------------------------------------------
----
# AUTOVACUUM PARAMETERS
#-----------------------------------------------------------------------
----

#autovacuum = off                       # enable autovacuum subprocess?
#autovacuum_naptime = 60                # time between autovacuum runs,
in secs
#autovacuum_vacuum_threshold = 1000     # min # of tuple updates before
                                        # vacuum
#autovacuum_analyze_threshold = 500     # min # of tuple updates before
                                        # analyze
#autovacuum_vacuum_scale_factor = 0.4   # fraction of rel size before
                                        # vacuum
#autovacuum_analyze_scale_factor = 0.2  # fraction of rel size before
                                        # analyze
#autovacuum_vacuum_cost_delay = -1      # default vacuum cost delay for
                                        # autovac, -1 means use
                                        # vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for
                                        # autovac, -1 means use
                                        # vacuum_cost_limit

Instead, the vacuumdb command is run on the DBs to perform a full DB
vacuum every Saturday. The DB template0 is skipped by the vacuumdb
process, according to the output:

/xyz/var/pgsql/bin/vacuumdb -a -z -U postgres
vacuumdb: vacuuming database "postgres"
VACUUM
vacuumdb: vacuuming database "mxl"
VACUUM
vacuumdb: vacuuming database "template1"
VACUUM

1. Ops has a job that monitors datfrozenxid. An event was triggered this
morning because the count on template0 was approaching the limit.  I
don't have the original output, but it looked something like this:

mxl=# select datname, age(datfrozenxid) from pg_database;
   datname    |    age
--------------+------------
 postgres     | 1478038975
 mxl_survey   | 1478038263
 mxl          | 1478036889
 template1    | 1478010730
 template0    | 2146747335

2. As we were attempting to figure out why this was the case, Postgres
stopped accepting commands, as reported in the log:

<2007-08-06 09:42:50 MDT>ERROR:  database is not accepting commands to
avoid wraparound data loss in database "template0"
<2007-08-06 09:42:50 MDT>HINT:  Stop the postmaster and use a standalone
backend to vacuum database "template0".

3. One of the admins initiated a shutdown request to the DB.  The
shutdown was NOT automatically done by Postgres as I had originally
stated, but it became necessary nonetheless:

<2007-08-06 09:44:05 MDT>LOG:  received fast shutdown request
<2007-08-06 09:44:05 MDT>LOG:  aborting any active transactions
<2007-08-06 09:44:05 MDT>FATAL:  terminating connection due to
administrator command
<2007-08-06 09:44:05 MDT>FATAL:  terminating connection due to
administrator command
<2007-08-06 09:44:05 MDT>FATAL:  terminating connection due to
administrator command

4. During the shutdown we continued to receive these messages in the
log:

<2007-08-06 09:44:08 MDT>LOG:  transaction ID wrap limit is 2147484146,
limited by database "template0"
<2007-08-06 09:44:08 MDT>WARNING:  database "template0" must be vacuumed
within 736811 transactions
<2007-08-06 09:44:08 MDT>HINT:  To avoid a database shutdown, execute a
full-database VACUUM in "template0".

5. Once the DB was down, a standalone backend was used to execute a
vacuum against template0:

sudo -u postgres /mxl/var/pgsql/bin/postgres -D /mxl/var/pgsql/data -O
-P template0

>VACUUM VERBOSE ANALYZE; 

6. The DB started successfully and normal operations resumed.

We do have a second 8.1.4 DB running the same type of transaction load
and the datfrozenxid query shows template0 having a larger 'age' value
than any of the other databases in the cluster:

postgres=# select datname, age(datfrozenxid) from pg_database;
   datname    |    age
--------------+------------
 postgres     | 1108184564
 mxl          | 1108182478
 template1    | 1108156319
 template0    | 1497605405

So before it grows into the 2.1 billion value we'll need to do a manual
reset of the FrozenXID on this DB as well.

Of course, we are looking into an 8.1.9 upgrade to avoid this problem in
the future.

Thanks,

Keaton

-----Original Message-----
From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx] 
Sent: Monday, August 06, 2007 1:11 PM
To: Keaton Adams
Cc: Joshua D. Drake; Alvaro Herrera; pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  Template zero xid issue 

"Keaton Adams" <kadams@xxxxxxxxxxx> writes:
> Do I have this right?

It's not clear.  You didn't answer the question about whether you were
running autovac, but even if you were I'm not sure why you got the
complaint about template0.  AFAICS the only way for template0 to be
installed as oldest_datname is if vac_truncate_clog() is invoked in
template0, allowing it to be the initial selection before the loop
starts.  Which could happen if autovac is on, since it will try to
vacuum template0, and in 8.1.4 it neglects to use the FREEZE option.
But even then, if we just finished a DB-wide vacuum on template0,
it should not have had the oldest datfrozenxid; and certainly not one
old enough to provoke complaints.

Actually there's a whole lot that isn't clear about this.  Your first
report showed a warning --- not an error --- about template0, so it's
not at all clear that the "database system is shutting down" messages
had anything to do with that at all.  And you later posted some queries
that weren't drawing any such message.  What was the exact sequence of
events here?

Alvaro: you have any thoughts about this?  I'm suspicious that we ought
to change 8.1's vac_truncate_clog() to not include the current database
automatically, since it might be not-datallowconn; but I don't see how
that potential bug could actually be exposed as a fault.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster