Transaction wraparound problem due to wrong datfrozenxid?

Arctic Toucan <arctic_toucan@xxxxxxxxxxx> · Thu, 17 Nov 2011 09:08:36 -0500

I ran into a rather unusual problem today where  Postgres brought down a database to avoid transaction wraparound in a situation where it doesn't appear that it should have.

The error in the log is explicit enough...

Nov 16 04:00:03 SRP1 postgres[58101]: [1-1] FATAL:  database is not accepting commands to avoid wraparound data loss in database "stat"
Nov 16 04:00:03 SRP1 postgres[58101]: [1-2] HINT:  Stop the postmaster and use a standalone backend to vacuum database "stat".

Yet, going back several days in the logs, there were none of the usual WARNING messages in the log about this situation occurring in xxx transactions.

When I query datfrozenxid value in pg_database it certainly showed a problem.

backend> SELECT datname,datfrozenxid, age(datfrozenxid) FROM pg_database;
         1: datname     (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid        (typeid = 28, len = 4, typmod = -1, byval = t)
         3: age (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "postgres"        (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid = "2699851604" (typeid = 28, len = 4, typmod = -1, byval = t)
         3: age = "1269165380"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "stat" (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid = "1822525199" (typeid = 28, len = 4, typmod = -1, byval = t)
***         3: age = "2146491785"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "config"       (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid = "3869013990" (typeid = 28, len = 4, typmod = -1, byval = t)
         3: age = "100002994"   (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "template1"       (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid = "2000352260" (typeid = 28, len = 4, typmod = -1, byval = t)
         3: age = "1968664724"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "template0"       (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid = "2000357564" (typeid = 28, len = 4, typmod = -1, byval = t)
         3: age = "1968659420"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----

I then ran a query to check the age of all the relfrozenxid on each of the tables in the DB and it indicated that they were all OK.

backend> select sum(case when age(relfrozenxid) > 2000000000 then 1 else 0 end) as gt_2billion, sum(case when age(relfrozenxid) between 1500000000 and 2000000000 then 1 else 0 end) as gt_1_5billion, sum(case when age(relfrozenxid) between 1000000000 and 1500000000 then 1 else 0 end) as gt_1billion, sum(case when age(relfrozenxid) between 500000000 and 1000000000 then 1 else 0 end) as gt_500million, sum(case when age(relfrozenxid) between 100000000 and 500000000 then 1 else 0 end) as gt_100million, sum(case when age(relfrozenxid) < 100000000 then 1 else 0 end) as lt_100million from pg_class where relkind in ('r','t');
         1: gt_2billion (typeid = 20, len = 8, typmod = -1, byval = f)
         2: gt_1_5billion       (typeid = 20, len = 8, typmod = -1, byval = f)
         3: gt_1billion (typeid = 20, len = 8, typmod = -1, byval = f)
         4: gt_500million       (typeid = 20, len = 8, typmod = -1, byval = f)
         5: gt_100million       (typeid = 20, len = 8, typmod = -1, byval = f)
         6: lt_100million       (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
         1: gt_2billion = "0"   (typeid = 20, len = 8, typmod = -1, byval = f)
         2: gt_1_5billion = "0" (typeid = 20, len = 8, typmod = -1, byval = f)
         3: gt_1billion = "0"   (typeid = 20, len = 8, typmod = -1, byval = f)
         4: gt_500million = "628"       (typeid = 20, len = 8, typmod = -1, byval = f)
         5: gt_100million = "8928"      (typeid = 20, len = 8, typmod = -1, byval = f)
         6: lt_100million = "0" (typeid = 20, len = 8, typmod = -1, byval = f)
        ----

I confirmed this with...

 backend>  select relname,relfrozenxid,age(relfrozenxid) from pg_class where relkind in('r','t') order by 3 desc limit 1;
          1: relname     (typeid = 19, len = 64, typmod = -1, byval = f)
          2: relfrozenxid        (typeid = 28, len = 4, typmod = -1, byval = t)
          3: age (typeid = 23, len = 4, typmod = -1, byval = t)
         ----
          1: relname = "qoe_flowbwidth_dist_dig1_014"    (typeid = 19, len = 64, typmod = -1, byval = f)
          2: relfrozenxid = "2970264132" (typeid = 28, len = 4, typmod = -1, byval = t)
          3: age = "998752902"   (typeid = 23, len = 4, typmod = -1, byval = t)
         ----

My understanding has always been that the datfrozenxid should match this relfrozenxid, which it evidently doesn't. 

My environment:

FreeBSD 6
PG 8.2.4(Yes, I intend to upgrade, which would be helped if someone can say that this problem is fixed in some future patch/release release)

approx. 300GB database, with table partitioning, and lots of data inserts, very few updates. It has been running for a couple of years with relatively few problems. Yeh Postgres! 

autovacuum enabled

MY QUESTION: Can anyone offer up ideas on what could cause this situation in a seemingly stable system? I'm just trying to provide a "Root Cause Analysis" to management and I don't know quite how to explain the root cause :-)

Additional information...

I have warm standby set up on this database and failing over to the standby resulted in exactly the same problem(No real surprise).

I inadvertently, "fixed" the problem by running a vacuum command against a single existing table while in single-user mode. Voila, the datfrozenxid matched the oldest relfrozenxid and I could restart the DB. Since I had a standby database in the same state, I tried running a vacuum there against a sequence instead of a table. That also fixed the problem somehow even though it wasn't really a legitimate command which makes me think it may be some difference in autovacuum versus vacuum processing, but I am no C-programmer to debug that.

 autovacuum_freeze_max_age 
---------------------------
 1000000000
(1 row)

 vacuum_freeze_min_age 
-----------------------
 100000000
(1 row)

Thx...Mark