Re: re PG 9.6x and found xmin from before relfrozenxid and removal of pg_internal.init file(s)

Tom Lane <tgl@xxxxxxxxxxxxx> · Mon, 28 Sep 2020 14:29:29 -0400

Reid Thompson <Reid.Thompson@xxxxxxxxxxxx> writes:
> On Mon, 2020-09-28 at 12:15 -0400, Tom Lane wrote:
>> I'm a bit dubious that that'd actually help, but it's perfectly safe
>> if you want to try it.  pg_internal.init is just a cache file that
>> will be rebuilt if it's missing.

> appears to allow vacuum to complete... and stops the error messages
> to the log file.
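
For anyone else trying this, a minimal sketch for locating the cache files first.  It assumes the usual data directory layout (the shared file in global/, one per-database file in each base/<dboid>/ directory), and the function name is made up for illustration.  It only lists files, it does not remove anything, and it does not look at databases whose default tablespace lives under pg_tblspc/.

```python
from pathlib import Path


def find_relcache_init_files(data_dir):
    """Return the pg_internal.init files found under a data directory.

    Assumption: standard layout, i.e. global/pg_internal.init for shared
    catalogs and base/<dboid>/pg_internal.init for each database.
    """
    root = Path(data_dir)
    # The shared relcache init file, if present, is listed first.
    candidates = [root / "global" / "pg_internal.init"]
    base = root / "base"
    if base.is_dir():
        # One local init file may exist per database directory.
        candidates.extend(sorted(base.glob("*/pg_internal.init")))
    # Keep only files that actually exist right now.
    return [path for path in candidates if path.is_file()]
```

Calling find_relcache_init_files("/path/to/data") returns a list of Path objects, empty if no init files are present.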

Ah, after digging around in our git history, this symptom seems to match
this bug fix:

Author: Andres Freund <andres@xxxxxxxxxxx>
Branch: master Release: REL_11_BR [a54e1f158] 2018-06-12 11:13:21 -0700
Branch: REL_10_STABLE Release: REL_10_5 [2ce64caaf] 2018-06-12 11:13:21 -0700
Branch: REL9_6_STABLE Release: REL9_6_10 [6a46aba1c] 2018-06-12 11:13:21 -0700
Branch: REL9_5_STABLE Release: REL9_5_14 [14b3ec6f3] 2018-06-12 11:13:21 -0700
Branch: REL9_4_STABLE Release: REL9_4_19 [817f9f9a8] 2018-06-12 11:13:22 -0700
Branch: REL9_3_STABLE Release: REL9_3_24 [9b9b622b2] 2018-06-12 11:13:22 -0700

    Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

    When vacuum processes a relation it uses the corresponding relcache
    entry's relfrozenxid / relminmxid as a cutoff for when to remove
    tuples etc. Unfortunately for nailed relations (i.e. critical system
    catalogs) bugs could frequently lead to the corresponding relcache
    entry being stale.

    This set of bugs could cause actual data corruption as vacuum would
    potentially not remove the correct row versions, potentially reviving
    them at a later point.  After 699bf7d05c some corruptions in this vein
    were prevented, but the additional error checks could also trigger
    spuriously. Examples of such errors are:
      ERROR: found xmin ... from before relfrozenxid ...
    and
      ERROR: found multixact ... from before relminmxid ...
    To be caused by this bug the errors have to occur on system catalog
    tables.

    The two bugs are:

    1) Invalidations for nailed relations were ignored, based on the
       theory that the relcache entry for such tables doesn't
       change. Which is largely true, except for fields like relfrozenxid
       etc.  This means that changes to relations vacuumed in other
       sessions weren't picked up by already existing sessions.  Luckily
       autovacuum doesn't have particularly long-running sessions.

    2) For shared *and* nailed relations, the shared relcache init file
       was never invalidated while running.  That means that for such
       tables (e.g. pg_authid, pg_database) it's not just already existing
       sessions that are affected, but even new connections are as well.
       That explains why the reports usually were about pg_authid et al.

    To fix 1), revalidate the rd_rel portion of a relcache entry when
    invalid. This implies a bit of extra complexity to deal with
    bootstrapping, but it's not too bad.  The fix for 2) is simpler,
    simply always remove both the shared and local init files.

    Author: Andres Freund
    Reviewed-By: Alvaro Herrera
    Discussion:
        https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@xxxxxxxxxxxxxxxxx
        https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@xxxxxxxxxxxxxx
        https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@xxxxxxxxxxxxxx
        https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@xxxxxxxxxxxxxx
    Backpatch: 9.3-

That bug is pretty narrow, but it explains failures on pg_authid
and related catalogs.  So updating to 9.6.10 or later should be
enough to prevent a recurrence.
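
If you have several servers to check, something along these lines would tell you which ones already carry the fix.  This is only a sketch: the minimum minor releases are taken from the Branch/Release lines of the commit above, the function names are made up for illustration, and the version is assumed to be a plain dotted number such as "9.6.8" or "10.4" (no beta or packaging suffixes).

```python
# First minor release on each back branch that contains the fix, taken from
# the "Branch: ... Release: ..." lines of the commit quoted above.
FIRST_FIXED_MINOR = {
    (9, 3): 24,
    (9, 4): 19,
    (9, 5): 14,
    (9, 6): 10,
    (10,): 5,
}


def parse_version(text):
    """Parse a plain dotted version such as '9.6.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_relcache_fix(version_text):
    """Return True if the given server version includes the fix."""
    version = parse_version(version_text)
    major = version[0]
    if major >= 11:
        # The fix went into master before 11 was released.
        return True
    if major == 10:
        # From 10 onward the branch is the first number alone.
        branch, rest = (10,), version[1:]
    else:
        # Before 10 the branch is the first two numbers, e.g. 9.6.
        branch, rest = version[:2], version[2:]
    minor = rest[0] if rest else 0
    first_fixed = FIRST_FIXED_MINOR.get(branch)
    if first_fixed is None:
        # Branches older than 9.3 did not receive the back-patch.
        return False
    return minor >= first_fixed
```

The version string itself can be obtained with SHOW server_version; anything with a suffix would need to be trimmed before being passed in.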

			regards, tom lane