Re: ERROR: cannot freeze committed xmax

Sasha Aliashkevich <olsender@xxxxxxxxx> · Fri, 16 Jul 2021 17:51:36 +0200

Hi Alvaro, glad to hear from you!

This database is relatevely young, it was initdb'ed about a year ago or so and it was initially at 10.x. I don't know the exact minor version but the major version was 10 for sure. 

The problematic row is actually visible:

    SELECT COUNT(*) FROM pg_proc WHERE ctid = '(75,19)'

     count 
    -------
         1
    (1 row) 

And it refers to a known custom stored function.

Also I'm not sure if I can really skip freezing at this point because this table is already over autovacuum_freeze_max_age (correct me if I'm wrong):

    SELECT age(c.relfrozenxid), s.setting autovacuum_freeze_max_age 
    FROM pg_class c, pg_settings s 
    WHERE c.relname = 'pg_proc' AND s.name = 'autovacuum_freeze_max_age';

        age    | autovacuum_freeze_max_age 
    -----------+---------------------------
     213791108 | 200000000
    (1 row)

This is a production database which I can not restart at any point unfortunately, so I can't change the global settings atm.

We can sacrifice that function and recreate it, but unfortunately when I execute DROP it fails with the error:

    ERROR:  could not find tuple for rule 16396

I guess it's because of the broken pages at pg_depend.
Do you think it is safe to execute DELETE on the corresponding records in pr_proc and pro_depend? Would it update the broken xmax?

I feel like I'm dealing hear with some sort of data curruption.
I recall that few months ago we already had issues with the same function and view based on it. At that point for some reason it just stopped working. So we had to DROP and RECREATE it.
It fixed the issue, but apparently the data inconsistency stayed on disk and now we've got to the situation where AUTOVACUUM stopped working and it makes me really nervous.

> On 15 Jul 2021, at 15:41, Alvaro Herrera <alvherre@xxxxxxxxxxxxxx> wrote:
> 
> One thing I forgot is that these XIDs are fairly old, perhaps dating
> back to when this database was freshly initdb'd if there has been no XID
> wraparound.  In that case you were probably running a version much older
> than 10.14 when they were written.  Do you happen to know when did you
> initdb this, with what version, when did you upgrade this to 10.14?
> That may help search the commit log for bugfixes that might explain the
> bug.  I just remembered this one as my favorite candidate:
> 
> Author: Alvaro Herrera <alvherre@xxxxxxxxxxxxxx>
> Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300
> Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300
> Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300
> 
>    Don't mark pages all-visible spuriously
> 
>    Dan Wood diagnosed a long-standing problem that pages containing tuples
>    that are locked by multixacts containing live lockers may spuriously end
>    up as candidates for getting their all-visible flag set.  This has the
>    long-term effect that multixacts remain unfrozen; this may previously
>    pass undetected, but since commit XYZ it would be reported as
>      "ERROR: found multixact 134100944 from before relminmxid 192042633"
>    because when a later vacuum tries to freeze the page it detects that a
>    multixact that should have gotten frozen, wasn't.
> 
>    Dan proposed a (correct) patch that simply sets a variable to its
>    correct value, after a bogus initialization.  But, per discussion, it
>    seems better coding to avoid the bogus initializations altogether, since
>    they could give rise to more bugs later.  Therefore this fix rewrites
>    the logic a little bit to avoid depending on the bogus initializations.
> 
>    This bug was part of a family introduced in 9.6 by commit a892234f830e;
>    later, commit 38e9f90a227d fixed most of them, but this one was
>    unnoticed.
> 
>    Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera
>    Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera
>    Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@xxxxxxxxxx
> 
> 
> -- 
> Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
> "El número de instalaciones de UNIX se ha elevado a 10,
> y se espera que este número aumente" (UPM, 1972)