Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed
with this error:
2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT: Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 HINT: Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 HINT: Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309] root@test_behavior_638162881641031612 PANIC: could not open critical system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG: server process (PID 17309) was terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG: terminating any other active server processes
We had the same thing happen about
a month ago on a different database in the same cluster. For a
while PG actually ran OK as long as you didn't access that
specific DB, but trying to back that DB up with pg_dump
crashed it every time. At that time one of the disks
hosting the ZFS dataset with the PG data directory on it was
reporting errors, so we thought that was the likely cause.
Unfortunately, before we could
replace the disks, PG crashed completely and would not start
again at all, so I had to rebuild the cluster from scratch and
restore from pg_dump backups (still onto the old, bad disks).
Once the disks were replaced (all of them) I just copied the
data to them using zfs send | zfs receive and didn't bother
restoring pg_dump backups again - which was perhaps foolish in
hindsight.
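For reference, that copy was along these lines (pool and
dataset names here are placeholders, not the real ones):

    # dataset names are illustrative only
    zfs snapshot tank/pgdata@migrate
    zfs send tank/pgdata@migrate | zfs receive newpool/pgdata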
Well, yesterday it happened again.
The server still restarted OK, so I took fresh pg_dump backups
of the databases we care about (which ran fine), rebuilt the
cluster and restored the pg_dump backups again - now onto the
new disks, which are not reporting any problems.
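The rebuild and restore was essentially this, per database
(database and file names below are placeholders):

    pg_dump -Fc -d important_db -f important_db.dump
    # drop and re-initialise the cluster, then for each dump:
    createdb important_db
    pg_restore -d important_db important_db.dump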
So while everything is up and
running now, this error has me rather concerned. Could the
error we're seeing now have been caused by corruption that has
been sitting in the PG data for a month (so it could still be
attributed to the bad disk), and which should now be fixed by
having restored from backups onto good disks? Could this be a
PG bug? What can I do to figure out why this is happening and
prevent it from happening again? Advice appreciated!