Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed
with this error:
2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT: Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 HINT: Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 HINT: Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309] root@test_behavior_638162881641031612 PANIC: could not open critical system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG: server process (PID 17309) was terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG: terminating any other active server processes
We had the same thing happen about
a month ago on a different database in the same cluster. For a
while PG actually ran OK as long as you didn't access that
specific DB, but trying to back that DB up with pg_dump
crashed it every time. At that time one of the disks
hosting the ZFS dataset with the PG data directory on it was
reporting errors, so we thought that was the likely cause.
Unfortunately, before we could
replace the disks, PG crashed completely and would not start
again at all, so I had to rebuild the cluster from scratch and
restore from pg_dump backups (still onto the old, bad disks).
Once the disks were replaced (all of them) I just copied the
data to them using zfs send | zfs receive and didn't bother
restoring pg_dump backups again - which was perhaps foolish in
hindsight.
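For reference, that copy was along these lines (pool and
dataset names here are placeholders, not the real ones):

    # dataset names are illustrative only
    zfs snapshot tank/pgdata@migrate
    zfs send tank/pgdata@migrate | zfs receive newpool/pgdata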
Well, yesterday it happened again.
The server still restarted OK, so I took fresh pg_dump backups
of the databases we care about (which ran fine), rebuilt the
cluster and restored the pg_dump backups again - now onto the
new disks, which are not reporting any problems.
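The rebuild and restore was essentially this, per database
(database and file names below are placeholders):

    pg_dump -Fc -d important_db -f important_db.dump
    # drop and re-initialise the cluster, then for each dump:
    createdb important_db
    pg_restore -d important_db important_db.dump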
So while everything is up and
running now, this error has me rather concerned. Could the
error we're seeing now have been caused by corruption that has
been sitting in the PG data for a month (so it could still be
attributed to the bad disk), and which should now be fixed by
having restored from backups onto good disks? Could this be a
PG bug? What can I do to figure out why this is happening and
prevent it from happening again? Advice appreciated!