Re: corrupted item pointer in streaming based replication

Lonni J Friedman <netllama@xxxxxxxxx> · Wed, 3 Apr 2013 13:06:13 -0700

You should figure out what base/16384/114846.39 corresponds to inside
the database.  If you're super lucky it's something unimportant and/or
something that can be recreated easily (like an index).  If it's
something important, then your only option is to try to drop the
object and restore it from the last known good backup.
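
A minimal sketch of that lookup, assuming the path follows the usual
base/<database OID>/<relfilenode>.<segment> layout. The helper names
(parse_relation_path, lookup_queries) are made up for illustration and
are not part of PostgreSQL; the script only prints catalog queries for
you to run yourself in psql.

```python
import re

# Hypothetical helper, not part of PostgreSQL. Assumes the default
# tablespace layout: base/<database OID>/<relfilenode>[_fork][.<segment>]
_PATH_RE = re.compile(r"^base/(\d+)/(\d+)(?:_(fsm|vm|init))?(?:\.(\d+))?$")


def parse_relation_path(path):
    """Split a data-file path from an error message into its parts."""
    m = _PATH_RE.match(path)
    if m is None:
        raise ValueError("not a base/<db>/<relfilenode> path: %r" % path)
    db_oid, filenode, fork, segment = m.groups()
    return {
        "database_oid": int(db_oid),
        "relfilenode": int(filenode),
        "fork": fork or "main",
        # No ".N" suffix means the first segment file (segment 0).
        "segment": int(segment) if segment else 0,
    }


def lookup_queries(path):
    """Build the catalog queries that identify the database and relation."""
    info = parse_relation_path(path)
    return [
        # Run this one from any database in the cluster.
        "SELECT datname FROM pg_database WHERE oid = %d;"
        % info["database_oid"],
        # Run this one while connected to the database found above.
        # relkind: 'r' = table, 'i' = index, 't' = TOAST table.
        "SELECT n.nspname, c.relname, c.relkind "
        "FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace "
        "WHERE c.relfilenode = %d;" % info["relfilenode"],
    ]


if __name__ == "__main__":
    for query in lookup_queries("base/16384/114846.39"):
        print(query)
```

If the second query returns no row, the file may belong to a relation
whose relfilenode has since changed (for example after a rewrite), so
treat an empty result as inconclusive rather than as proof that the
file is unused.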

On Wed, Apr 3, 2013 at 1:02 PM, Jigar Shah <jshah@xxxxxxxxxxx> wrote:
> Hi,
>
> Postgres version = 9.1.2
> OS = debian(6.0.7)
> fsync = on
> full_page_writes = on
> Setup = Primary and streaming replication based secondary
>
> A few days ago our Primary started to throw the error messages below,
> indicating corruption in the database. It sometimes crashed and showed
> a panic message in the logs
>
> 2013-03-25 07:30:39.545 PDT PANIC:  corrupted item pointer: offset = 0, size
> = 0
> 2013-03-25 07:30:39.704 PDT LOG:  server process (PID 8715) was terminated
> by signal 6: Aborted
> 2013-03-25 07:30:39.704 PDT LOG:  terminating any other active server
> processes
>
> Days before it started to crash it showed the below error messages in the
> logs.
>
> [d: u:postgres p:2498 7] ERROR: could not access status of transaction
> 837550133
> DETAIL: Could not open file "pg_clog/031E": No such file or directory.
> [u:postgres p:2498 9]
>
> [d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39"
> (target block 360448000): No such file or directory [d: u:radio p:31917 243]
>
> On top of that, our secondaries have now crashed and will not start up;
> they show the error messages below in the pg logs.
>
> 2013-03-27 11:00:47.281 PDT LOG:  recovery restart point at 161A/17108AA8
> 2013-03-27 11:00:47.281 PDT DETAIL:  last completed transaction was at log
> time 2013-03-27 11:00:47.241236-07
> 2013-03-27 11:00:47.520 PDT LOG:  restartpoint starting: xlog
>
> 2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0, size
> = 0
> 2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel
> 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0,
> firstright 192
> 2013-03-27 11:07:51.716 PDT LOG:  startup process (PID 5959) exited with
> exit code 1
> 2013-03-27 11:07:51.716 PDT LOG:  terminating any other active server
> processes
>
> At this point we have a running but corrupt primary and a crashed
> secondary that won't start up.
>
> I am wondering what our options are at this point. Can we do something to
> fix this? How can we recover from the corruption?
>
> Thanks for help in advance.
>
> Regards
> Jigar
>
>

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@xxxxxxxxx
LlamaLand                       https://netllama.linux-sxs.org
