Re: help with data recovery from injected UPDATE

Gus Gutoski <shared.entanglement@xxxxxxxxx> · Sun, 14 Jun 2009 10:32:38 -0400

Merlin Moncure wrote:
>> postgresql  8.1 supports pitr archiving.  you can
>> do continuous backups and restore the database to just before the bad
>> data.

I tried using point-in-time-recovery to restore the state of the
database immediately before the corruption.  It didn't work, but it
was quite a show.  Here's the story.

After much wailing and gnashing of teeth, I got postmaster to execute
a recovery (so that recovery.conf was renamed to recovery.done).  But
the database was completely screwed after the recovery.  Here's an
example of the kind of output I saw while executing a simple SELECT
statement:

postgres=# SELECT entry_date,machine_id,coin FROM collections WHERE
entry_date::date>'2009-06-06' ORDER BY entry_date;
WARNING:  could not write block 32 of 1663/10793/2608
DETAIL:  Multiple failures --- write error may be permanent.
ERROR:  xlog flush request 0/4DC6CC88 is not satisfied --- flushed
only to 0/4DC06180
CONTEXT:  writing block 32 of relation 1663/10793/2608

Here's the recovery procedure I followed:
0. Shortly after the corruption on June 9, 2009, I shut down the
server and backed up the entire data directory.  The recovery
procedure described herein begins with this file system backup.
1. The most recent non-corrupted snapshot of the database is a pg_dump
from May 13, 2009.  (I don't have any file system backups from before
the corruption.)  I restored the database to this snapshot by
executing the commands from the May 13 pg_dump on the June 9 corrupted
data.
2.  I removed the files in the pg_xlog directory and replaced them
with the contents of pg_xlog from the corrupted file system backup
from June 9.
3.  I modified the sample recovery.conf file so as to replay all the
transactions right up until the point of corruption.  The hope was
that postmaster would somehow know to begin replaying transactions at
the appropriate point from the May 13 state.

I guess it's too much to ask postmaster to do a PITR from a pg_dump
backup, as opposed to a file system backup.  Bummer.

By the way, I can reliably get postmaster to hang during startup if I
manually create the pg_xlog\RECOVERYHISTORY and pg_xlog\RECOVERYXLOG
directories (even with correct permissions) before starting up the
server.  When I say "hang", I mean that
(i) any attempt to connect is met with the response "FATAL: the
database system is starting up", and
(ii) "pg_ctl stop" cannot be used to shut down the server -- I have to
use "pg_ctl kill"

Anyway, I'm going to try implementing Tom's suggestion of writing a
program to modify the xmin/xmax values.  I expect this approach won't
work, as autovacuum was on at the time of corruption.  However, the
files in the data directory are quite large -- many times larger than
a pg_dump.  The database sees such a small amount of traffic that it's
possible that even vacuum decided not to bother reclaiming the unused
storage created by the corrupting transaction (?).  Here's hoping.

-Gus

On Thu, Jun 11, 2009 at 1:43 PM, Gus Gutoski wrote:
> Thanks for the replies.
>
> Tom Lane wrote:
>> This being 8.1, if you haven't turned on autovacuum there is some chance
>> of that.
>
> Unfortunately, autovacuum was on.  I don't recall ever turning it on,
> but this database is over two years old; it's possible that I blindly
> followed advice from pgAdmin or something way back when.
>
> Merlin Moncure wrote:
>> does postgresql 8.1 have last_vacuum time in the pg_stat_all_tables
>> coiumn?
>
> I did not see a column called last_vacuum in the pg_stat_all_tables table.
>
>> postgresql  8.1 supports pitr archiving.  you can
>> do continuous backups and restore the database to just before the bad
>> data.
>
> I'm learning about this now.  I'm cautiously optimistic, as my pg_xlog
> directory contains some files whose timestamp is near the time of the
> 'incident'.
>
> By "backup" do you mean the contents of a pg_dump?  The most recent
> dump was two months ago.  I'm worried that my log files might not go
> far enough back in time to restore the table from the most recent
> dump.
>
> Both Tom's and Merlin's suggestions carry a significant learning
> curve.  I'll do what I can in the coming days and post to the list if
> anything noteworthy happens.
>
>> plus, there is no way you are escaping the obligatory 'where are your
>> backups?'.  :-).
>
> It's a classic story.  I'm volunteering about one day per month for
> this project, learning SQL as I go.  Priority was always given to the
> "get it working" tasks and never the "make it safe" tasks.  I had/have
> grandiose plans to rewrite the whole system properly after I graduate.
>  Unfortunately, the inevitable corruption didn't wait that long.
>
> Cheers.
>
> -Gus
>

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general