Warm standby recovery failure

Lee Azzarello <lee@xxxxxxxxxx> · Thu, 29 Jan 2009 14:47:14 -0500

I'm new to the list. So hello everyone. This morning my monitoring
system sent me an alert that my warm standby database had exited
recovery mode. I checked on it when I woke up and sure enough, it had
stopped looking for new WAL files and became available for queries.
After some digging, I uncovered these details, in a chronological
order:

*NAGIOS*
Date/Time: Thu Jan 29 12:36:04 UTC 2009
DISK CRITICAL - free space: /var/lib/postgresql 35 MB (0% inode=99%):

** MYSTERY PERIOD **
(two crontabs to do WAL maintance are scheduled for 04:00, not
applicable to this situation)

*NAGIOS*
Date/Time: Thu Jan 29 12:49:05 UTC 2009
DISK OK - free space: /var/lib/postgresql 14129 MB (29% inode=99%)

* me working in a shell on the host *
# date
Thu Jan 29 15:33:35 UTC 2009
# df -h /dev/sdc1
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc1              50G  3.3G   44G   7% /var/lib/postgresql

/var/log/postgresql/postgresql-8.3-main.log
2009-01-29 12:41:36 UTC LOG:  restored log file
"000000010000002F000000A9" from archive
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/./000000010000002F000000AA'
cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
2009-01-29 12:48:14 UTC LOG:  could not read from log file 47, segment
170, offset 3129344: Success
2009-01-29 12:48:14 UTC LOG:  redo done at 2F/AA2FBE08
2009-01-29 12:48:14 UTC LOG:  last completed transaction was at log
time 2009-01-29 12:42:50.572406+00
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
cp: overwrite `pg_xlog/./000000010000002F000000AA'? 2009-01-29
12:48:14 UTC LOG:  restored log file "000000010000002F000000AA" from
archive
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG:  selected new timeline ID: 2
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG:  archive recovery complete

# /etc/init.d/postgresql-8.3 start
...
# ps ax | grep postgres
 3499 ?        Ss     0:00 postgres: startup process   waiting for
000000020000002F000000AA
 3503 ?        S      0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA

WTF? (before I knew about timelines)

# ls -l /mnt/pg_xlog/000000020000002F000000AA
ls: cannot access /mnt/pg_xlog/000000020000002F000000AA: No such file
or directory

# updatedb
# locate 000000020000002F000000AA
(nothing)
# rm /var/lib/postgresql/8.3/main/pg_xlog/0*
(perhaps a bad move? I still have these in /mnt/pg_xlog)
# /etc/init.d/postgresql-8.3 restart

Same problem. WTF?

# rm /var/lib/postgresql/8.3/main/pg_xlog/archive_status/00000002.history.ready
# /etc/init.d/postgresql-8.3 restart
# ps ax | grep postgres
 3499 ?        Ss     0:00 postgres: startup process   waiting for
000000020000002F000000AA
 3503 ?        S      0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA

Why is it looking for 000000020000002F000000AA? It should be looking
for 000000010000002F000000AA.
This seems to be a history problem with timelines. Recovery branched
into a new timeline, which is where it is now. I need to force it back
to the previous timeline so it can resume recovery. I'm building a
clean recovery database now but I'd like to get to the bottom of this
and bring the failed recovery database back into recovery mode so I
can learn how to get out of this situation more gracefully.

Thanks for any help you can offer.
_____________
Lee Azzarello
Staff Hacker
drop.io

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin