I'm new to the list. So hello everyone. This morning my monitoring system sent me an alert that my warm standby database had exited recovery mode. I checked on it when I woke up and sure enough, it had stopped looking for new WAL files and became available for queries. After some digging, I uncovered these details, in a chronological order: *NAGIOS* Date/Time: Thu Jan 29 12:36:04 UTC 2009 DISK CRITICAL - free space: /var/lib/postgresql 35 MB (0% inode=99%): ** MYSTERY PERIOD ** (two crontabs to do WAL maintance are scheduled for 04:00, not applicable to this situation) *NAGIOS* Date/Time: Thu Jan 29 12:49:05 UTC 2009 DISK OK - free space: /var/lib/postgresql 14129 MB (29% inode=99%) * me working in a shell on the host * # date Thu Jan 29 15:33:35 UTC 2009 # df -h /dev/sdc1 Filesystem Size Used Avail Use% Mounted on /dev/sdc1 50G 3.3G 44G 7% /var/lib/postgresql /var/log/postgresql/postgresql-8.3-main.log 2009-01-29 12:41:36 UTC LOG: restored log file "000000010000002F000000A9" from archive size= [16777216] copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG `/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG' `/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/./000000010000002F000000AA' cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device 2009-01-29 12:48:14 UTC LOG: could not read from log file 47, segment 170, offset 3129344: Success 2009-01-29 12:48:14 UTC LOG: redo done at 2F/AA2FBE08 2009-01-29 12:48:14 UTC LOG: last completed transaction was at log time 2009-01-29 12:42:50.572406+00 size= [16777216] copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG `/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG' cp: overwrite `pg_xlog/./000000010000002F000000AA'? 2009-01-29 12:48:14 UTC LOG: restored log file "000000010000002F000000AA" from archive pg_xlog/RECOVERYHISTORY 2009-01-29 12:48:14 UTC LOG: selected new timeline ID: 2 pg_xlog/RECOVERYHISTORY 2009-01-29 12:48:14 UTC LOG: archive recovery complete # /etc/init.d/postgresql-8.3 start ... # ps ax | grep postgres 3499 ? Ss 0:00 postgres: startup process waiting for 000000020000002F000000AA 3503 ? S 0:00 /bin/bash /var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG 000000020000002F000000AA WTF? (before I knew about timelines) # ls -l /mnt/pg_xlog/000000020000002F000000AA ls: cannot access /mnt/pg_xlog/000000020000002F000000AA: No such file or directory # updatedb # locate 000000020000002F000000AA (nothing) # rm /var/lib/postgresql/8.3/main/pg_xlog/0* (perhaps a bad move? I still have these in /mnt/pg_xlog) # /etc/init.d/postgresql-8.3 restart Same problem. WTF? # rm /var/lib/postgresql/8.3/main/pg_xlog/archive_status/00000002.history.ready # /etc/init.d/postgresql-8.3 restart # ps ax | grep postgres 3499 ? Ss 0:00 postgres: startup process waiting for 000000020000002F000000AA 3503 ? S 0:00 /bin/bash /var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG 000000020000002F000000AA Why is it looking for 000000020000002F000000AA? It should be looking for 000000010000002F000000AA. This seems to be a history problem with timelines. Recovery branched into a new timeline, which is where it is now. I need to force it back to the previous timeline so it can resume recovery. I'm building a clean recovery database now but I'd like to get to the bottom of this and bring the failed recovery database back into recovery mode so I can learn how to get out of this situation more gracefully. Thanks for any help you can offer. _____________ Lee Azzarello Staff Hacker drop.io -- Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin