Re: aborting startup due to startup process failure

Tom Arthurs <tarthurs@xxxxxxxxxxxx> · Thu, 28 Jun 2007 14:46:33 -0700

I think you may have a race condition in your code: you don't find the
new file, so you sleep; while you are sleeping, both the new file and the
stop file come in; you wake up, find the stop file, and never copy the
last segment over.
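
One way to close that window is to look for the segment before looking for the stop file on every pass through the loop. The following is only a sketch under stated assumptions, not a tested replacement: the function name restore_segment and the variables WAL_DIR, TRIGGER and POLL_SECS are invented for illustration, and their defaults simply mirror the paths in the quoted script.

```shell
#!/bin/sh
# Sketch: prefer the WAL segment over the stop (trigger) file on every pass.
# WAL_DIR, TRIGGER and POLL_SECS are hypothetical names; the defaults
# mirror the paths and the 30-second sleep used in the quoted script.
WAL_DIR=${WAL_DIR:-/var/ipsc/WAL}
TRIGGER=${TRIGGER:-/var/ipsc/pgsql/trigger}
POLL_SECS=${POLL_SECS:-30}

restore_segment() {
    # $1 = segment file name, $2 = destination path
    while :
    do
        # Check for the segment first, so that a segment which arrived
        # during the sleep is copied even if the trigger file is there too.
        if test -f "$WAL_DIR/$1"
        then
            cp "$WAL_DIR/$1" "$2"
            return $?
        fi

        # No segment available: only now does the trigger end the wait.
        if test -f "$TRIGGER"
        then
            echo "--- trigger found, no more segments ---"
            return 1
        fi

        echo "waiting for file: $1"
        sleep "$POLL_SECS"
    done
}
```

The script would end by calling restore_segment "$1" "$2" and exiting with its status. Returning 1 instead of 133 is a choice made for this sketch: it keeps the status clear of the range from 126 upward, which shells use for "command not found" and for commands killed by a signal.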

George Wilk wrote:

I posted to this group before with the same topic but nobody replied. 
Please, provide some feedback if you can…

I am running a warm standby server, which executes the following
command in recovery mode:

    triggered=false
    while (test ! -f /var/ipsc/WAL/$1 && ! $triggered)
    do
        echo waiting for file: $1

        sleep 30

        if test -f /var/ipsc/pgsql/trigger
        then
            echo --- trigger found ---
            echo --- exiting recovery mode ---
            triggered=true
        fi
    done

    if ( ! $triggered)
    then
        cp /var/ipsc/WAL/$1 $2
    else
        exit 133
    fi

The recovery command works just fine restoring data from the WAL files
scp’d from the primary server. While in recovery mode, when I create the
trigger file to break the while loop in the recovery command, postgres
does not go gently into the active database mode. Here is the output:

    waiting for file: 00000001000000000000003A
    --- trigger found ---
    --- exiting recovery mode ---
    FATAL: could not restore file "00000001000000000000003A" from archive: return code 34048
    LOG: startup process (PID 13994) exited with exit code 1
    LOG: aborting startup due to startup process failure

After finding the trigger file, my recovery_cmd returns a non-zero code.
Why am I still getting "FATAL: could not restore file"?
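
The return code in the FATAL line can be unpacked with a little shell arithmetic. The sketch below assumes the traditional Unix wait-status layout (exit code in bits 8-15, terminating signal in the low 7 bits). Under that assumption, 34048 is just the script's own exit 133 shifted left by eight bits, not some other failure.

```shell
# Decode a raw wait status, assuming the traditional Unix layout:
# exit code in bits 8-15, terminating signal in the low 7 bits.
status=34048
exit_code=$(( (status >> 8) & 255 ))
term_signal=$(( status & 127 ))
echo "exit code: $exit_code, signal: $term_signal"
# prints "exit code: 133, signal: 0"
```

Note that 133 is 128 + 5, and by shell convention a status above 128 means the command was killed by a signal. Whether the server treats such a status as a hard failure, rather than as "no more files to restore", is an assumption worth checking against the restore_command documentation. Trying a small exit code such as 1 would be a cheap test.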

Both my primary and standby servers run on Solaris 10 under SMF. When 
the standby server is attempting to change mode from recovery to 
regular database mode, there might be a race condition there between 
SMF trying to restart the server and the server trying to restart 
itself… or am I just hallucinating…

Thanks in advance for your comments.

Cheers,

~george