Hi!
We could reproduce the start-up problem on Windows 2003. After a reboot, the postmaster cleans up old temporary files as part of its start-up sequence, and this step took a little over 4 minutes, delaying the writing of line 6 onwards into the PID file. This delay caused pg_ctl to time out, leaving behind an orphaned postgres.exe process (which eventually forks off many other postgres.exe processes). But since pg_ctl itself isn't running after the timeout, Windows thinks the service isn't running. A subsequent attempt to start the service using pg_ctl then complains that the existing lock file is still in use by one of the postgres.exe processes spawned earlier.
On Tue, May 8, 2012 at 12:13 PM, deepak <deepak.pn@xxxxxxxxx> wrote:
On Tue, May 8, 2012 at 3:09 AM, Alban Hertroys <haramrae@xxxxxxxxx> wrote:
On 8 May 2012, at 24:34, deepak wrote:
No, it means that postgres wasn't shut down properly when Windows shut down. Removing the pid-file is one of the last things the shut-down procedure does. The file is used to prevent 2 instances of the same server running on the same data-directory.
> Hi,
>
> On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.
>
> I tried rebooting a few times and even force shutting down the server, and it started up fine.
> It seems to be a race-condition of sorts in the code that detects whether the process with PID
> in the file is running or not.
If it's a race-condition, it's probably one in Microsoft's shutdown code. I've seen similar problems with Outlook mailboxes on a network directory; Windows unmounts the remote file-systems before Outlook finished updating its files under that mount point, so Outlook throws an error message and Windows doesn't shut down because of that.
I don't suppose that pid-file is on a remote file-system?
No, it's local.
> Does anyone have this same problem? Any way to fix it besides removing the PID file
> manually each time the server complains about this?
You could probably script removal of the pid file if its creation date is before the time the system started booting up.
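A minimal sketch of the scripted cleanup suggested above, in Python. The function name `remove_stale_pid_file` is hypothetical, the boot time is assumed to be supplied by the caller as epoch seconds (how to obtain it is platform-specific and not shown), and the file's modification time is used as a stand-in for its creation date. It is only meant to run before the service starts, when no postmaster can be using the data directory.

```python
import os


def remove_stale_pid_file(pid_path, boot_time):
    """Delete pid_path if it was last written before boot_time (epoch seconds).

    Returns True if the file was removed, False otherwise.
    """
    try:
        # Assumption: modification time approximates the file's creation date.
        written_at = os.stat(pid_path).st_mtime
    except FileNotFoundError:
        return False  # nothing to clean up
    if written_at < boot_time:
        os.remove(pid_path)  # left over from before the reboot
        return True
    return False  # newer than the boot, so leave it alone
```

A file written after the boot is left untouched, since it may belong to a server that is actually running.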
Thanks, it looks like the code already seems to overwrite an old pid file if no other process is using it (if I understand the code correctly, it just echoes a byte onto a pipe to detect this).
Still, I can't see under what conditions this occurs. I have seen it happen a couple of times, but I don't know how to reproduce the problem predictably.
--
Deepak
We have observed conclusively that the file system cache is coming into play. We tested a scenario in which, after a reboot, we first walked the file system under the data directory using the Cygwin "find" command. After that, pg_ctl did not time out and the server started up fine, suggesting that the cleanup is much faster when the file system is cached.
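For anyone wanting to repeat the test without Cygwin, the traversal can be approximated with a short Python sketch. The function name `warm_directory_cache` is hypothetical, and whether a stat of every entry warms the cache as effectively as "find" did in our test is an assumption, not something we measured.

```python
import os


def warm_directory_cache(root):
    """Walk root and stat every entry, returning how many entries were visited."""
    visited = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            visited += 1
    return visited
```

It would be called with the path of the data directory before starting the service; the returned count is only there to confirm that the walk actually covered the tree.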
Any ideas on fixing this start-up delay in postmaster?
Could the cleanup be moved elsewhere, specifically to some point after the PID file has been completely written, so that pg_ctl doesn't time out?
Any other suggestions for working around this problem?
Thanks,
Deepak