First of all, thanks to those who provided input. This problem is now fixed, and I thought I would post the solution so that others might benefit in the future.

For the sake of completeness: the error was that if "show all" was run on this PostgreSQL (version 8.3) server, postgres would crash and then recover. Otherwise the server "seemed" healthy.

The postgres log showed:

Sep 10 23:55:36 theconsole postgres[31118]: [4-1] 0: LOG: 00000: server process (PID 31145) was terminated by signal 11: Segmentation fault
Sep 10 23:55:36 theconsole postgres[31118]: [4-2] 0: LOCATION: LogChildExit, postmaster.c:2529
Sep 10 23:55:36 theconsole postgres[31118]: [5-1] 0: LOG: 00000: terminating any other active server processes
Sep 10 23:55:36 theconsole postgres[31118]: [5-2] 0: LOCATION: HandleChildCrash, postmaster.c:2374
Sep 10 23:55:36 theconsole postgres[31118]: [6-1] 0: LOG: 00000: all server processes terminated; reinitializing
Sep 10 23:55:36 theconsole postgres[31118]: [6-2] 0: LOCATION: PostmasterStateMachine, postmaster.c:2690
Sep 10 23:55:36 theconsole postgres[31146]: [7-1] 0: LOG: 00000: database system was interrupted; last known up at 2009-09-10 23:55:14 EST
Sep 10 23:55:36 theconsole postgres[31146]: [7-2] 0: LOCATION: StartupXLOG, xlog.c:4836
Sep 10 23:55:36 theconsole postgres[31147]: [7-1] [local] postgres postgres 0: FATAL: 57P03: the database system is in recovery mode
Sep 10 23:55:36 theconsole postgres[31147]: [7-2] [local] postgres postgres 0: LOCATION: ProcessStartupPacket, postmaster.c:1648
Sep 10 23:55:36 theconsole postgres[31146]: [8-1] 0: LOG: 00000: database system was not properly shut down; automatic recovery in progress
Sep 10 23:55:36 theconsole postgres[31146]: [8-2] 0: LOCATION: StartupXLOG, xlog.c:5003
Sep 10 23:55:36 theconsole postgres[31146]: [9-1] 0: LOG: 00000: record with zero length at 2A/E734761C
Sep 10 23:55:36 theconsole postgres[31146]: [9-2] 0: LOCATION: ReadRecord, xlog.c:3126
Sep 10 23:55:36 theconsole postgres[31146]: [10-1] 0: LOG: 00000: redo is not required
Sep 10 23:55:36 theconsole postgres[31146]: [10-2] 0: LOCATION: StartupXLOG, xlog.c:5146
Sep 10 23:55:36 theconsole postgres[31150]: [7-1] 0: LOG: 00000: autovacuum launcher started
Sep 10 23:55:36 theconsole postgres[31150]: [7-2] 0: LOCATION: AutoVacLauncherMain, autovacuum.c:520
Sep 10 23:55:36 theconsole postgres[31118]: [7-1] 0: LOG: 00000: database system is ready to accept connections

SOLUTION:
Increase the memory on the server.

WHY
We had installed Splunk on the server recently (a month before). It was running OK. The combination of Splunk and the other tasks running had pushed memory usage too close to the limit. What we did not notice was that swap had been almost completely consumed - nasty.

RESULT
We shut it all down, doubled the memory and voila - problem gone.

It goes to show that when hunting problems we should not ignore the basic environmental elements. It also goes to show that our monitoring system was not looking at this relatively new server. (This confession is not an invitation for a spanking.)

Again, thanks for the help.

Grant

On 11/09/2009, at 9:09 AM, Grant Maxwell wrote: