On 28/07/11 18:06, Gavrina, Irina wrote: > Yes, you are right, this is a question of database robustness. Honestly, this is the wrong approach to high availability. Don't try to make one machine unbreakable - you *cannot* do it. Use multi-machine failover between a master and a replica with STONITH and a heartbeat/monitoring setup. See Google for more info. > You know, > If any of postgres processes died the postmaster process will again > start up it. The main worry is if the postmaster process dies (truth be > told, I can reproduce it only using ‘kill –KILL’, which should be never > done, I know) and all other processes are up and running. All created > connections will go on with their work. And it means all db queries will > be processed after the postmaster died. Am I correct? > > Is there any risk of losing of data or data corruption in this case? Possibly, if you then remove postmaster.pid, thus bypassing the other safety, and start a new postmaster. I'm a bit alarmed to discover that Ubuntu's postgresql packaging appears to do exactly this. /usr/bin/pg_ctlcluster, from line 78: # If a pid file is already present, delete it if it is stale/invalid, # or exit with a notice if it belongs to an already running postmaster. sub start_check_pid_file { my $pidfile = $info{'pgdata'}.'/postmaster.pid'; if (check_running_postmaster $pidfile) { print "Cluster is already running.\n"; exit 2; } # Remove invalid or stale PID file if (-e $pidfile) { unlink $pidfile; print "Removed stale pid file.\n"; } } It merrily removes the pid file without any effort to check if there are backends using that datadir, then goes on to relaunch the postmaster. Is this sane? Safe? I realize that they want to have Pg come back up properly after an OS crash or hard power loss, where it doesn't get the chance to remove the pid file. Fine. It also opens the system up to user "kill -9" abuse, though... > There is any way to configure the postgres db like ‘if the postmaster > process died all child processes are stopped’? The whole issue with kill -9 is that you're telling the system to bypass *ALL* of the usual safeties and recover opportunities, including any crash handler the program may have in place. You're denying the postmaster any opportunity to warn the child processes that it is crashing out. "If I shoot my foot off with this shotgun, is there any way I can automatically stop the bleeding and graft a new foot on?" No. Don't shoot your foot off just because UNIX systems give you the ability to do just that if you feel like it. You don't "dd if=/dev/zero of=/dev/sdzz1" either... (target node changed in case any idiot tries to run it anyway). That said - a quick test shows that the postmaster dies without signalling its children when killed with a SIGABRT (often raised by glibc memory function errors, assertions, etc) too. A query sitting in a big cartesian join merrily continues processing. Whether that applies to queries that're doing write activity, I'm not sure. Anyway, something like this only-very-briefly-tested bash code should give you vaguely the right idea, though I *STRONGLY* recommend you use external monitoring like nagios or the like, plus a proper failover setup, instead of a hackish self-restart facility that's likely to make things WORSE not better. function capture_cash_info() { # Capture `dmesg' output, the last 1000 lines of the # postmaster logs, etc etc. You'll have to write this. # The function should email a report to the admin for urgent # examination. echo "Capture crash report" } function scream_for_help() { # Something unrecoverable happened. Email the admin, TXT them, # activate their house alarm, and electrocute them awake, do # whatever you usually have to do. # You'll have to write this. echo 'Help me! ARgh!' } function try_to_recover() { # If you want to know it, $1 will contain "stopped" if pidfile # missing, or "died" if pidfile present but stale. Can be used in # admin email reports etc. if test $(pgrep postgres|wc -l) -gt 0 ; then echo 1>&2 "Postmaster dead but children still running!"; # Email/TXT the admin screaming for help and saying some idiot # "kill -9"'d the postmaster # If you felt brave you could SIGTERM the children here then # next time the watchdog ran might be able to relaunch the # postmaster if they exited. I wouldn't want to, though. scream_for_help else capture_cash_info # relaunch pg using your distro's init script or using pg_ctl, # depending on how you usually do it false || scream_for_help fi } DATADIR=/var/lib/postgresql/8.4/main/ PID=$DATADIR/postmaster.pid if ! test -e "$PID" then try_to_recover stopped elif ! kill -0 `cat "$PID"` >&/dev/null; then try_to_recover died fi If you use the above and it eats your data then eats your cat, don't blame me. I told you not to use it. -- Craig Ringer -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general