On Sun, Jul 13, 2014 at 12:07 PM, Ru Devel <rudevel@xxxxxxxxx> wrote:
Hello,

I have postgres 9.3.4 running on Linux, with ~20 databases in the cluster. The whole cluster was migrated from 9.2 using pg_upgradecluster. After the migration, autovacuum started failing in one database, crashing the entire cluster:

2014-07-13 21:16:24 MSK [5665]: [1-1] db=,user= PANIC: corrupted item pointer: offset = 5292, size = 24
2014-07-13 21:16:24 MSK [29131]: [417-1] db=,user= LOG: server process (PID 5665) was terminated by signal 6: Aborted
2014-07-13 21:16:24 MSK [29131]: [418-1] db=,user= DETAIL: Failed process was running: autovacuum: VACUUM public.postfix_stat0 (to prevent wraparound)
2014-07-13 21:16:24 MSK [29131]: [419-1] db=,user= LOG: terminating any other active server processes
2014-07-13 21:16:24 MSK [29597]: [1-1] db=,user= WARNING: terminating connection because of crash of another server process

I have two questions:

1) Why does a problem in only one database, in one small piece of memory, lead to an entire-server problem? The affected database is not important, but the corruption inside it causes frequent cluster-wide restarts, so the whole server suffers from this local problem. Why should the postmaster restart all backends if only one dies?
In general, what this means is that the error occurred in a "critical section". The backend has taken a lock protecting part of shared memory, and has (possibly) made changes that leave that shared memory in an inconsistent state, but it can no longer complete the work that would bring it back to a consistent state. It cannot simply release the lock it holds, because that would allow other processes to see the inconsistent state. So restarting the entire system is the only alternative. This is drastic, which is why the developers try to keep critical sections as small as possible.
It is possible that this particular code does not really need to be in a critical section, and it is simply that no one has done the work of rearranging the code to move it out of the critical section.
2) what is the best modern way to analyze and fix such an issue?
Is the problem reproducible? That is, if you restore the last physical backup of your pre-upgrade database to a test server and run pg_upgrade on that, do you get the same problem?
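Roughly, on a scratch machine, that test could look something like this (the paths, cluster name, and database name below are just placeholders for your setup, and pg_upgradecluster is assumed because that is the wrapper you said you used):

    # restore the filesystem-level 9.2 backup into a scratch data directory (server stopped)
    rsync -a /backups/pg-9.2-main/ /var/lib/postgresql/9.2/main/
    # repeat the same upgrade path you used in production
    pg_upgradecluster 9.2 main
    # then try to trigger the failure directly on the table from your log
    psql -d your_affected_db -c 'VACUUM (VERBOSE, FREEZE) public.postfix_stat0;'

If the panic shows up again on the scratch copy, you have a reproducible case to experiment on without risking production.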
Did you get a core dump from the panic? If so, you can attach to it with gdb and get a backtrace (in which case you should probably take this to the pgsql-hackers mailing list).
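Assuming core dumps are enabled for the postmaster (e.g. ulimit -c unlimited before starting it), the rough procedure is something like the following; the binary and core file paths are just examples of a Debian-style 9.3 layout, and you will want your distribution's PostgreSQL debug symbol package installed so the backtrace has function names. By default the core file lands in the data directory of the crashed backend, unless the kernel's core_pattern sends it elsewhere:

    gdb /usr/lib/postgresql/9.3/bin/postgres /var/lib/postgresql/9.3/main/core
    (gdb) bt full

The "bt full" output, together with the PANIC message, is what the hackers list will most likely want to see.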
If your top concern is getting all the other databases back as soon as possible, you should be able to just drop the corrupted database (after making a full backup). Then you can worry about recovering that database and restoring it to the cluster at your leisure.
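A minimal sketch of that, assuming pg_dump can still read the damaged database (it may trip over the same corruption, in which case a cold filesystem-level copy of the data directory, taken with the server stopped, is the safer backup):

    # your_affected_db is a placeholder for the real database name
    pg_dump -Fc your_affected_db > your_affected_db.dump    # logical backup, if readable
    dropdb your_affected_db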
Cheers,
Jeff