Re: select on 22 GB table causes "An I/O error occured while sending to the backend." exception

david@xxxxxxx · Thu, 28 Aug 2008 20:02:48 -0700 (PDT)

On Thu, 28 Aug 2008, Alvaro Herrera wrote:

david@xxxxxxx wrote:
On Thu, 28 Aug 2008, Scott Marlowe wrote:

scenario 1:  There's a postmaster, it owns all the child processes.
It gets killed.  The Postmaster gets restarted.  Since there isn't one

when the postmaster gets killed, doesn't that kill all its children as
well?

Of course not.  The postmaster gets a SIGKILL, which is instant death.
There's no way to signal the children.  If they were killed too then
this wouldn't be much of a problem.

I'm not saying that it would signal its children; I thought that the OS 
killed children (unless steps were taken to allow them to re-parent)
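The re-parenting question can be checked directly on a Unix-like system. The following is a minimal sketch, not PostgreSQL code: the function name and the "postmaster"/"backend" stand-ins are made up for illustration. It forks a parent with one child, sends the parent a SIGKILL, and has the child report which process is its parent afterwards.

```python
import os
import signal
import time


def kill_parent_and_watch_child():
    """Fork a stand-in 'postmaster' with one 'backend', then SIGKILL the former.

    Returns (postmaster_pid, backend_parent_pid_after_the_kill).
    """
    read_end, write_end = os.pipe()
    postmaster = os.fork()
    if postmaster == 0:
        os.close(read_end)
        postmaster_pid = os.getpid()
        if os.fork() == 0:
            # Stand-in backend: report readiness, then wait to be orphaned.
            os.write(write_end, b"R")
            deadline = time.time() + 5
            while os.getppid() == postmaster_pid and time.time() < deadline:
                time.sleep(0.01)
            # Still running after the parent died: report the current parent.
            os.write(write_end, str(os.getppid()).encode())
            os._exit(0)
        os.close(write_end)
        time.sleep(30)  # stand-in postmaster idles until it is killed
        os._exit(0)

    os.close(write_end)
    os.read(read_end, 1)                 # wait for the backend's "ready" byte
    os.kill(postmaster, signal.SIGKILL)  # cannot be caught, no cleanup runs
    os.waitpid(postmaster, 0)
    new_parent = int(os.read(read_end, 32).decode())
    os.close(read_end)
    return postmaster, new_parent
```

If the child is able to report a parent PID different from the killed process, it was re-parented (to init or a subreaper) rather than killed along with its parent.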

well, if you aren't going through the postmaster, what process is
receiving network messages? it can't be a group of processes, only one
can be listening to a socket at one time.

Huh?  Each backend has its own socket.

we must be talking about different things. I'm talking about the socket 
that would be used for clients to talk to postgres, this is either a TCP 
socket or a unix socket. in either case only one process can listen on it.
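The two kinds of socket being discussed can be told apart with a small sketch. This is plain socket code under assumed names, not how PostgreSQL is written: the listening socket only hands out new connections, and each accepted connection is a separate socket that keeps working after the listener is closed.

```python
import socket


def connection_outlives_listener():
    """Accept one connection, close the listener, then keep using the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _addr = listener.accept()  # a new socket, distinct from the listener
    listener.close()                 # stand-in for the listening process dying

    # The established connection is unaffected by the listener going away.
    client.sendall(b"ping")
    request = conn.recv(4)
    conn.sendall(b"pong")
    reply = client.recv(4)

    conn.close()
    client.close()
    return request, reply
```

In this model, losing the listener only prevents new clients from connecting; already-connected clients keep talking to whatever process holds their accepted socket.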

and if the postmaster isn't needed for the child processes to write to
the datastore, how are multiple child processes prevented from writing to
the datastore normally? and why doesn't that mechanism continue to work?

They use locks.  Those locks are implemented using shared memory.  If a
new postmaster starts, it gets a new shared memory, and a new set of
locks, that do not conflict with the ones already held by the first gang
of backends.  This is what causes the corruption.
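A toy model of the failure described here, assuming nothing about PostgreSQL's real lock manager: `LockTable` is a hypothetical stand-in for the lock table that lives in one shared memory segment. Two tables know nothing about each other, so a lock held in the old one does not block the same resource in the new one.

```python
import threading


class LockTable:
    """Toy stand-in for the lock table inside one shared memory segment."""

    def __init__(self):
        self._locks = {}

    def try_lock(self, resource):
        """Return True if the lock on `resource` was acquired in this table."""
        lock = self._locks.setdefault(resource, threading.Lock())
        return lock.acquire(blocking=False)


old_segment = LockTable()  # used by the backends of the killed postmaster
new_segment = LockTable()  # created by the freshly started postmaster

held_by_old_backend = old_segment.try_lock("page 42")   # first taker wins
second_old_backend = old_segment.try_lock("page 42")    # properly refused
new_backend = new_segment.try_lock("page 42")           # conflict goes unseen
```

Within one table the second request is refused, which is the normal protection. Across tables both requests succeed, so two writers can touch the same data at once.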

so the new postmaster needs to detect that there is a shared memory 
segment out there that is used by backends for this database.

this doesn't sound that hard, basically something similar to a pid file in 
the db directory that records what backends are running and what shared 
memory segment they are using.
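A rough sketch of that idea, under stated assumptions: the state file and its "pid shm_key" line format are hypothetical, invented here for illustration, and are not an existing PostgreSQL file. A starting postmaster would read the file and refuse to continue while any listed backend is still alive.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def surviving_backends(state_file):
    """Parse lines of 'pid shm_key' and return the entries still alive."""
    survivors = []
    with open(state_file) as f:
        for line in f:
            if not line.strip():
                continue
            pid_text, shm_key = line.split()
            if pid_alive(int(pid_text)):
                survivors.append((int(pid_text), shm_key))
    return survivors
```

A PID check like this is only a heuristic, since PIDs can be reused and an unreaped zombie still counts as existing, so it would be a first line of defense rather than a guarantee.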

this would be similar to the existing pid file that would have to be 
removed manually before a new postmaster can start (if it's not a graceful 
shutdown)

besides, some watchdog would need to start the new postmaster, that 
watchdog can be taught to kill off the child processes before starting a 
new postmaster along with clearing the pid file.
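One way such a watchdog could find the leftover children is to start the server in its own process group and signal the whole group before restarting. This is only a sketch of that assumption: the helper names are made up, and it presumes the children stay in the process group they were started in.

```python
import os
import signal
import subprocess


def start_in_own_group(argv, **popen_kwargs):
    """Start a process as leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True, **popen_kwargs)


def kill_leftover_group(pgid):
    """SIGKILL every process still in the group; False if none were left."""
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # the group is already empty
    return True
```

Because the group leader's PID doubles as the process group ID, the watchdog can call `kill_leftover_group(proc.pid)` after the leader has died and only then remove the pid file and start a replacement.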

so are you saying that the only possible thing that can kill the
postmaster is the OOM killer? it can't possibly exit in any other
situation without the children being shut down first?

I would be surprised if that was really true.

If the sysadmin sends a SIGKILL then obviously the same thing happens.

Any other signal gives it the chance to signal the children before
dying.
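The difference between SIGKILL and other signals can be seen from user space: the OS refuses to install a handler for SIGKILL, while SIGTERM accepts one. A minimal sketch, with a hypothetical helper name, meant to be run from the main thread:

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum."""
    def handler(_signum, _frame):
        pass

    try:
        previous = signal.signal(signum, handler)
    except OSError:
        return False  # the OS refuses, as it does for SIGKILL
    # Put back whatever was installed before, so the probe has no side effects.
    signal.signal(signum, signal.SIG_DFL if previous is None else previous)
    return True
```

A process that can install a SIGTERM handler gets a chance to signal its children from that handler before exiting; with SIGKILL no such code ever runs.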

are you sure that it's not going to die from a memory allocation error? or 
any other similar type of error without _always_ killing the children?

David Lang