Re: EPIPE

quinet@xxxxxxxxxx (Raphael Quinet) · Fri, 12 May 2000 17:43:20 +0200 (MET DST)

On Thu, 11 May 2000, Michael Natterer <mitschel@xxxxxxxxxxxxxxx> wrote:
> Raphael Quinet wrote:
> > No, actually it is not safe on all operating systems: as I wrote
> > elsewhere, you cannot always rely on SA_NODEFER.  This means that in
> > some cases, you could miss a SIGCHLD signal that occurs while you are
> > still inside the handler but after the last test on waitpid().  If
> > this happens, the main app will not see that one of the plug-ins has
> > died (until another one dies and the handler collects the status for
> > both).  That's why it is safer to make the tests outside the signal
> > handler.  Otherwise, you could have a race condition on some systems
> > (very seldom, but still...).
> 
> We don't use SA_NODEFER any more. And AFAIK the delivery of SIGCHLD has
> nothing to do with cleaning up zombies. This is why we loop around
> waitpid() because POSIX explicitly says that signals arriving close
> together may be merged by the kernel.

The delivery of SIGCHLD does not clean up the zombies: this is done by
calling waitpid() or any wait*() function.  But the call to waitpid()
will be triggered by a SIGCHLD, and there is a potential race
condition.  I will try to explain this in a slightly different way:
regardless of the setting of SA_NODEFER, you cannot ensure that you
will get one SIGCHLD for every dead process.  Some kernels merge the
signals that arrive close together, and some of them mask the signal
while the corresponding handler is being called (some systems will
re-deliver the masked signal immediately after the handler returns,
but this is not always the case).

Now, consider the following scenario for a race condition:
  * The Gimp starts two plug-ins
  * plug-in 1 dies
  * SIGCHLD delivered, handler called
  * inside the signal handler:
  * - waitpid() returns the status of the first process
  * - waitpid() returns 0 and the while() loop ends
  * plug-in 2 dies (before the signal handler returns)
  * SIGCHLD cannot be delivered
  * - the signal handler returns
  * The Gimp continues and the status of second plug-in is never
    collected, so this creates a zombie.
Although it is rather unlikely that a context switch happens in the
signal handler just after the while() loop and before returning, it
can and will happen.

If you are sure that all kernels of the systems that Gimp supports
will re-deliver the second SIGCHLD signal immediately after the signal
handler returns, then there is no problem in the scenario described
above: the handler will be called a second time and will collect the
status of the second plug-in.  But I do not think so; that's why I
suggested to call waitpid() outside the signal handler, because then
you avoid the race condition.

> > Yes...  I wrote a few months ago that I would change that and implement
> > some kind of --enable-stack-trace option, but I never took the time to
> > do it.
> 
> Now it's there :) We just have to convert the remaining g_print() to write()
> and the handler will be totally safe if enable_stack_trace == FALSE.

Hmmm...  I don't know how you have implemented it (I cannot look at
the CVS code), but I was thinking of having more options for
--enable-stack-trace, both for the configure option and for the
command-line option:
- "never": disable the stack trace and use the default signal handlers
- "query": ask the user, as we are doing now.
- "always": always call the debugger without asking any question (useful
          if stdin does not work, but stdout is still OK)
- "wait": always wait 30 seconds, without asking any question (this does
          not use stdin or stdout and gives some time to the user if she
          wants to attach a debugger to the process)
If the code is compiled or run with the "never" option, then the
signal handlers would never be installed.  The configure option would
set the default value for this flag, but it would always be possible
to override it with the command-line option without recompiling.  The
default for the code in CVS and for the unstable tarballs should be
"query", and the default for the stable releases could be "never" or
"wait".

> We should probably re-add the reentrancy guards in the fatal handlers and
> just do a brute force exit() if it's called recursively (which can only
> happen during the stack trace because that's the only case where the signals
> are unblocked in the handler).

Another option would be to set the signals to SIG_DFL as soon as the
fatal handler is called.  Since the default behavior for these signals
is to kill the program, this would prevent the endless loop without
requiring a special flag.

On Fri, 12 May 2000, Michael Natterer <mitschel@xxxxxxxxxxxxxxx> wrote:
[...]
> So ignoring SIGPIPE in the app causes it to be ignored in children, too.
> To avoid the need of resetting the signal we could just connect it
> to a dummy handler and let exec() do it's job of resetting it.

Well, as I wrote above and in a previous mail, there is a convenient
SIG_DFL that exits for that purpose, so you do not have to install a
dummy handler.

-Raphael