Child process running CGI script does not terminate after exit() and is killed after a 3s delay, under Solaris 8

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

using Apache 1.3.27 on Solaris 8 we see a problem with returning from a CGI script. I'd like to ask if this is a known phenomenon, and if there is a workaround or patch or a
page about it. I searched the Web, the Apache sources and the archives of this mailing list but I could not find a hint about this particular problem. One posting lead me to
the function in alloc.c mentioned below.

We are using Apache 1.3.27 with mod_cgi on Solaris 8. With both the version provided by Oracle and the Sun Freeware version we have the following problem: a CGI script builds
a page that contains references to additional files, like Javascript libraries and CSS files. Loading of these files can be delayed by approximately 3 seconds (seen with the
Firefox plugin Firebug on client side, and in both snoop and truss logs on server side), which significantly decreases site performance. Not *every* call is affected, but the
delays occur quite often.

The original script was written in Perl but perl ist not the problem - the delay occurs with a shell script as well:

 #!/usr/bin/tcsh

 echo 'Content-Type: text/html; charset=ISO-8859-1'
 echo ''
 echo '<html>'
 echo ' <head>'
 echo '<title>Load Delay</title>'
 echo '<script src="/xyz.js" type="text/javascript"></script>'
 echo '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />' echo '</head>' echo '<body>' echo 'Load delay.'
 echo '</body>'
 echo '</html>'

According to truss the additional files are not the problem. We can reproduce the delay with a simple script that generates a page without supplementary file requests:

 #!/usr/local/bin/perl -w

 print <<EOHTML;
 Content-Type: text/html; charset=ISO-8859-1

 <html>
  <head>
   <title>Load Delay</title>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  </head>
  <body>
   Load Delay
  </body>
 </html>

 EOHTML

Using truss we get this typical trace for the delay case:

22173/1:  53.3753 fork1()                                         = 29230 29230/1:  53.3753 fork1()         (returning as child ...)        = 22173

                [... CGI script processing ...]

29230/1:  53.4027 write(1, " C o n t e n t - T y p e".., 239)     = 239 22173/1:  53.4028 read(8, " C o n t e n t - T y p e".., 4096)     = 239 22173/1:  53.4031 time()
                                   = 1177084845 29230/1:  53.4031 llseek(0, 0, SEEK_CUR)                          Err#29 ESPIPE 29230/1:  53.4032 _exit(0)

22173/1:  53.4033 fstat64(8, 0xFFBE55C0)                          = 0 22173/1:  53.4034 fcntl(8, F_SETFL, 0x00000080)                   = 0 22173/1:  53.4036 read(8,
0xFFBE57B0, 8192)                       = 0 22173/1:  53.4037 close(8)                                        = 0 22173/1:  53.4038 read(10, 0x000D0920, 4096)
     = 0 22173/1:  53.4039 close(10)                                       = 0 22173/1:  53.4040 poll(0xFFBEB888, 1, 0)                          = 0 22173/1:  53.4042
write(5, " H T T P / 1 . 1   2 0 0".., 421)     = 421 22173/1:  53.4044 time()                                          = 1177084845 22173/1:  53.4047 write(17, " 1 3 7 . 1 6
7 . 2 2 0 .".., 103)    = 103 22173/1:  53.4048 waitid(P_PID, 29230, 0xFFBEB868, WEXITED|WTRAPPED|WNOHANG) = 0 22173/1:  53.4049 getpid()
  = 22173 [22169] 22173/1:  53.4049 kill(29230, SIGTERM)                            = 0 22173/1:  53.4050 lwp_cond_signal(0xFF0ECD30)                     = 0 22173/4:
53.4050 lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0x00000000) = 0 22173/1:  lwp_sema_wait(0x00067260)       (sleeping...) 22173/2:  signotifywait()                 (sleeping...)
22173/3:  lwp_sema_wait(0xFDF0DE60)       (sleeping...)
22173/4:  lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0xFDE0BD30) (sleeping...) 22173/4:  56.4115 lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0xFDE0BD30) Err#62 ETIME 22173/4:  56.4116
lwp_sema_post(0x00067260)                       = 0 22173/1:  56.4118 lwp_sema_wait(0x00067260)                       = 0 22173/1:  56.4119 lwp_mutex_lock(0xFF0F3500)
             = 0 22173/4:  56.4119 lwp_cond_signal(0x00067318)                     = 0 22173/1:  56.4120 getpid()                                        = 22173 [22169]
22173/1:  56.4121 kill(29230, SIGKILL)                            = 0 22173/1:  56.4121 waitid(P_PID, 29230, 0xFFBEB868, WEXITED|WTRAPPED) = 0

So, the child process running the CGI script calls exit(), but it seems it cannot be cleaned up by the kernel, so waitpid() and SIGTERM fail. Finally, after a delay of three
seconds, Apache sends SIGKILL to clean up the child process. Subsequent requests are not served until then.

To me it seems the truss log matches the code in Apaches source file alloc.c, in free_proc_chain() (I am not familiar with Apache sources so it is an assumption this is the
code running):

static void free_proc_chain(struct process_chain *procs)
{
    /* Dispose of the subprocesses we've spawned off in the course of
     * whatever it was we're cleaning up now.  This may involve killing * some of them off... */

    struct process_chain *p;
    int need_timeout = 0;
    int status;

    if (procs == NULL)
    return;            /* No work.  Whew! */

    /* First, check to see if we need to do the SIGTERM, sleep, SIGKILL
     * dance with any of the processes we're cleaning up.  If we've got * any kill-on-sight subprocesses, ditch them now as well, so they * don't waste any more cycles doing
whatever it is that they shouldn't * be doing anymore.
     */
#ifdef WIN32
   [... WIN32 code cut out as we are running Solaris ...]
#elif defined(NETWARE)
#else
#ifndef NEED_WAITPID
    /* Pick up all defunct processes */
    for (p = procs; p; p = p->next) {
    if (waitpid(p->pid, (int *) 0, WNOHANG) > 0) {
        p->kill_how = kill_never;
    }
    }
#endif

    for (p = procs; p; p = p->next) {
    if ((p->kill_how == kill_after_timeout)
        || (p->kill_how == kill_only_once)) {
        /* Subprocess may be dead already.  Only need the timeout if not. */ if (ap_os_kill(p->pid, SIGTERM) != -1) need_timeout = 1;
    }
    else if (p->kill_how == kill_always) {
        kill(p->pid, SIGKILL);
    }
    }

    /* Sleep only if we have to... */

    if (need_timeout)
    sleep(3);

    /* OK, the scripts we just timed out for have had a chance to clean up
     * --- now, just get rid of them, and also clean up the system accounting * goop... */

    for (p = procs; p; p = p->next) {

    if (p->kill_how == kill_after_timeout)
        kill(p->pid, SIGKILL);

    if (p->kill_how != kill_never)
        waitpid(p->pid, &status, 0);
    }
#endif /* WIN32 */
}


In this code there are a waitpid() call with WNOHANG, a sending of SIGTERM, a three second delay, and a final SIGKILL, similar to the truss log.

The question is why the child process cannot be cleaned up after exit() (sometimes it works, but very often it fails, with the same script). Is this a known problem? Is there
a workaround? Is it probably related to the operating system version (Solaris 8), and if yes is there a patch? (The system is well maintained and patched but possibly a
special patch was missed?)

Any help or hint would be appreciated.

Thank you in advance

                  Jochen










---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
   "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx


[Index of Archives]     [Open SSH Users]     [Linux ACPI]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Squid]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]

  Powered by Linux