Apache 2.2.15 + mod_fcgid 2.3.7 (CentOS 6.4) graceful restarts, no leftover processes, but errors both in browser and error log (Apache Users)

Hi all,

More than nine months ago I discovered a problem with graceful restarts on a default Virtualmin installation using the default execution mode (mod_fcgid), but only recently did I have the time to dig deeper and experiment. Since Virtualmin uses Apache + mod_fcgid by default, the experiments will probably give the same results on any Apache 2.2 + mod_fcgid 2.3.7 installation. This is not the widely known problem of leftover processes that never get killed on a graceful restart. It is something else: the processes get forcefully killed way too soon, and the output never reaches the browser. Please test it on your setup and report back the result.

What is the setup:

CentOS 6.4 x86_64 minimal installation

Virtualmin 4.02.gpl GPL installed by the automatic .sh script, all default settings (you can skip this; the problem is probably not Virtualmin related)

mod_fcgid.x86_64 2.3.7-1.el6 from the Virtualmin repo (others should work too)

httpd.x86_64 1:2.2.15-29.el6.vm.1 from the Virtualmin repo (others should work too)

php 5.3.3 from the official repo

Single virtual domain, running under the default FCGId execution mode, with the PHP execution time limit and the fcgid IO wait both set to 90 seconds.

Single test.php file containing

<?php
for ($i = 1; $i <= 30; $i++) {
    echo $i."\n";
    sleep(1);
}

What is the error:

Run the script in a browser, then do a graceful restart of Apache (service httpd graceful). After around 12 seconds you will see a "No data received" error in your browser (Chrome) and the following in the Apache error log:

(22)Invalid argument: mod_fcgid: can't lock process table in pid 25570

(the pid number will be different of course)

Further experiments show that in this case the script is forcefully killed before it finishes.

If you reduce the script's run time to 5 seconds ($i <= 4), you get the same result, this time after 5 seconds.

Further experiments show that in this case the process does complete, but you still get the errors in both the browser and the error log.

Try it and post your result.

Dig:

This is probably a problem in mod_fcgid.

I tweaked the experiment by adding a file write to the script, which shows how far each run gets: which run completes and which gets killed before the end. That is how I got the results above.

Add this inside the loop:

file_put_contents("test.txt", "test run for: ".$i." seconds");
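For anyone reproducing this, here is a minimal sketch of the complete instrumented test script, combining the original loop with the file write. The function name run_progress_test(), its parameters, the placement of the write after sleep(), the use of __DIR__ for the marker file, and the PHP_SAPI guard are my own assumptions for illustration and not part of the original test.php.

```php
<?php
// Hypothetical helper (not from the original script): prints a counter once
// per iteration and overwrites $logFile with the number of completed iterations.
function run_progress_test($iterations, $logFile, $delay = 1)
{
    for ($i = 1; $i <= $iterations; $i++) {
        echo $i . "\n";
        sleep($delay);
        // Overwritten on every pass, so after a kill the file holds the last
        // iteration that finished.
        file_put_contents($logFile, "test run for: " . $i . " seconds");
    }
}

// Run the 30-second test only when served by the web server (under mod_fcgid
// the SAPI name is not "cli"), so the file can be included from the command
// line without waiting.
if (PHP_SAPI !== 'cli') {
    run_progress_test(30, __DIR__ . '/test.txt');
}
```

After a graceful restart, a test.txt that reads "test run for: 30 seconds" means the script ran to the end; a smaller number means it was killed part-way through.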

So why 12 seconds, and where is this set? After some time I discovered that increasing FcgidErrorScanInterval to 60 lets the script run to completion (but you still get the errors).

If you check the mod_fcgid code in fcgid_pm_main.c, the graceful restart should be handled by the function kill_all_subprocess(), but apparently scan_errorlist() is also executed, even though there is a check for procmgr_must_exit().

The error in the log, "can't lock process table in pid 25570", probably means that some information about the process (the mutex) is destroyed immediately upon the graceful restart, so we never get the result back.

Even if we work around the early termination of the processes by increasing FcgidErrorScanInterval, the second problem is actually bigger: all your users are going to see this error.

Do you get the same errors, and do you have any idea how to fix mod_fcgid?

Thanks for your time, testing and commenting!

Georgi Petrov