Still trying to track down why zombie processes sometimes survive on the
(Linux) Jenkins build machines. They then make later, unrelated Jenkins
builds on those machines fail, because the zombie soffice.bin processes
still hold onto named pipes that tests from the new builds want to create
too.
One such recent case on tb79 was the aborted
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>. It
left behind a zombie python.bin -> oosplash -> soffice.bin process tree
executing UITest_calc_tests3. (Presumably the soffice.bin process had
deadlocked, which then caused the Jenkins

    Build timed out (after 15 minutes). Marking the build as aborted.
    Build was aborted
    Finished: ABORTED

reaction. But by the time I noticed, the images of the involved processes
had already been overwritten by later builds, so I couldn't use gdb to
get backtraces.)
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull>
shows that some entity runs lode's tb_slave_wrapper as (the main) part
of the build, see

    [linux_clang_dbgutil_64] $ /bin/sh -xe /tmp/jenkins3389683698813990355.sh
    + /home/tdf/lode/bin/tb_slave_wrapper --real --mode=config --clean

That tb_slave_wrapper script contains

    trap cleanup 1 2 3 6 15
    cleanup()
    {
        echo "Caught Signal ... killing everything...."
        # kill everything in same process group (pseudo-pid 0)
        kill -9 0
    }

intended to kill all processes if the script itself receives any of
SIGHUP/-INT/-QUIT/-ABRT/-TERM.
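
Note that `kill -9 0` only reaches processes that are still in the script's own process group. As a minimal sketch (hypothetical, not part of tb_slave_wrapper, and assuming Linux's /proc layout), the helper below reads a process's group ID so that it can be compared against the script's. The name `pgid_of` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical diagnostic, not part of tb_slave_wrapper (Linux-specific).

# Print the process group ID of PID $1, read from /proc/PID/stat.
# After the closing ")" of the command name, the fields are:
# state, ppid, pgrp.
pgid_of() {
    sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f3
}

own_pgid=$(pgid_of $$)

# A plain background child of a non-interactive shell inherits the
# shell's process group.
sleep 5 &
child_pid=$!
child_pgid=$(pgid_of "$child_pid")

# Clean up the helper process again.
kill "$child_pid"
wait "$child_pid" 2>/dev/null

echo "script pgid=$own_pgid, child pgid=$child_pgid"
```

In a non-interactive shell I would expect both values to match. If a leftover soffice.bin on a build machine turned out to have a different process group than the wrapper had, `kill -9 0` could not have reached it even if cleanup() had run.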
But how does the tb_slave_wrapper script get terminated by whatever
entity starts it and prints out the

    Build timed out (after 15 minutes). Marking the build as aborted.
    Build was aborted
    Finished: ABORTED

mentioned above? Could it be that the script itself gets killed with
SIGKILL, so its cleanup() trap doesn't fire, and processes (indirectly)
spawned from the script may stay alive?
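
To illustrate that theory, here is a minimal sketch (hypothetical, not taken from tb_slave_wrapper or from Jenkins) of a child shell that installs a trap for the same signal numbers. The helper name `demo_trap` and the sleep durations are made up for illustration. It sends the child either SIGTERM or SIGKILL and prints whatever the trap handler wrote.

```shell
#!/bin/sh
# Hypothetical demo, not part of tb_slave_wrapper: a trap handler runs
# on SIGTERM, but SIGKILL cannot be caught, so the handler never runs.

demo_trap() {
    # $1 is the signal to send (TERM or KILL).
    sig=$1
    out=$(mktemp)
    sh -c '
        trap "echo Caught Signal; exit 1" 1 2 3 6 15
        # Sleep in the background and wait for it, so that a trapped
        # signal interrupts the wait and the handler runs promptly.
        sleep 5 &
        wait $!
    ' >"$out" 2>/dev/null &
    pid=$!
    sleep 1                      # give the child time to install its trap
    kill -s "$sig" "$pid"
    wait "$pid" 2>/dev/null
    cat "$out"                   # whatever the handler printed, if anything
    rm -f "$out"
}

term_output=$(demo_trap TERM)
kill_output=$(demo_trap KILL)

echo "after SIGTERM: '$term_output'"
echo "after SIGKILL: '$kill_output'"
```

With SIGTERM the handler's message shows up, with SIGKILL the output stays empty. In both runs of this sketch the background sleep outlives the shell that started it, which is the same kind of leftover process, just on a small scale.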
Interestingly, the output from the above

    echo "Caught Signal ... killing everything...."

doesn't show up anywhere in
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull>
(supporting the theory that cleanup() doesn't run), while other output
that apparently stems from similar echo/printf commands in that script
does show up there, see

    OS:
    pwd:/home/tdf/lode/jenkins/workspace/lo_gerrit/Config/linux_clang_dbgutil_64
    config mode : linux_clang_dbgutil_64
    Taking configuration values from ./distro-configs/Jenkins/linux_clang_dbgutil_64