How are Jenkins builds killed exactly?

I am still trying to track down why leftover ("zombie") processes sometimes survive on the (Linux) Jenkins build machines. They make later, unrelated Jenkins builds on those machines fail, because the leftover soffice.bin processes still hold onto named pipes that tests from the new builds want to create too.

One such recent case on tb79 was the aborted <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>. It left behind a zombie python.bin -> oosplash -> soffice.bin process tree executing UITest_calc_tests3. (Where presumably the soffice.bin process had deadlocked, which then caused the Jenkins

Build timed out (after 15 minutes). Marking the build as aborted.
Build was aborted
Finished: ABORTED

reaction. But once I noticed, the images of the involved processes had already been overwritten by later builds, so I couldn't use gdb to get backtraces.)

<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull> shows that some entity runs lode's tb_slave_wrapper as (the main) part of the build, see

[linux_clang_dbgutil_64] $ /bin/sh -xe /tmp/jenkins3389683698813990355.sh
+ /home/tdf/lode/bin/tb_slave_wrapper --real --mode=config --clean

That tb_slave_wrapper script contains

trap cleanup 1 2 3 6 15

cleanup()
{
  echo "Caught Signal ... killing everything...."
  # kill everything in same process group (pseudo-pid 0)
  kill -9 0
}

intended to kill all processes if the script itself receives any of SIGHUP/-INT/-QUIT/-ABRT/-TERM.
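
As a quick check of that signal list, the numbers in the trap line can be mapped to names with the shell's own kill builtin. This is only a small sketch; the variable name `names` is made up for illustration, and the number-to-name mapping shown is the one used on Linux.

```shell
# Map the signal numbers from "trap cleanup 1 2 3 6 15" to their names.
names=""
for n in 1 2 3 6 15; do
  # ${names:+ } inserts a separating space only once names is non-empty
  names="$names${names:+ }$(kill -l "$n")"
done
echo "$names"
```

Note that SIGKILL (9) is not in the list, and could not usefully be: a shell cannot trap it.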

But how does the tb_slave_wrapper script get terminated by whatever entity starts it and prints out the

Build timed out (after 15 minutes). Marking the build as aborted.
Build was aborted
Finished: ABORTED

mentioned above? Could it be that the script itself gets killed with SIGKILL, so its cleanup() trap doesn't fire, and processes (indirectly) spawned from the script may stay alive?
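
The difference this theory relies on can be reproduced in isolation. The following is a minimal sketch, not a model of what Jenkins actually does: it starts a child shell with a trap on the same signal numbers as tb_slave_wrapper, sends it either SIGTERM or SIGKILL, and checks whether the handler's output appears. The helper name `run_and_signal`, the log files, and the "ready"/"caught" markers are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo: does a trap handler run for SIGTERM, and for SIGKILL?
tmp=$(mktemp -d)

run_and_signal() {
  sig=$1
  log="$tmp/$sig.log"
  # Child shell installs a trap on the same signals as tb_slave_wrapper,
  # announces that it is ready, then loops forever.
  sh -c 'trap "echo caught; exit 0" 1 2 3 6 15; echo ready; while :; do sleep 1; done' \
    >"$log" 2>/dev/null &
  pid=$!
  # Only signal the child once its trap is known to be installed.
  while ! grep -q ready "$log" 2>/dev/null; do sleep 1; done
  kill -"$sig" "$pid"
  wait "$pid" 2>/dev/null || true
  # The handler's output in the log tells us whether the trap fired.
  if grep -q caught "$log"; then echo yes; else echo no; fi
}

term_result=$(run_and_signal TERM)
kill_result=$(run_and_signal KILL)
echo "TERM: trap ran? $term_result"
echo "KILL: trap ran? $kill_result"
rm -rf "$tmp"
```

In the SIGKILL case the child shell dies without running its handler, and the `sleep` it had started is simply left behind until it exits on its own, which is a small-scale version of the surviving process tree described above.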

Interestingly, the output from the above

  echo "Caught Signal ... killing everything...."

doesn't show up anywhere in <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull> (supporting the theory that cleanup() doesn't run), while other output that apparently stems from similar echo/printf commands in that script does show up there, see

OS:
pwd:/home/tdf/lode/jenkins/workspace/lo_gerrit/Config/linux_clang_dbgutil_64
config mode : linux_clang_dbgutil_64
Taking configuration values from ./distro-configs/Jenkins/linux_clang_dbgutil_64
