Following up on the results of the email thread starting at
<https://lists.freedesktop.org/archives/libreoffice/2019-December/084084.html>
"How are Jenkins builds killed exactly?",
<https://git.libreoffice.org/lode/+/bded43937c6efc82efc5924820a281c8a1ead5ba%5E%21>
"kill-wrapper: pstree of hung processes" had tried to improve the
information provided for a hung and aborted Jenkins build. Typically,
such a build is aborted because one or more tests hang, and it would be
interesting to at least learn which tests hung. To that end, that
commit tried to print pstree output of any leftover processes, but
failed; see the comment at
<https://gerrit.libreoffice.org/c/lode/+/91496/2#message-8e52d669f48a9edb5f183d1221164784059e8959>
"kill-wrapper: pstree of hung processes" for details.

Now,
<https://git.libreoffice.org/lode/+/92c9372417f883781471bade5e703518bd1cd5c6%5E%21>
"Incorporate timeout-on-idle into kill-wrapper, renaming to
timeout-kill-wrapper" and its follow-up
<https://git.libreoffice.org/lode/+/4d6d63299fea804ed7cdf63dde46922ed81b4e8a%5E%21>
"Simplify transition from old kill-wrapper to new timeout kill-wrapper"
fix that by moving the timeout handling from Jenkins into lode's
bin/kill-wrapper. That script now accepts an optional second argument
specifying a stdout/stderr inactivity timeout in seconds, after which
the pstree output is generated and the process tree gets killed. Leaving
the argument out or specifying it as zero disables that timeout logic.
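
To illustrate the idea only (this is not the actual lode script), here
is a minimal Python sketch of such an inactivity timeout. The names
run_with_idle_timeout and idle_timeout are hypothetical, and merging
stderr into stdout, the "pstree -a -p" invocation, and the use of
SIGKILL are assumptions made for the example:

```python
import os
import selectors
import signal
import subprocess
import sys


def run_with_idle_timeout(cmd, idle_timeout=0):
    """Run cmd and forward its output. If idle_timeout > 0 and no output
    arrives for that many seconds, print a process tree and kill the
    whole process group. Returns (exit_status, timed_out)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: watch both streams as one
        start_new_session=True,    # own process group, so children die too
    )
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    timed_out = False
    while True:
        # A timeout of zero (or none given) means: wait indefinitely.
        ready = sel.select(timeout=idle_timeout or None)
        if not ready:
            timed_out = True
            try:
                # Assumption: psmisc's pstree is installed.
                subprocess.run(["pstree", "-a", "-p", str(proc.pid)], check=False)
            except FileNotFoundError:
                print("pstree not available", file=sys.stderr)
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group vanished in the meantime
            break
        chunk = os.read(proc.stdout.fileno(), 4096)
        if not chunk:  # EOF: every process holding the pipe has closed it
            break
        sys.stdout.write(chunk.decode(errors="replace"))
        sys.stdout.flush()
    sel.close()
    proc.stdout.close()
    return proc.wait(), timed_out
```

Calling run_with_idle_timeout(["make", "check"], 900) would, under
these assumptions, abort the build after 15 minutes without any output,
while a timeout of 0 lets it run unattended for as long as it takes.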

For now, I have updated
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/> to use the
new kill-wrapper timeout feature instead of Jenkins' "Abort the build if
it's stuck" option. I am planning to roll it out to other Linux
Jenkins jobs that could benefit from it once it has proven sufficiently
stable.

<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/60539/> is a
live example of such an aborted Gerrit Jenkins job. One noticeable
difference is that such a job is now marked as failed (red dot) rather
than as aborted (gray dot). But a new "kill-wrapper" (i.e.,
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/failure-cause-management/48ce9c26-9d0a-43a8-83d8-c44f54920d59/>)
failure cause label should make the actual reason for the failure
obvious. And the pstree output
(<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/60539/consoleFull#147661240548ce9c26-9d0a-43a8-83d8-c44f54920d59>),
while probably a bit overwhelming, should show that apparently all of
UITest_calc_tests, UITest_calc_tests4, UITest_calc_tests7, UITest_chart,
and UITest_demo_ui hung in this case. That should give at least a hint
where to start local debugging...