More information about hung Jenkins builds

Following up on the email thread starting at <https://lists.freedesktop.org/archives/libreoffice/2019-December/084084.html> "How are Jenkins builds killed exactly?", <https://git.libreoffice.org/lode/+/bded43937c6efc82efc5924820a281c8a1ead5ba%5E%21> "kill-wrapper: pstree of hung processes" tried to improve the information provided for a hung and aborted Jenkins build. Typically, such a build is aborted because one or more tests hang, and it would be useful to learn at least which tests hung. To that end, that commit tried to print pstree output of any leftover processes, but failed; see the comment at <https://gerrit.libreoffice.org/c/lode/+/91496/2#message-8e52d669f48a9edb5f183d1221164784059e8959> "kill-wrapper: pstree of hung processes" for details.

Now, <https://git.libreoffice.org/lode/+/92c9372417f883781471bade5e703518bd1cd5c6%5E%21> "Incorporate timeout-on-idle into kill-wrapper, renaming to timeout-kill-wrapper" and its follow-up <https://git.libreoffice.org/lode/+/4d6d63299fea804ed7cdf63dde46922ed81b4e8a%5E%21> "Simplify transition from old kill-wrapper to new timeout kill-wrapper" fix that by moving the timeout handling from Jenkins into lode's bin/kill-wrapper. (The wrapper now accepts an optional second argument specifying a stdout/stderr inactivity timeout in seconds, after which the pstree output is generated and the process tree is killed. Leaving the argument out or specifying it as zero disables that timeout logic.)
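To illustrate the idea of an output-inactivity timeout, here is a minimal Python sketch. It is not lode's actual bin/kill-wrapper; the function name run_with_idle_timeout, the pstree flags, and the use of SIGKILL on a dedicated process group are all assumptions made for illustration. It is POSIX-only, and it would miss any descendant that moves itself into a different process group.

```python
import os
import select
import shutil
import signal
import subprocess
import sys


def run_with_idle_timeout(cmd, idle_timeout=0):
    """Run cmd and return (exit_code, timed_out).

    Hypothetical helper: if the command writes nothing to stdout/stderr for
    idle_timeout seconds, print a pstree of it and kill its process group.
    An idle_timeout of zero (or less) disables the timeout logic.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # watch both streams through one pipe
        start_new_session=True,     # own process group, so killpg reaches the tree
    )
    fd = proc.stdout.fileno()
    timeout = idle_timeout if idle_timeout > 0 else None  # None: wait forever
    timed_out = False
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            # No output within the allowed window: report, then kill.
            timed_out = True
            if shutil.which("pstree"):  # assumed flags: -a arguments, -p PIDs
                subprocess.run(["pstree", "-a", "-p", str(proc.pid)], check=False)
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process group vanished in the meantime
            break
        chunk = os.read(fd, 65536)
        if not chunk:
            break  # end of file: the command closed its output
        sys.stdout.write(chunk.decode(errors="replace"))
        sys.stdout.flush()
    proc.stdout.close()
    return proc.wait(), timed_out
```

Calling run_with_idle_timeout(["make", "check"], 900) would, under these assumptions, pass the build output through and only step in after 15 minutes of silence; the command and the number are just placeholders.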

For now, I have updated <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/> to use the new kill-wrapper timeout feature instead of Jenkins' "Abort the build if it's stuck" option. (I am planning to roll it out to other Linux Jenkins jobs that could benefit from it, once it has proven sufficiently stable.)

<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/60539/> is a live example of such an aborted Gerrit Jenkins job. One noticeable difference is that such a job is now marked as failed (red dot) rather than as aborted (gray dot). But a new "kill-wrapper" failure cause label (i.e., <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/failure-cause-management/48ce9c26-9d0a-43a8-83d8-c44f54920d59/>) should make the actual reason for the failure obvious. And the pstree output (<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/60539/consoleFull#147661240548ce9c26-9d0a-43a8-83d8-c44f54920d59>), while probably a bit overwhelming, should show that apparently all of UITest_calc_tests, UITest_calc_tests4, UITest_calc_tests7, UITest_chart, and UITest_demo_ui hung in this case. That should give at least a hint as to where to start local debugging...



