Re: Figure out what killed an app (rhbz#2253099)

* Yanko Kaneti:

> On Sat, 2024-02-03 at 13:29 +0100, Ondrej Mosnáček wrote:
>> On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti <yaneti@xxxxxxxxxxx> wrote:
>> > 
>> > On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
>> > > On Thu, 1 Feb 2024 at 09:13, Milan Crha <mcrha@xxxxxxxxxx> wrote:
>> > > > The kernel tracing log for sig==9 shows:
>> > > > 
>> > > > gnome-terminal--2924    [002] dN.2.  2520.462889: signal_generate:
>> > > > sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
>> > > > 
>> > > > There is no such thing (apart from the tracing log) when Evolution is
>> > > > suddenly killed; the logs are silent. That makes me believe it's not the
>> > > > OOM killer that kills the application. I'm back at square 1; or maybe
>> > > > square 2, as I now know the kernel possibly sends it, but not why.
>> > > 
>> > > Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
>> > > root) and then collect the kernel trace again. That should give you a
>> > > backtrace of kernel functions from the signal generation, which could
>> > > help you/us to figure out the reason the process was killed.
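>> > >
>> > > Roughly, the full setup would be something along these lines, all as
>> > > root (the paths assume tracefs is mounted at /sys/kernel/tracing):
>> > >
>> > >  # enable the signal_generate tracepoint and filter it on SIGKILL
>> > >  echo 1 > /sys/kernel/tracing/events/signal/signal_generate/enable
>> > >  echo 'sig == 9' > /sys/kernel/tracing/events/signal/signal_generate/filter
>> > >  # record a kernel stack trace together with each event
>> > >  echo stacktrace > /sys/kernel/tracing/trace_options
>> > >  # reproduce the kill, then read the buffer
>> > >  cat /sys/kernel/tracing/trace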
>> > 
>> > So, I figured the easiest way to trigger the kill here is to put load
>> > on the machine:
>> > 
>> >  $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
>> > 
>> > Then starting evolution seems to do it almost every time shortly after
>> > startup (I have around 200k messages in the folder it's trying to display).
>> > 
>> > I've enabled the stacktrace tracing option and, like Milan, set a sig==9
>> > filter. Here is what I got in the trace buffer after it was killed:
>> > 
>> > # tracer: nop
>> > #
>> > # entries-in-buffer/entries-written: 34/34   #P:16
>> > #
>> > #                                _-----=> irqs-off/BH-disabled
>> > #                               / _----=> need-resched
>> > #                              | / _---=> hardirq/softirq
>> > #                              || / _--=> preempt-depth
>> > #                              ||| / _-=> migrate-disable
>> > #                              |||| /     delay
>> > #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
>> > #              | |         |   |||||     |         |
>> >        evolution-9096    [002] d..1.  1207.016489: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
>> >        evolution-9096    [002] d..1.  1207.016495: <stack trace>
>> >  => trace_event_raw_event_signal_generate
>> >  => __send_signal_locked
>> >  => posix_cpu_timers_work
>> >  => task_work_run
>> >  => irqentry_exit_to_user_mode
>> >  => asm_sysvec_apic_timer_interrupt
>> 
>> So, browsing through the relevant kernel code, it seems the only cases
>> which could have led to this backtrace are:
>> 1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit
>> (see function check_thread_timers() in
>> kernel/time/posix-cpu-timers.c).
>> 2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see
>> function check_process_timers() in kernel/time/posix-cpu-timers.c).
>> 
>> I may have missed some code path, but these resource limits should be
>> the next thing to check.
>
> Thanks a lot for diving into it. This is probably it: the kernel kills
> the main evolution process, which had been given RT priority through
> some WebKit coincidence.
>
> Running with rtkit-daemon stopped and masked prevents the main evo
> process from being prioritized, and consequently from being killed under
> higher load.
>
> The fact that the kernel kills the (main) thread without a peep anywhere
> in the logs is strange to me.
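>
> For reference, the workaround, plus one way to check whether any
> evolution threads actually got a realtime policy (the exact ps columns
> here are just an example; cls shows RR or FF for realtime threads):
>
>  # list evolution threads with their scheduling class and RT priority
>  $ ps -C evolution -Lo pid,tid,cls,rtprio,comm
>  # keep rtkit-daemon from handing out RT priorities at all
>  $ sudo systemctl stop rtkit-daemon
>  $ sudo systemctl mask rtkit-daemon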

Interesting analysis, thanks!

I think it's not so unusual for the kernel not to log the SIGKILL.  This
does not seem to be purely an emergency measure like the OOM killer—it's
more like something that an application (set of processes) told the
kernel to do to itself.  As such, it's not unreasonable for the
application to do the logging.  On the other hand, I think it's
difficult for rtkit-daemon to arrange for that logging because it can't
monitor the target process directly.

Any idea why the signal is SIGKILL and not SIGXCPU?  To ensure that the
process actually terminates promptly, without running signal handlers or
generating a coredump?  SIGXCPU would point more directly to the cause.
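
For what it's worth, setrlimit(2) describes the same split for both of
those limits: crossing the soft limit raises SIGXCPU, and only the hard
limit triggers SIGKILL.  So a plain SIGKILL suggests the hard limit was
hit, perhaps because soft and hard were set to the same value.  If
someone wants to see the behaviour in isolation, something along these
lines should do it (arbitrary two-second budget, needs root for the RT
scheduling, and it pins one CPU while it runs):

 # busy-loop under SCHED_RR with RLIMIT_RTTIME soft = hard = 2 s
 # (prlimit takes microseconds); after about two seconds of
 # uninterrupted RT CPU time the kernel delivers SIGKILL, matching
 # the trace above
 $ sudo prlimit --rttime=2000000 chrt -r 10 sh -c 'while :; do :; done'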

Thanks,
Florian



