Re: another bizarre thing...

"Young, Gregory" <gregory.young@xxxxxxxxxxxxxx> · Thu, 8 Aug 2019 17:06:06 +0000

Is this on both EL6 and EL7? If only EL7, control groups could be causing the issue. The idea of cgroups here is to keep leftover ("zombie") processes from outliving the service that started them. But if your program needs to spawn another process and then restart itself while that other process continues to run, you need to launch the other process in a different control group, or shutting down the parent process will also kill the child. In my case, we have an upgrade script that gets called and then shuts down the calling process in order to upgrade it. For example:

# Clear any errors in the upgrade control group.
/bin/systemctl reset-failed upgrade-trigger

# Launch the upgrader in its own control group.
/bin/systemd-run --unit=upgrade-trigger --slice=upgrade-trigger /bin/bash /opt/myapp/Upgrade.sh "$1" "$2"

If we don't do this, the upgrade fails because the upgrader gets terminated when the parent application is shut down.
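
One way to check that the upgrader really ended up in its own control group is to compare the cgroup paths of the two processes. What follows is only a minimal sketch: the helper names parse_cgroup, cgroup_of and same_cgroup are made up for illustration, and it assumes the usual /proc/<pid>/cgroup layout of "hierarchy-id:controllers:path" lines, with the systemd hierarchy labelled name=systemd on EL7 (cgroup v1) or an empty controller field on a cgroup v2 system.

```shell
# Hypothetical helpers, not part of systemd or of the upgrade script above.

# Read /proc/<pid>/cgroup-style lines on stdin and print the path of the
# systemd hierarchy (cgroup v1) or of the unified hierarchy (cgroup v2).
parse_cgroup() {
    awk -F: 'NF >= 3 && ($2 == "name=systemd" || $2 == "") {
        sub(/^[^:]*:[^:]*:/, "")   # drop the "hierarchy-id:controllers:" prefix
        print
        exit
    }'
}

# Print the cgroup path of the given PID.
cgroup_of() {
    parse_cgroup < "/proc/$1/cgroup"
}

# Succeed only if both PIDs resolve to the same, non-empty cgroup path.
same_cgroup() {
    a=$(cgroup_of "$1") && b=$(cgroup_of "$2") && [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For instance, if same_cgroup succeeds for the PID of the application and the PID of the running upgrader, the upgrader is probably still in the parent's control group and would be killed along with it. "systemctl status upgrade-trigger" should show the same placement in its CGroup line.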

Gregory Young 

-----Original Message-----
From: CentOS <centos-bounces@xxxxxxxxxx> On Behalf Of Fred Smith
Sent: August 7, 2019 1:39 PM
To: centos@xxxxxxxxxx
Subject: Re:  another bizarre thing...

On Mon, Aug 05, 2019 at 08:57:45PM -0400, Fred Smith wrote:
> Hi all!
> 
> I'm stuck on something really bizarre that is happening to a product I 
> "own" at work. It's a C program, built on CentOS, runs on CentOS or 
> RHEL, has been in circulation since the early 00's, is in use at 
> hundreds of sites.
> 
> recently, at multiple customer sites it has started just going away.
> no core file (yes, ulimit is configured), nothing in any of its
> (several) log files. it's just gone.
> 
> running it under strace until it dies reveals that every thread has 
> been given a SIGKILL.
> 
> How does one figure out who delivered a SIGKILL? For other, non-fatal, 
> signals it is possible to glean the PID of the sending process in a 
> signal handler, but obviously you can't do that for SIGKILL because 
> the app doesn't survive the signal.
> 
> I'm grasping at straws here, and am open to almost any kind of 
> suggestion that can be followed-up (as compared to "beats me" which is 
> where I am now).

OK, more information.

Found a recipe to cause systemtap to emit a line of text identifying the sender of the SIGKILL.

# Report the sender of every SIGKILL.
probe signal.send {
  if (sig_name == "SIGKILL")
    printf("%s was sent to %s (pid:%d) by %s uid:%d\n",
           sig_name, pid_name, sig_pid, execname(), uid())
}

unfortunately, it says the program is killing itself:

	SIGKILL was sent to myprog (pid:12269) by myprog uid:1000

So now I'm wondering how one figures that out. Nowhere in my source code does it explicitly raise any signal, much less SIGKILL.
So there must be some underlying library or system call or something doing it.

--
---- Fred Smith -- fredex@xxxxxxxxxxxxxxxxxxxxxx -----------------------------
                       I can do all things through Christ 
                              who strengthens me.
------------------------------ Philippians 4:13 -------------------------------
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos