Re: SuccessExitStatus, user slice, SSH?

Steve Traylen <steve.traylen@xxxxxxx> · Thu, 31 Oct 2024 15:09:12 +0100

On 31/10/2024 12:48, Lennart Poettering wrote:

On Thu, 31.10.24 10:20, Steve Traylen (steve.traylen@xxxxxxx) wrote:

Hi,

I was trying to suppress user scope units that are considered failed
because they required a SIGKILL. A typical log might be:

Oct 30 10:27:55 node989.example.ch systemd[1]: session-3804.scope: Killing
process 1550946 (node) with signal SIGKILL.
Oct 30 10:29:25 node989.example.ch systemd[1]: session-3804.scope: Still
around after SIGKILL. Ignoring.
Oct 30 10:29:25 node989.example.ch systemd[1]: session-3804.scope: Failed
with result 'timeout'.
Oct 30 10:29:25 node989.example.ch systemd[1]: session-3804.scope: Consumed
1min 30.745s CPU time.

I doubt increasing the timeout will help. I had thought that a

# /etc/systemd/system/user-.slice.d/ignore-timeout.conf

[Slice]
SuccessExitStatus=SIGKILL

might help, but alas SuccessExitStatus can only be set on services, it
seems.

SuccessExitStatus is really just about process exit statuses,
i.e. about the waitid() info that the service manager will see for the
main service process.

In your case the scope fails due to the timeout, not because the
service would exit due to a SIGKILL (after all, it *doesn't* exit even
though the SIGKILL was sent).
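
To illustrate the exit status point above, here is a minimal Python
sketch (not systemd code) of what a parent sees when a child really is
terminated by SIGKILL. The helper name is made up for this example, and
Python's subprocess module reports "terminated by signal N" as a
return code of -N.

```python
import signal
import subprocess
import sys


def wait_status_after_sigkill():
    """Hypothetical helper: start a child, SIGKILL it, return its wait status."""
    # The child would otherwise just sleep for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    # wait() reaps the child; a negative value means "killed by that signal".
    return child.wait()


if __name__ == "__main__":
    status = wait_status_after_sigkill()
    print("killed by SIGKILL:", status == -signal.SIGKILL)
```

This is the kind of per-process information SuccessExitStatus is
about. In the log above the process never exits, so the parent never
gets such a status for it.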

Note that a process that is unkillable even by SIGKILL usually
indicates some driver/kernel bug. Unkillable processes are not the
norm on Linux.

I am not entirely sure what we are trying to do? You don't want to see the
logs about the scope not going away? We don't support suppressing such
logs for scope units.

Or are you concerned that these .scope units stick around? Well, they
do that because they still contain a process. We do not support a
mechanism to hide units that still have running processes, sorry. And
I am pretty sure we should not support this.

I'd recommend figuring out why these processes do not react to SIGKILL
instead of hiding the issue.

Yes, I did not read that log carefully enough. That one is super bad: if
SIGKILL does not work, it should not be hidden. What I thought I was
asking about was this case:

Oct 30 12:32:30 node9107.example.ch systemd[1]: session-12336.scope: 
Stopping timed out. Killing.
Oct 30 12:32:30 node9107.example.ch systemd[1]: session-12336.scope: 
Killing process 2227059 (bash) with signal SIGKILL.
Oct 30 12:32:49 node9107.example.ch systemd[1]: session-12336.scope: 
Failed with result 'timeout'.

This was the less bad version, where I read it as the SIGTERM having
failed but the SIGKILL having been successful.
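
A rough Python sketch of how the two cases could be told apart from the
journal text alone. It assumes the message wording stays exactly as in
the two logs quoted in this thread; the function name and the labels it
returns are made up for this example.

```python
import re


def classify_scope_stop(lines):
    """Hypothetical helper: classify the journal lines of one scope unit."""
    # Journal lines may be wrapped by the mailer, so join and collapse whitespace.
    text = re.sub(r"\s+", " ", "\n".join(lines))
    if "Still around after SIGKILL" in text:
        # The process survived SIGKILL (the "super bad" case).
        return "unkillable"
    if "with signal SIGKILL" in text and "Failed with result 'timeout'" in text:
        # SIGKILL was sent after the stop timeout, with no "Still around" message.
        return "killed-after-timeout"
    return "other"


if __name__ == "__main__":
    sample = [
        "session-12336.scope: Stopping timed out. Killing.",
        "session-12336.scope: Killing process 2227059 (bash) with signal SIGKILL.",
        "session-12336.scope: Failed with result 'timeout'.",
    ]
    print(classify_scope_stop(sample))
```

This only labels what the log says. It does not change how the unit
result is recorded.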

In this less bad case, is it still a "Failed with result 'timeout'"
because SIGTERM "fails" and it relies on SIGKILL, or does that mean the
SIGKILL itself is failing?

I was hoping not to see the logs (as a failure) when it resorts to a
successful SIGKILL.

Steve.

Lennart

--
Lennart Poettering, Berlin