Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Tue, 04 Jan 2022 13:47:05 -0600

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Mon, Dec 13, 2021 at 2:54 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>
>>
>>         if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> -               if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> -                       return sig == SIGKILL;
>> +               struct core_state *core_state = signal->core_state;
>> +               if (core_state) {
>
> This change is very confusing.
>
> Also, why does it do that 'signal->core_state->dumper.task', when we
> already know that it's the same as 'signal->group_exit_task'?
>
> The only thing that sets 'signal->core_state' also sets
> 'signal->group_exit_task', and the call chain has set both to the same
> task.
>
> So the code is odd and makes little sense.

As you say signal->group_exit_task, and core_state->dumper.task point to
the same task.  So it may be a little silly when viewed independently of
everything else to use core_state->dumper.task instead of
group_exit_task as it is an extra cache line dereference.

The thing is signal->group_exit_task is only used by coredumps currently
as a flag to tell signal_group_exit to return true.  It is exec that
actually uses signal->group_exit_task in conjunction with
signal->notify_count to wake itself up.

Using a pointer as a flag and not for it's value.  Having different
semantics for who sets the pointer.  All of those are weird enough
I just want to make signal->group_exit_task to go away.

By using core_state->dumper.task I was able to make
signal->group_exit_task exclusive to the exec case in the following
changes, and to rename it signal->group_exec_task so no one gets
confused what the field is for.

> But what's even more odd is how it
>
>  (a) sends the SIGKILL to somebody else
>
>  (b) does *NOT* send SIGKILL to itself
>
> Now, (a) is explained in the commit message. The intent is to signal
> the core dumper.

Which is the a specific thread of the target process, and it is
the only thread running of the target process.

> But (b) looks like a fundamental change in semantics. The target of
> the SIGKILL is still running, might be in some loop in the kernel that
> wants to be interrupted by a fatal signal, and you expressly disabled
> the code that would send that fatal signal.
>
> If I send SIGKILL to thread A, then that SIGKILL had *better* be
> delivered. To thread A, which may be in a "mutex_lock_killable()" or
> whatever else.
>
> The fact that thread B may be in the process of trying to dump core
> doesn't change that at all, as far as I can see.
>
> So I think this patch is fundamentally buggy and wrong. Or at least
> needs much more explanation of why you'd not send SIGKILL to the
> target thread.

If you look at zap_threads.  You can observe that it takes the siglock,
sets SIGNAL_GROUP_COREDUMP, and sets signal->core_state and in
zap_process makes SIGKILL pending is the per-task sigset, and calls
signal_wake_up on every task.

This case in prepare_signal happens after that.  After every task
has been told to die, and __fatal_signal_pending is true for all of
them if they have not reached do_exit yet.

If you look in zap_threads you will see that the core dumping thread
clears TIF_SIGPENDING, and in general makes fatal_signal_pending false
for itself.  But keep in mind that this thread because it is dumping
core is already on the path to do_exit.  It has already processed a
fatal signal.

So in the special case I only worry about the dumping task as it is the
only task after zap_threads that does not have fatal_signal_pending.

This is different than the ordinary case of delivering SIGKILL
where complete_signal makes SIGKILL pending in the per-task sigset
of every task in the process.

Currently I suspect changing wait_event_uninterruptible to
wait_event_killable, is causing problems.

Or perhaps there is some reason tasks that have already entered do_exit
need to have fatal_signal_pending set. (The will have
fatal_signal_pending set up until they enter get_signal which calls
do_group_exit which calls do_exit).

Which is why I am trying to reproduce the reported failure so I can get
the kernel to tell me what is going on.  If this is not resolved quickly
I won't send you this change, and I will pull it out of linux-next.

Eric