Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach

Bernd Edlinger <bernd.edlinger@xxxxxxxxxx> · Wed, 17 Jan 2024 16:07:37 +0100

On 1/16/24 16:22, Oleg Nesterov wrote:
> I'll try to recall this problem and actually read the patch tomorrow...
> 
> Hmm. but it doesn't apply to Linus's tree, you need to rebase it.
> In particular, please note the recent commit 5431fdd2c181dd2eac2
> ("ptrace: Convert ptrace_attach() to use lock guards")
> 

Oh, how ugly...
Will this new C++-like "feature" ever make it into a stable branch?

> On 01/15, Bernd Edlinger wrote:
>>
>> The problem happens when a tracer tries to ptrace_attach
>> to a multi-threaded process, that does an execve in one of
>> the threads at the same time, without doing that in a forked
>> sub-process.  That means: There is a race condition, when one
>> or more of the threads are already ptraced, but the thread
>> that invoked the execve is not yet traced.  Now in this
>> case the execve locks the cred_guard_mutex and waits for
>> de_thread to complete.  But that waits for the traced
>> sibling threads to exit, and those have to wait for the
>> tracer to receive the exit signal, but the tracer cannot
>> call wait right now, because it is waiting for the ptrace
>> call to complete, and this never happens.
>> The traced process and the tracer are now in a deadlock
>> situation, and can only be killed by a fatal signal.
> 
> This looks very confusing to me. And even misleading.
> 
> So IIRC the problem is "simple".
> 
> de_thread() sleeps with cred_guard_mutex waiting for other threads to
> exit and pass release_task/__exit_signal.
> 
> If one of the sub-threads is traced, debugger should do ptrace_detach()
> or wait() to release this tracee, the killed tracee won't autoreap.
> 

Yes, but the tracer has to do its job, and that is to ptrace_attach the
remaining threads. It does not know that it would avoid a dead-lock
if it called wait() instead of ptrace_attach.  It does not know
that the tracee has just called execve in one of the not yet traced
threads.
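
The dependency chain can be written down as a tiny wait-for graph.
This is only a userspace sketch, not kernel code: the actor names and
the helpers has_wait_cycle() and scenario_deadlocks() are hypothetical
and exist just to illustrate that the cycle closes when the tracer
blocks in ptrace_attach, and stays open when it calls wait() instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each actor waits on at most one other actor. */
enum { EXEC_THREAD, TRACED_SIBLING, TRACER, NUM_ACTORS };
#define NOBODY (-1)

/* Follow the "waits on" edges from every actor; a return to the
 * starting actor means the wait-for graph contains a cycle. */
bool has_wait_cycle(const int waits_on[NUM_ACTORS])
{
	for (int start = 0; start < NUM_ACTORS; start++) {
		int cur = start;

		for (int steps = 0; steps < NUM_ACTORS && cur != NOBODY; steps++) {
			cur = waits_on[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}

/* Build the scenario from the discussion and report whether it deadlocks. */
bool scenario_deadlocks(bool tracer_calls_attach)
{
	int waits_on[NUM_ACTORS];

	/* de_thread() holds cred_guard_mutex and waits for the sibling to go away */
	waits_on[EXEC_THREAD] = TRACED_SIBLING;
	/* the traced sibling is not released until the tracer reaps it */
	waits_on[TRACED_SIBLING] = TRACER;
	/* ptrace_attach blocks on cred_guard_mutex; wait() blocks on nobody here */
	waits_on[TRACER] = tracer_calls_attach ? EXEC_THREAD : NOBODY;

	return has_wait_cycle(waits_on);
}
```

In this model scenario_deadlocks(true) reports a cycle and
scenario_deadlocks(false) does not, but the tracer has no way to
know which of the two calls is the safe one.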

> Now. If debugger tries to take the same cred_guard_mutex before
> detach/wait we have a deadlock. This is not specific to ptrace_attach(),
> proc_pid_attr_write() takes this lock too.
> 
> Right? Or are there other issues?
> 

No, proc_pid_attr_write has no problem if it waits for cred_guard_mutex,
because it is only called from one of the sibling threads, and
zap_other_threads sends a SIGKILL to each of them, thus the
mutex_lock_interruptible will stop waiting, and the thread will
exit normally.
It is only problematic when another process wants to lock the cred_guard_mutex,
because that process does not receive a signal while de_thread is waiting.
The only other place where I am aware of this happening is ptrace_attach.

>> -static int de_thread(struct task_struct *tsk)
>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>  {
>>  	struct signal_struct *sig = tsk->signal;
>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>  	spinlock_t *lock = &oldsighand->siglock;
>> +	struct task_struct *t = tsk;
>> +	bool unsafe_execve_in_progress = false;
>>
>>  	if (thread_group_empty(tsk))
>>  		goto no_thread_group;
>> @@ -1066,6 +1068,19 @@ static int de_thread(struct task_struct *tsk)
>>  	if (!thread_group_leader(tsk))
>>  		sig->notify_count--;
>>
>> +	while_each_thread(tsk, t) {
> 
> for_other_threads()
> 

Ah, okay.

>> +		if (unlikely(t->ptrace)
>> +		    && (t != tsk->group_leader || !t->exit_state))
>> +			unsafe_execve_in_progress = true;
> 
> The !t->exit_state is not right... This sub-thread can already be a zombie
> with ->exit_state != 0 but see above, it won't be reaped until the debugger
> does wait().
> 

I don't think so.
de_thread() handles the group_leader differently from normal threads.
That means normal threads have to wait to be released from the zombie
state by the tracer:
sig->notify_count > 0, and de_thread is woken up by __exit_signal.
Once those are gone, de_thread waits for the group leader to reach
exit_state = ZOMBIE, but again only if the group_leader is not the
current thread:
signal->notify_count < 0, and de_thread is woken up by exit_notify.
So this reflects exactly what condition has to be met, see:

                        sig->notify_count = -1;
                        if (likely(leader->exit_state))
                                break;
                        __set_current_state(TASK_KILLABLE);
                        write_unlock_irq(&tasklist_lock);
                        cgroup_threadgroup_change_end(tsk);
                        schedule();
                        if (__fatal_signal_pending(tsk))
                                goto killed;

so when the group_leader's exit_state is already != 0 then the
second wait state will not be entered.
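
The condition from the patch can be restated as a small userspace
predicate. This is a sketch under assumptions, not kernel code: the
function name wait_needs_tracer() and its three parameters are
hypothetical stand-ins for t->ptrace, (t == tsk->group_leader) and
t->exit_state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the check in the patch:
 *   t->ptrace && (t != tsk->group_leader || !t->exit_state)
 *
 * A traced sub-thread always counts, even as a zombie, because it is
 * not released until the tracer acts.  A traced group leader only
 * counts while its exit_state is still 0, because de_thread() merely
 * waits for the leader to reach a non-zero exit_state.
 */
bool wait_needs_tracer(bool traced, bool is_group_leader, int exit_state)
{
	return traced && (!is_group_leader || exit_state == 0);
}
```

With this model a traced zombie sub-thread still yields true, while a
traced group leader that is already a zombie yields false, which is
the distinction the two wait states above are meant to capture.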

>> +	if (unlikely(unsafe_execve_in_progress)) {
>> +		spin_unlock_irq(lock);
>> +		sig->exec_bprm = bprm;
>> +		mutex_unlock(&sig->cred_guard_mutex);
>> +		spin_lock_irq(lock);
> 
> I don't understand why do we need to unlock and lock siglock here...
> 

That is just a precaution because I did want to release the
mutexes exactly in the reverse order as they were acquired.

> But my main question is why do we need the unsafe_execve_in_progress boolean.
> If this patch is correct and de_thread() can drop and re-acquire cred_guard_mutex
> when one of the threads is traced, then why can't we do this unconditionally ?
> 

I just wanted to keep the impact of the change as small as possible, including
possible performance degradation due to double checking of credentials.
The worst thing that could happen with this approach is that a situation where today
a dead-lock is imminent still does not work correctly, but when no tracer is attached,
nothing will change or be less performant than before.

Bernd.