Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 21 Jun 2021 23:05:23 +0000

On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
> Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
> 
> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> >> 
> >> > And I think our horrible "kernel threads return to user space when
> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> >> > mostly of the historical horror sort.
> >> 
> >> How would you prefer to handle that, then?  Separate magical path from
> >> kernel_execve() to switch to userland?  We used to have something of
> >> that sort, and that had been a real horror...
> >> 
> >> As it is, it's "kernel thread is spawned at the point similar to
> >> ret_from_fork(), runs the payload (which almost never returns) and
> >> then proceeds out to userland, same way fork(2) would've done."
> >> That way kernel_execve() doesn't have to do anything magical.
> >> 
> >> Al, digging through the old notes and current call graph...
> >
> > 	FWIW, the major assumption back then had been that get_signal(),
> > signal_delivered() and all associated machinery (including coredumps)
> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
> >
> > 	And "has complete registers on stack" is only a part of that;
> > there was other fun stuff in the area ;-/  Do we want coredumps for
> > those, and if we do, will the de_thread stuff work there?
> 
> Do we want coredumps from processes that use io_uring? yes
> Exactly what we want from io_uring threads is less clear.  We can't
> really give much that is meaningful beyond the thread ids of the
> io_uring threads.
> 
> What problems are you seeing beyond the missing registers on the
> stack for kernel threads?
> 
> I don't immediately see the connection between coredumps and de_thread.
> 
> The function de_thread arranges for the fatal_signal_pending to be true,
> and that should work just fine for io_uring threads.  The io_uring
> threads process the fatal_signal with get_signal and then proceed to
> exit eventually calling do_exit.

I would like to see testing of the cases where the io_uring thread is
the one hit by the initial signal and where it's a normal thread
with associated io_uring ones.  The thread-collecting logic at least
used to depend upon fairly subtle assumptions, and "kernel threads
obviously can't show up as candidates" used to narrow the analysis
down...

In any case, WTF would we allow reads or writes to *any* registers of
such threads?  It's not as simple as "just return zeroes", BTW - the
values allowed in special registers might have non-trivial constraints
on them.  The same goes for coredump - we don't _have_ registers to
dump for those, period.

Looks like the first things to do would be
	* prohibit ptrace from accessing any regsets of worker threads
	* make coredump skip all register notes for those
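
To make the two items concrete, here is a rough userspace model of the
policy, not kernel code.  Everything in it is hypothetical: the
model_task struct, the MODEL_PF_* flag values, the function names, and
the choice of -EPERM as the error are assumptions made purely for
illustration and do not correspond to existing kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the values are arbitrary, not the kernel's. */
#define MODEL_PF_IO_WORKER 0x1u	/* io_uring worker thread */
#define MODEL_PF_KTHREAD   0x2u	/* ordinary kernel thread */

/* Hypothetical stand-in for the per-thread state we care about. */
struct model_task {
	int tid;
	unsigned int flags;
};

/* Only threads that entered the kernel from userland have user registers. */
static bool model_has_user_regs(const struct model_task *t)
{
	return !(t->flags & (MODEL_PF_IO_WORKER | MODEL_PF_KTHREAD));
}

/*
 * Item 1: regset access via ptrace is refused outright for workers,
 * rather than faked with zeroes (the error code is an assumption).
 */
static int model_regset_access(const struct model_task *t)
{
	return model_has_user_regs(t) ? 0 : -EPERM;
}

/*
 * Item 2: the coredump walks all threads but emits register notes
 * only for those that actually have registers to dump.
 */
static size_t model_count_register_notes(const struct model_task *threads,
					 size_t n)
{
	size_t notes = 0;

	assert(threads != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (model_has_user_regs(&threads[i]))
			notes++;
	return notes;
}
```

In this model a process with two normal threads and one worker would
get register notes for two threads, and a regset request aimed at the
worker would fail instead of returning made-up values.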

Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
current->ptrace in the caller of the latter.  So fork and exec side
has ptrace_event() crap limited to real syscalls.

It's seccomp[1] and exit-related stuff that are messy...

[1] "never trust somebody who introduces himself as Honest Joe and keeps
carping on that all the time"; cf. __secure_computing(), CONFIG_INTEGRITY,
etc.