Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Tue, 22 Jun 2021 11:39:40 -0500

Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:

And I think our horrible "kernel threads return to user space when
done" is absolutely horrifically nasty. Maybe of the clever sort, but
mostly of the historical horror sort.

How would you prefer to handle that, then?  Separate magical path from
kernel_execve() to switch to userland?  We used to have something of
that sort, and that had been a real horror...

As it is, it's "kernel thread is spawned at the point similar to
ret_from_fork(), runs the payload (which almost never returns) and
then proceeds out to userland, same way fork(2) would've done."
That way kernel_execve() doesn't have to do anything magical.

Al, digging through the old notes and current call graph...

	FWIW, the major assumption back then had been that get_signal(),
signal_delivered() and all associated machinery (including coredumps)
runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.

	And "has complete registers on stack" is only a part of that;
there was other fun stuff in the area ;-/  Do we want coredumps for
those, and if we do, will the de_thread stuff work there?

Do we want coredumps from processes that use io_uring?  Yes.
Exactly what we want from io_uring threads is less clear.  We can't
really give much that is meaningful beyond the thread ids of the
io_uring threads.

What problems are you seeing beyond the missing registers on the
stack for kernel threads?

I don't immediately see the connection between coredumps and de_thread.

The function de_thread arranges for the fatal_signal_pending to be true,
and that should work just fine for io_uring threads.  The io_uring
threads process the fatal_signal with get_signal and then proceed to
exit eventually calling do_exit.

I would like to see the testing in cases when the io-uring thread is
the one getting hit by initial signal and when it's the normal one
with associated io-uring ones.  The thread-collecting logic at least
used to depend upon fairly subtle assumptions, and "kernel threads
obviously can't show up as candidates" used to narrow the analysis
down...

In any case, WTF would we allow reads or writes to *any* registers of
such threads?  It's not as simple as "just return zeroes", BTW - the
values allowed in special registers might have non-trivial constraints
on them.  The same goes for coredump - we don't _have_ registers to
dump for those, period.

Looks like the first things to do would be
	* prohibit ptrace accessing any regsets of worker threads
	* make coredump skip all register notes for those

Skipping register notes is fine.  Prohibiting ptrace access to any
regsets of worker threads is interesting.  I think that was tried and
shown to confuse gdb.  So the conclusion was just to provide a fake set
of registers.

Which appears to work up to the point of dealing with architectures
that have their magic callee-saved optimization (like alpha and m68k),
and no check that all of the registers were saved when accessed.  Adding
a dummy switch stack frame for the kernel threads on those architectures
looks like a good/cheap solution at first glance.
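
To make the dummy-frame idea concrete, here is a rough userspace model,
not kernel code.  Everything named model_* or MODEL_* is hypothetical:
always_saved stands in for the registers every kernel entry saves,
switch_frame for the registers that only live in the switch-stack frame.
The "is the frame there" check is itself an assumption; it is exactly the
check the paragraph above says those architectures lack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: registers 0..3 are always saved on entry,
 * registers 4..7 exist only in the extra switch-stack frame. */
#define MODEL_NR_ALWAYS 4
#define MODEL_NR_SWITCH 4

struct model_task {
	bool has_switch_frame;                        /* was the extra frame set up? */
	unsigned long always_saved[MODEL_NR_ALWAYS];  /* stands in for pt_regs */
	unsigned long switch_frame[MODEL_NR_SWITCH];  /* stands in for switch_stack */
};

/* Spawn a worker thread with a zero-filled dummy switch frame. */
void model_spawn_worker(struct model_task *t)
{
	assert(t != NULL);
	memset(t, 0, sizeof(*t));
	t->has_switch_frame = true;
}

/* Read register n on behalf of a debugger.
 * Returns 0 and stores the value, or -1 if the register does not exist
 * or its frame was never saved (instead of reading stale stack). */
int model_read_reg(const struct model_task *t, unsigned int n,
		   unsigned long *val)
{
	assert(t != NULL && val != NULL);
	if (n < MODEL_NR_ALWAYS) {
		*val = t->always_saved[n];
		return 0;
	}
	if (n >= MODEL_NR_ALWAYS + MODEL_NR_SWITCH)
		return -1;
	if (!t->has_switch_frame)
		return -1;
	*val = t->switch_frame[n - MODEL_NR_ALWAYS];
	return 0;
}
```

With the dummy frame in place every register of the worker reads back as
zero; without it the model refuses the read, where the real code would
return whatever happened to be on the kernel stack.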

Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
current->ptrace in the caller of the latter.  So fork and exec side
has ptrace_event() crap limited to real syscalls.

That is where I thought we were.  Thanks for confirming that.
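
A toy model of the gating described above, as I understand it.  This is
a userspace sketch, not the kernel's code: the MODEL_* flag values and
the model_* helpers are made up for illustration, and the real checks
live in the fork and exec paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins, not the kernel's real flag values. */
#define MODEL_CLONE_UNTRACED 0x1u
#define MODEL_EVENT_FORK     1
#define MODEL_EVENT_EXEC     4
#define MODEL_EVENT_BIT(ev)  (1u << (ev))

struct model_task {
	unsigned int ptrace;	/* enabled-event bits; 0 means "not traced" */
};

/* A fork reports to the tracer only if the caller did not pass
 * CLONE_UNTRACED and the parent has the fork event enabled. */
bool model_fork_reports(const struct model_task *parent,
			unsigned int clone_flags)
{
	assert(parent != NULL);
	if (clone_flags & MODEL_CLONE_UNTRACED)
		return false;
	return (parent->ptrace & MODEL_EVENT_BIT(MODEL_EVENT_FORK)) != 0;
}

/* The kernel_thread() case: CLONE_UNTRACED is always added. */
bool model_kernel_thread_reports(const struct model_task *parent)
{
	return model_fork_reports(parent, MODEL_CLONE_UNTRACED);
}

/* An exec reports only if the exec'ing task has the event enabled;
 * a kernel thread calling kernel_execve() has ptrace == 0. */
bool model_exec_reports(const struct model_task *task)
{
	assert(task != NULL);
	return (task->ptrace & MODEL_EVENT_BIT(MODEL_EVENT_EXEC)) != 0;
}
```

In this model a traced task doing an ordinary fork or exec reports the
event, while the kernel_thread() path and an untraced kernel thread doing
the exec never do, which is the "limited to real syscalls" property.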

It's seccomp[1] and exit-related stuff that are messy...

[1] "never trust somebody who introduces himself as Honest Joe and keeps
carping on that all the time"; cf. __secure_computing(), CONFIG_INTEGRITY,
etc.