Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
>> Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
>>
>> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
>> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>> >>
>> >> > And I think our horrible "kernel threads return to user space when
>> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
>> >> > mostly of the historical horror sort.
>> >>
>> >> How would you prefer to handle that, then? Separate magical path from
>> >> kernel_execve() to switch to userland? We used to have something of
>> >> that sort, and that had been a real horror...
>> >>
>> >> As it is, it's "kernel thread is spawned at the point similar to
>> >> ret_from_fork(), runs the payload (which almost never returns) and
>> >> then proceeds out to userland, same way fork(2) would've done."
>> >> That way kernel_execve() doesn't have to do anything magical.
>> >>
>> >> Al, digging through the old notes and current call graph...
>> >
>> > FWIW, the major assumption back then had been that get_signal(),
>> > signal_delivered() and all associated machinery (including coredumps)
>> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
>> >
>> > And "has complete registers on stack" is only a part of that;
>> > there was other fun stuff in the area ;-/ Do we want coredumps for
>> > those, and if we do, will the de_thread stuff work there?
>>
>> Do we want coredumps from processes that use io_uring? yes
>> Exactly what we want from io_uring threads is less clear. We can't
>> really give much that is meaningful beyond the thread ids of the
>> io_uring threads.
>>
>> What problems are you seeing beyond the missing registers on the
>> stack for kernel threads?
>>
>> I don't immediately see the connection between coredumps and de_thread.
>>
>> The function de_thread arranges for fatal_signal_pending to be true,
>> and that should work just fine for io_uring threads. The io_uring
>> threads process the fatal signal with get_signal and then proceed to
>> exit, eventually calling do_exit.
>
> I would like to see the testing in cases when the io-uring thread is
> the one getting hit by initial signal and when it's the normal one
> with associated io-uring ones. The thread-collecting logics at least
> used to depend upon fairly subtle assumptions, and "kernel threads
> obviously can't show up as candidates" used to narrow the analysis
> down...
>
> In any case, WTF would we allow reads or writes to *any* registers of
> such threads? It's not as simple as "just return zeroes", BTW - the
> values allowed in special registers might have non-trivial constraints
> on them. The same goes for coredump - we don't _have_ registers to
> dump for those, period.
>
> Looks like the first things to do would be
> 	* prohibit ptrace accessing any regsets of worker threads
> 	* make coredump skip all register notes for those

Skipping register notes is fine.

Prohibiting ptrace access to any regsets of worker threads is
interesting. I think that was tried and shown to confuse gdb. So the
conclusion was just to provide a fake set of registers. Which appears
to work up to the point of dealing with architectures that have their
magic caller-saved optimization (like alpha and m68k), and no check
that all of the registers were saved when accessed.

Adding a dummy switch stack frame for the kernel threads on those
architectures looks like a good/cheap solution at first glance.
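For illustration only, a rough sketch of the "fake set of registers"
idea (the helper name is made up, it assumes the PF_IO_WORKER flag is
how these threads are identified, and the real hook point would be the
regset get / fill_thread_core_info() paths rather than a standalone
function):

/* Illustrative sketch only -- not actual kernel code.  Assumes
 * PF_IO_WORKER marks the io_uring worker threads; the helper name
 * is hypothetical. */
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/string.h>
#include <linux/errno.h>

static int fake_io_worker_regs(struct task_struct *t, struct pt_regs *regs)
{
	if (!(t->flags & PF_IO_WORKER))
		return -EINVAL;	/* ordinary threads keep the normal path */

	/* Hand out all-zero "user" registers instead of whatever happens
	 * to be on the worker's kernel stack.  As noted above, special
	 * registers with architectural constraints (PSW, PSTATE, ...)
	 * would need a valid value here rather than plain zeroes. */
	memset(regs, 0, sizeof(*regs));
	return 0;
}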
> Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
> ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
> current->ptrace in the caller of the latter. So fork and exec side
> has ptrace_event() crap limited to real syscalls.

That is where I thought we were. Thanks for confirming that.

> It's seccomp[1] and exit-related stuff that are messy...
>
> [1] "never trust somebody who introduces himself as Honest Joe and keeps
> carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY,
> etc.
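For reference, the CLONE_UNTRACED check being described above has
roughly this shape (paraphrased from memory from the inline logic in
kernel_clone() in kernel/fork.c, ~v5.13, wrapped in a made-up helper so
it stands alone; treat it as an illustration, not a verbatim quote):

/* Illustration only: a hypothetical helper wrapping roughly the
 * ptrace event selection that kernel_clone() does inline.
 * kernel_thread() always passes CLONE_UNTRACED, so for io_uring
 * workers this yields 0 and no ptrace event is ever reported. */
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/signal.h>

static int clone_ptrace_event(u64 clone_flags, int exit_signal)
{
	int trace = 0;

	if (!(clone_flags & CLONE_UNTRACED)) {
		if (clone_flags & CLONE_VFORK)
			trace = PTRACE_EVENT_VFORK;
		else if (exit_signal != SIGCHLD)
			trace = PTRACE_EVENT_CLONE;
		else
			trace = PTRACE_EVENT_FORK;

		/* only report events the tracer actually asked for */
		if (!ptrace_event_enabled(current, trace))
			trace = 0;
	}

	return trace;
}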