Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes: > On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: >> >> For exec all I care about are user space threads. So it appears the >> freezer infrastructure adds very little. > > Yeah. 99% of the freezer stuff is for just adding the magic notations > for kernel threads, and for any user space threads it seems the wrong > interface. > >> Now to see if I can find another way to divert a task into a slow path >> as it wakes up, so I don't need to manually wrap all of the sleeping >> calls. Something that plays nice with the scheduler. > > The thing is, how many places really care? > > Because I think there are like five of them. And they are all marked > by taking cred_guard_mutex, or the file table lock. > > So it seems really excessive to then create some whole new "let's > serialize every thread", when you actually don't care about any of it, > except for a couple of very very special cases. > > If you care about "thread count stable", you care about exit() and > clone(). You don't care about threads that are happily running - or > sleeping - doing their own thing. > > So trying to catch those threads and freezing them really feels like > entirely the wrong interface. For me stopping the other threads has been a conceptually simple direction that needs exploration even if it doesn't work out. On that note I have something that is just a little longer than the patch I posted that doesn't use the freezer, and leans on the fact that TASK_INTERRUPTBLE and TASK_KILLABLE can occassionaly expect a spurious wake up. Which means (with the right locking) those two states can be transformed into TASK_WAKEKILL to keep sleeping processes sleeping. After that I only need one small test in get_signal to catch the unlikely case of processes running in userspace. I have not figured out TASK_STOPPED or TASK_TRACED yet as they do not handle spurious wake ups. So I think I can stop and keep stopped the other threads in the group without too much code or complexity for other parts of the kernel (assuming spurious wakes ups are allowed). Even with the other threads stopped the code does not simplify as much as I had hoped. The code still has to deal with the existence of the other threads. So while races don't happen and thus the code is easier to understand and make correct the actual work of walking the threads making a count etc still remains. The real sticky widget right now is the unshare_files call. When Al moved the call in fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()") it introduced a regression that causes posix file locks to be improperly removed during exec. Unfortunately no one noticed for about a decade. What caused the regression is that unshare_files is a noop if the usage count is 1. Which means that after de_thread unshare_files only does something if our file table is shared by another process. However when unshare_files is called before de_thread in a multi-threaded process unshare_files will always perform work. For the locks that set fl_owner to current->files unsharing current->files when unnecessary already causes problems, as we now have an unnecessary change of owner during exec. After the unnecessary change in owner the existing locks are eventually removed at the end of bprm_execve: bprm_execve() if (displaced) put_files_struct(displaced) filp_close() locks_remove_posix() /* Which removes the posix locks */ After 12 years moving unshare_files back where it used to be is also problematic, as with the addition of execveat we grew a clear dependency other threads not messing with our open files: /* * Record that a name derived from an O_CLOEXEC fd will be * inaccessible after exec. Relies on having exclusive access to * current->files (due to unshare_files above). */ if (bprm->fdpath && close_on_exec(fd, rcu_dereference_raw(current->files->fdt))) bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE; I have made a cursory look and I don't expect any other issues as the only code in exec that messes with file descriptors is in fs/exec.c Everything else simply uses "struct file *". Testing to see if the file descriptor is close_on_exec is just there to prevent trying to run a script that through a close_on_exec file descriptor, that is part of the path to the script. So it is a quality of implementation issue so the code does not fail later if userspace tries something silly. So I think we can safely just update the comment and say if userspace messes with the file descriptor they pass to exec during exec userspace can keep both pieces, as it is a self inflicted problem. All of the other issues I see with reverting the location where the file table is unshared also look like userspace shooting themselves in the foot and not a problem with correctness of kernel code. Which is a long way of saying that I have just convinced myself to effectively revert fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()") A memory allocation failure after the point of no return is the only reason to avoid doing this, and since the unshare is expected to be a noop most of the time that doesn't look like a downside either. Assuming I don't find anything else I think I will kick down the road a bit the problem of stopping other threads during exec. I can handle unshare_files and seccomp without it. There still might be something I can not solve another way, but until I find it I will put this on the back burner. Eric