> This makes sense to me. However, introducing a new preemption point > may violate assumptions under which the code was written and reviewed > in the past. In the worst case, we'd introduce/expose race conditions > allowing for privilege escalation. This can't be so. There are already many possibilities for preemption in the get_user_pages code paths (called from get_arg_page). > > So, perhaps we want this (count already has a cond_resched in its loop): > > Good point re: count() already having this (I think it did not in 2.2). Indeed, this too is a clear indication that preemption here is already safe. > > + if (signal_pending(current)) > > + return -ERESTARTNOINTR; > > + cond_resched(); > > So, in current kernels, you're making it possible for more kinds of > things to change after prepare_binprm() but before > search_binary_handler(). We'd need to check for possible implications > of this. What "change"? Preemption is already possible, that's nothing new. The only difference is that we might notice TIF_SIGPENDING having been set, and bail out either before or after prepare_binprm, and so never call search_binary_handler at all. The user-visible effect of this could be that a process taking many signals way too fast could get stuck permanently running its signal handlers and never actually really attempting the execve (but perhaps doing some mm setup and arg-copying work and then throwing it away, repeatedly). Before, the speed of the signals might have been such that the process could get into the execve syscall, so it might happen to complete when now it would be more likely to handle another signal before doing the exec instead. That, and debuggers might ever see -ERESTARTNOHAND status in syscall tracing for execve (as they already might for many other syscalls, so they should be fine). Note that get_user_pages() already fails for SIGKILL. So a more conservative change would be just fatal_signal_pending(current) checks. That lets only SIGKILL short-circuit the execve to kill the process, without just letting normal signals interrupt execve like they would interrupt any other system call. That should have exactly no user-visible effects (aside from fairer scheduling in CONFIG_PREEMPT=n kernels, and better responsiveness to SIGKILL in all kernels). But I think it's fine and proper to let all signals interrupt execve normally. > additional risks we're exposed to, if any, when we allow preemption in > current kernels? That's not so much the point, since we already have plenty of places in this code path that have voluntary preemption points. We're just adding more, which improves interactive scheduling performance for the rest of the system when execve takes a long time copying the strings. > - 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION); > - 64-bit build of 64bit_dos.c; > - 32-bit build of the target program; > - no dynamic linking in the target program; > - "ulimit -s unlimited" before running the reproducer program; > - over 3 GB of RAM in the system. I reproduced it too. The lack of dynamic linking is not really a requirement to get it to hit, though whether it fails in the kernel can vary with the stack randomization. (Sometimes it will instead be the userland dynamic linker dying with a message that mmap failed.) I think the right fix for that case is this: diff --git a/fs/exec.c b/fs/exec.c index 2d94552..0000000 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm #else stack_top = arch_align_stack(stack_top); stack_top = PAGE_ALIGN(stack_top); + + if (unlikely(stack_top < mmap_min_addr) || + unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr)) + return -ENOMEM; + stack_shift = vma->vm_end - stack_top; bprm->p -= stack_shift; That is consistent with the existing CONFIG_STACK_GROWSUP code's check (though without the arbitrary 1GB limit it has). We won't ever get to shift_arg_pages if there isn't space to do the shift. Brad made several of the same points that I made earlier and in this message, so I concur with those. :-) Brad mentioned a possible BUG_ON if the executable's mappings clobber the stack mapping. With the fix above, I'm not aware how that could happen. I think it would just clobber part of the mapping with the MAP_FIXED mapping, meaning it shortens the stack vma from the bottom, or maybe splits it in the middle. Then at user-mode entry point time, the SP will point into some mapped text or data of the executable, and read whatever is there instead of argc and argv[], etc.--it won't be as graceful as an abortive execve, but it will just "safely" crash in some random but normal userland sort of way. Thanks, Roland -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html