On Mon, Aug 30, 2010 at 03:06:16AM -0700, Roland McGrath wrote: > And I say, if your userland process could really allocate another 200GB, > then more power to you, you can do it with an exec too. If you could do > the same with a userland stack allocation, and spend all that time in > strlen calls and then memcpy, you can do it inside execve too. If it > takes days, that's what you asked for, and it's your process. It just > ought to be every bit (or near enough) as preemptible and interruptible > as that normal userland activity ought to be. This makes sense to me. However, introducing a new preemption point may violate assumptions under which the code was written and reviewed in the past. In the worst case, we'd introduce/expose race conditions allowing for privilege escalation. > So, perhaps we want this (count already has a cond_resched in its loop): Good point re: count() already having this (I think it did not in 2.2). > @@ -400,6 +403,10 @@ static int copy_strings(int argc, const > int len; > unsigned long pos; > > + if (signal_pending(current)) > + return -ERESTARTNOINTR; > + cond_resched(); So, in current kernels, you're making it possible for more kinds of things to change after prepare_binprm() but before search_binary_handler(). We'd need to check for possible implications of this. I must admit I am not familiar with what additional kinds of things may change when execution is preempted. This made a significant difference in some much older kernels (many years ago), but now that the kernel makes a lot less use of locking most things may be changed by another CPU even without preemption. So does anyone have a list of what additional risks we're exposed to, if any, when we allow preemption in current kernels? > Has someone reported this BUG_ON failure mode with a reproducer? 64bit_dos.c was supposed to be the reproducer, and I managed to get it to work (as I've documented in another message earlier today). The prerequisites appeared to be (some of these might be specific to my tests, though): - 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION); - 64-bit build of 64bit_dos.c; - 32-bit build of the target program; - no dynamic linking in the target program; - "ulimit -s unlimited" before running the reproducer program; - over 3 GB of RAM in the system. > [...] Rather than better enabling OOM killing, I think what really > makes sense is for the nascent mm to be marked such that allocations in > it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just > fail with ENOMEM before it resorts to the OOM killer (or perhaps even to > very aggressive pageout). That should percolate back to the execve just > failing with ENOMEM, which is nicer than OOM kill even if the OOM killer > actually does pick exactly and only the right target. I agree. Thanks, Alexander -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html