Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer

Solar Designer <solar@xxxxxxxxxxxx> · Mon, 30 Aug 2010 23:48:06 +0400

On Mon, Aug 30, 2010 at 03:06:16AM -0700, Roland McGrath wrote:
> And I say, if your userland process could really allocate another 200GB,
> then more power to you, you can do it with an exec too.  If you could do
> the same with a userland stack allocation, and spend all that time in
> strlen calls and then memcpy, you can do it inside execve too.  If it
> takes days, that's what you asked for, and it's your process.  It just
> ought to be every bit (or near enough) as preemptible and interruptible
> as that normal userland activity ought to be.

This makes sense to me.  However, introducing a new preemption point
may violate assumptions under which the code was written and reviewed
in the past.  In the worst case, we'd introduce/expose race conditions
allowing for privilege escalation.

> So, perhaps we want this (count already has a cond_resched in its loop):

Good point re: count() already having this (I think it did not in 2.2).

> @@ -400,6 +403,10 @@ static int copy_strings(int argc, const 
>  		int len;
>  		unsigned long pos;
>  
> +		if (signal_pending(current))
> +			return -ERESTARTNOINTR;
> +		cond_resched();

So, in current kernels, you're making it possible for more kinds of
things to change after prepare_binprm() but before
search_binary_handler().  We'd need to check for possible implications
of this.

I must admit I am not familiar with what additional kinds of things may
change when execution is preempted.  This made a significant difference
in some much older kernels (many years ago), but now that the kernel
makes a lot less use of locking most things may be changed by another
CPU even without preemption.  So does anyone have a list of what
additional risks we're exposed to, if any, when we allow preemption in
current kernels?

> Has someone reported this BUG_ON failure mode with a reproducer?

64bit_dos.c was supposed to be the reproducer, and I managed to get it
to work (as I've documented in another message earlier today).  The
prerequisites appeared to be (some of these might be specific to my
tests, though):

- 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION);
- 64-bit build of 64bit_dos.c;
- 32-bit build of the target program;
- no dynamic linking in the target program;
- "ulimit -s unlimited" before running the reproducer program;
- over 3 GB of RAM in the system.

> [...]  Rather than better enabling OOM killing, I think what really
> makes sense is for the nascent mm to be marked such that allocations in
> it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just
> fail with ENOMEM before it resorts to the OOM killer (or perhaps even to
> very aggressive pageout).  That should percolate back to the execve just
> failing with ENOMEM, which is nicer than OOM kill even if the OOM killer
> actually does pick exactly and only the right target.

I agree.

Thanks,

Alexander
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html