Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

Waiman Long <longman@xxxxxxxxxx> · Tue, 18 Apr 2023 21:16:37 -0400

On 4/18/23 17:18, Andrew Morton wrote:
On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@xxxxxxxxxx> wrote:

One of the flags of mmap(2) is MAP_STACK to request a memory segment
suitable for a process or thread stack. The kernel currently ignores
this flag. Glibc uses MAP_STACK when mmapping a thread stack. However,
selinux has an execstack check in selinux_file_mprotect() which does
not allow a stack VMA to be made executable.

Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
with an adjacent anonymous VMA. With that merging, using mprotect(2)
to change a part of the merged anonymous VMA to make it executable may
fail. This can lead to sporadic failure of applications that need to
make those changes.
"Sporadic failure of applications" sounds quite serious.  Can you
provide more details?

The problem boils down to the fact that it is possible for user code to 
mmap a region of memory and then for the kernel to merge the VMA for 
that memory with the VMA for one of the application's thread stacks. 
This is causing random SEGVs in one of our large customer applications.

At a high level, this is what's happening:

 1) App runs creating lots of threads.
 2) It mmap's 256K pages of anonymous memory.
 3) It writes executable code to that memory.
 4) It calls mprotect() with PROT_EXEC on that memory so
    it can subsequently execute the code.

The above mprotect() will fail if the mmap'd region's VMA gets merged 
with the VMA for one of the thread stacks.  That's because the default 
RHEL SELinux policy is to not allow executable stacks.
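
For illustration, steps 2) to 4) above can be sketched in userspace C
roughly as follows. This is a minimal sketch, not the customer's code:
the helper name map_fill_and_protect() is hypothetical, the 0xc3 fill
bytes merely stand in for generated code (nothing is executed), and the
256 KiB size in the usage note is an assumption. On an affected system
the mprotect() call would be expected to fail only when the new mapping
happens to get merged with a thread stack VMA.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map anonymous memory, fill it, then ask for
 * PROT_EXEC. Returns 0 on success, -1 on failure with errno set.
 */
int map_fill_and_protect(size_t len)
{
	void *buf;
	int ret, saved_errno;

	assert(len > 0);

	/* Step 2: anonymous, private, read/write mapping. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Step 3: placeholder for writing generated code (never run). */
	memset(buf, 0xc3, len);

	/* Step 4: the call that may be denied if the VMA was merged
	 * with a thread stack VMA and the policy forbids execstack. */
	ret = mprotect(buf, len, PROT_READ | PROT_EXEC);
	saved_errno = errno;
	if (ret != 0)
		fprintf(stderr, "mprotect: %s\n", strerror(saved_errno));

	munmap(buf, len);
	errno = saved_errno;
	return ret;
}
```

Calling map_fill_and_protect(256 * 1024) from a heavily threaded
process follows the pattern described above; whether it fails depends
on where the kernel places the mapping relative to the thread stacks.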


Did you consider a -stable backport?  I'm unable to judge, because the
description of the userspace effects is so thin.

Yes, a stable backport can be considered.



One possible fix is to make sure that a stack VMA will not be merged
with a non-stack anonymous VMA. That requires a vm flag that can be
used to distinguish a stack VMA from a regular anonymous VMA. One
can add a new dummy vm flag for that purpose. However, there is only
1 bit left in the lower 32 bits of vm_flags. Another alternative is
to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
can certainly be used for this purpose. The downside is that it is a
slight change in existing behavior.

Making a stack VMA growable by default certainly fits the need of a
process or thread stack. This patch now maps MAP_STACK to VM_STACK to
prevent unwanted merging with adjacent non-stack VMAs and make the VMA
more suitable for being used as a stack.

...

--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
  	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
  	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
  	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
+	       _calc_vm_trans(flags, MAP_STACK,	     VM_STACK     ) |
  	       arch_calc_vm_flag_bits(flags);
  }
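
As a rough userspace illustration of what the added line does:
_calc_vm_trans() moves a single flag bit from its MAP_* position to its
VM_* position. The sketch below re-creates that idea as a plain
function. demo_calc_vm_trans() and the DEMO_* constants are hypothetical
names, and the numeric values are assumed to match x86 (MAP_STACK
0x20000, VM_GROWSDOWN 0x100) purely for illustration.

```c
#include <assert.h>

/* Assumed x86 values, for illustration only. */
#define DEMO_MAP_GROWSDOWN	0x00100UL
#define DEMO_MAP_STACK		0x20000UL
#define DEMO_VM_GROWSDOWN	0x00100UL
#define DEMO_VM_STACK		DEMO_VM_GROWSDOWN

/*
 * Translate one flag bit: if 'from' is set in 'flags', return 'to',
 * otherwise return 0. Both 'from' and 'to' must be single bits (or 0).
 */
unsigned long demo_calc_vm_trans(unsigned long flags,
				 unsigned long from, unsigned long to)
{
	/* Precondition: at most one bit set in each mask. */
	assert((from & (from - 1)) == 0 && (to & (to - 1)) == 0);

	if (!from || !to)
		return 0;
	if (from <= to)
		return (flags & from) * (to / from);	/* shift up */
	return (flags & from) / (from / to);		/* shift down */
}
```

With these assumed values, a MAP_STACK request would end up carrying
the same VM flag bit as a MAP_GROWSDOWN request, which is the behavior
change discussed above.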
The mmap(2) manpage says

   This flag is currently a no-op on Linux.  However, by employing
   this flag, applications can ensure that they transparently obtain
   support if the flag is implemented in the future.  Thus, it is used
   in the glibc threading implementation to allow for the fact that some
   architectures may (later) require special treatment for stack
   allocations.  A further reason to employ this flag is portability:
   MAP_STACK exists (and has an effect) on some other systems (e.g.,
   some of the BSDs).

so please propose an update for this?

OK, will do.

Thanks,
Longman