find_extend_vma assumes the caller holds mmap_sem as a reader (explained in expand_downwards()). The path when we are extending the stack VMA to accomodate argv[] pointers happens without the lock. I was not able to cause an mm_struct corruption but BUG_ON(!rwsem_is_locked(&mm->mmap_sem)) in find_extend_vma could be triggered as # <bigfile xargs echo xargs: echo: terminated by signal 11 (bigfile needs to have more than RLIMIT_STACK / sizeof(char *) rows) Other accesses to mm_struct in exec path are protected by mmap_sem, so conservatively, protect also this one. Besides that, explain why we omit mm_struct.arg_lock in the exec(2) path. Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxx> Signed-off-by: Michal Koutný <mkoutny@xxxxxxxx> --- When I was attempting to reduce usage of mmap_sem I came across this unprotected access and increased number of its holders :-/ I'm not sure whether there is a real concurrent writer at this early stages (I considered khugepaged especially as setup_arg_pages invokes khugepaged_enter_vma_merge but we're lucky because khugepaged skips it because of VM_STACK_INCOMPLETE_SETUP). A nicer approach would perhaps be to do all this exec setup when the mm_struct is still not exposed via current->mm (and hence no need to synchronize via mmap_sem). But I didn't look enough into binfmt specific whether it is even doable and worth it. So I'm sending this for a discussion. fs/binfmt_elf.c | 10 +++++++++- fs/exec.c | 3 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 8264b468f283..48e169760a9c 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -299,7 +299,11 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, * Grow the stack manually; some architectures have a limit on how * far ahead a user-space access may be in order to grow the stack. */ + if (down_read_killable(¤t->mm->mmap_sem)) + return -EINTR; vma = find_extend_vma(current->mm, bprm->p); + up_read(¤t->mm->mmap_sem); + if (!vma) return -EFAULT; @@ -1123,11 +1127,15 @@ static int load_elf_binary(struct linux_binprm *bprm) goto out; #endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */ + /* + * Don't take mm->arg_lock. The concurrent change might happen only + * from prctl_set_mm but after de_thread we are certainly alone here. + */ retval = create_elf_tables(bprm, &loc->elf_ex, load_addr, interp_load_addr); if (retval < 0) goto out; - /* N.B. passed_fileno might not be initialized? */ + current->mm->end_code = end_code; current->mm->start_code = start_code; current->mm->start_data = start_data; diff --git a/fs/exec.c b/fs/exec.c index 89a500bb897a..d5b55c92019a 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -212,7 +212,8 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, /* * We are doing an exec(). 'current' is the process - * doing the exec and bprm->mm is the new process's mm. + * doing the exec and bprm->mm is the new process's mm that is not + * shared yet, so no synchronization on mmap_sem. */ ret = get_user_pages_remote(current, bprm->mm, pos, 1, gup_flags, &page, NULL, NULL); -- 2.21.0