On Tue, Dec 3, 2024 at 7:05 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > Right now fs/exec.c invokes expand_downwards(), an otherwise internal > implementation detail of the VMA logic in order to ensure that an arg page > can be obtained by get_user_pages_remote(). > > In order to be able to move the stack expansion logic into mm/vma.c in > order to make it available to userland testing we need to find an > alternative approach here. > > We do so by providing the mmap_read_lock_maybe_expand() function which also > helpfully documents what get_arg_page() is doing here and adds an > additional check against VM_GROWSDOWN to make explicit that the stack > expansion logic is only invoked when the VMA is indeed a downward-growing > stack. > > This allows expand_downwards() to become a static function. > > Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be > currently user-visible in any way, that is place within an rmap or VMA > tree. It must be a newly allocated VMA. > > This is the case when exec invokes this function. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > --- > fs/exec.c | 14 +++--------- > include/linux/mm.h | 5 ++--- > mm/mmap.c | 54 +++++++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 58 insertions(+), 15 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 98cb7ba9983c..1e1f79c514de 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -205,18 +205,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, > /* > * Avoid relying on expanding the stack down in GUP (which > * does not work for STACK_GROWSUP anyway), and just do it > - * by hand ahead of time. > + * ahead of time. > */ > - if (write && pos < vma->vm_start) { > - mmap_write_lock(mm); > - ret = expand_downwards(vma, pos); > - if (unlikely(ret < 0)) { > - mmap_write_unlock(mm); > - return NULL; > - } > - mmap_write_downgrade(mm); > - } else > - mmap_read_lock(mm); > + if (!mmap_read_lock_maybe_expand(mm, vma, pos, write)) > + return NULL; [...] > +/* > + * Obtain a read lock on mm->mmap_lock, if the specified address is below the > + * start of the VMA, the intent is to perform a write, and it is a > + * downward-growing stack, then attempt to expand the stack to contain it. > + * > + * This function is intended only for obtaining an argument page from an ELF > + * image, and is almost certainly NOT what you want to use for any other > + * purpose. > + * > + * IMPORTANT - VMA fields are accessed without an mmap lock being held, so the > + * VMA referenced must not be linked in any user-visible tree, i.e. it must be a > + * new VMA being mapped. > + * > + * The function assumes that addr is either contained within the VMA or below > + * it, and makes no attempt to validate this value beyond that. > + * > + * Returns true if the read lock was obtained and a stack was perhaps expanded, > + * false if the stack expansion failed. > + * > + * On stack expansion the function temporarily acquires an mmap write lock > + * before downgrading it. > + */ > +bool mmap_read_lock_maybe_expand(struct mm_struct *mm, > + struct vm_area_struct *new_vma, > + unsigned long addr, bool write) > +{ > + if (!write || addr >= new_vma->vm_start) { > + mmap_read_lock(mm); > + return true; > + } > + > + if (!(new_vma->vm_flags & VM_GROWSDOWN)) > + return false; > + > + mmap_write_lock(mm); > + if (expand_downwards(new_vma, addr)) { > + mmap_write_unlock(mm); > + return false; > + } > + > + mmap_write_downgrade(mm); > + return true; > +} Random thought: For write==1, this looks a bit like lock_mm_and_find_vma(mm, addr, NULL), which needs similar stack expansion logic for handling userspace faults. But it's for a sufficiently different situation that maybe it makes sense to keep it like you did it, as a separate function...