On Sun, 8 May 2011, Linus Torvalds wrote:

> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
> <mikulas@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> > up-growing stack). lvm2 calculates the number of used pages when locking
> > and when unlocking and reports an internal error if the numbers mismatch.
>
> This patch won't apply on current kernels (including stable) because
> of commit a1fde08c74e9 that changed the test of "pages" to instead
> just test "flags & FOLL_MLOCK".
>
> That should be trivial to fix up.
>
> However, I really don't much like this complex test:
>
> >  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
> >  {
> > -	return (vma->vm_flags & VM_GROWSDOWN) &&
> > +	return ((vma->vm_flags & VM_GROWSDOWN) &&
> >  		(vma->vm_start == addr) &&
> > -		!vma_stack_continue(vma->vm_prev, addr);
> > +		!vma_stack_continue(vma->vm_prev, addr)) ||
> > +		((vma->vm_flags & VM_GROWSUP) &&
> > +		(vma->vm_end == addr + PAGE_SIZE) &&
> > +		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
> >  }
>
> in that format. It gets really hard to read, and I think you'd be
> better off writing it as two helper functions (or macros) for the two
> cases, and then have
>
>   static inline int stack_guard_page(struct vm_area_struct *vma,
>                                      unsigned long addr)
>   {
>       return stack_guard_page_growsdown(vma, addr) ||
>              stack_guard_page_growsup(vma, addr);
>   }
>
> I'd also like to verify that it doesn't actually generate any extra
> code for the common case (iirc VM_GROWSUP is 0 for the architectures
> that don't need it, and so the compiler shouldn't generate any extra
> code, but I'd like that mentioned and verified explicitly).
>
> Hmm?
>
> Other than that it looks ok to me.
>
> That said, could we please fix LVM to not do that crazy sh*t in the
> first place? The STACK_GROWSUP case is never going to have a lot of
> testing, this is just sad.

LVM reads the process maps from /proc/self/maps and locks them with
mlock.

Why doesn't it use mlockall()? Because glibc maps all locales into the
process: glibc packs every locale into a single 100MB file and maps that
file into every process, even if the process uses just one locale. So
when LVM used mlockall(), it consumed more than 100MB of memory, which
caused out-of-memory problems in system installers.

So an alternate way of locking was added to LVM --- read all the maps
and lock them, except for the glibc locale file.

The real fix would be to change glibc so that it doesn't map 100MB into
every process.

> 		Linus
>

This is the updated patch. I tested it on x86-64 and it doesn't change
the generated code.

Mikulas

---

Don't lock guardpage if the stack is growing up

The Linux kernel excludes the guard page when performing mlock on a VMA
with a down-growing stack. However, some architectures have an up-growing
stack, and the guard page should be excluded from locking in that case
too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
an up-growing stack). lvm2 calculates the number of used pages when
locking and when unlocking and reports an internal error if the numbers
mismatch.
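(Aside, for context on the LVM behaviour described above: a minimal
hand-written userspace sketch of that locking scheme -- read
/proc/self/maps, mlock() each range, skip the glibc locale file. This is
not LVM's actual code; the "locale-archive" substring match and the
absence of error handling are simplifications for illustration only.)

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Lock every mapping of the current process except the locale archive. */
static int lock_own_maps(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[1024];
	unsigned long start, end;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* each line: "start-end perms offset dev inode path" */
		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
			continue;
		/* skip the big glibc locale file instead of pinning it */
		if (strstr(line, "locale-archive"))
			continue;
		(void)mlock((void *)start, end - start); /* errors ignored here */
	}
	fclose(f);
	return 0;
}

int main(void)
{
	return lock_own_maps() ? 1 : 0;
}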
Signed-off-by: Mikulas Patocka <mikulas@xxxxxxxxxxxxxxxxxxxxxxxx>

---
 include/linux/mm.h |   10 +++++++++-
 mm/memory.c        |   21 ++++++++++++++++++---
 2 files changed, 27 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2011-05-09 12:33:50.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2011-05-09 12:42:05.000000000 +0200
@@ -1011,11 +1011,19 @@ int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_stack_growsdown_continue(struct vm_area_struct *vma,
+					       unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-05-09 12:33:51.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-05-09 12:41:38.000000000 +0200
@@ -1410,11 +1410,21 @@ no_page_table:
 	return page;
 }
 
-static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
+static inline int stack_guard_page_growsdown(struct vm_area_struct *vma,
+					     unsigned long addr)
 {
 	return (vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_growsdown_continue(vma->vm_prev, addr);
+}
+
+
+static inline int stack_guard_page_growsup(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE);
 }
 
 /**
@@ -1554,13 +1564,18 @@ int __get_user_pages(struct task_struct
 		/*
 		 * For mlock, just skip the stack guard page.
 		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
+		if ((gup_flags & FOLL_MLOCK) &&
+		    stack_guard_page_growsdown(vma, start))
 			goto next_page;
 
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
+			if ((gup_flags & FOLL_MLOCK) &&
+			    stack_guard_page_growsup(vma, start))
+				goto next_page;
+
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.
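A note on the "doesn't change generated code" claim: on x86-64 (and any
other architecture without an up-growing stack) VM_GROWSUP is defined as
0, so the new growsup test is a compile-time constant 0 and the compiler
drops it entirely. Below is a standalone userspace sketch of that
folding, not kernel code: struct fake_vma is a stand-in, the flag and
PAGE_SIZE values are taken as assumptions matching mm.h of this era, and
the vma_stack_*_continue() checks are omitted for brevity. Building it
with gcc -O2 -S, with and without the growsup term, should produce the
same assembly.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define VM_GROWSDOWN	0x00000100UL
#define VM_GROWSUP	0x00000000UL	/* 0 when the arch has no up-growing stack */

struct fake_vma {
	unsigned long vm_start, vm_end, vm_flags;
};

static inline int guard_growsdown(const struct fake_vma *vma, unsigned long addr)
{
	return (vma->vm_flags & VM_GROWSDOWN) && vma->vm_start == addr;
}

static inline int guard_growsup(const struct fake_vma *vma, unsigned long addr)
{
	/* VM_GROWSUP == 0 folds this whole expression to a constant 0 */
	return (vma->vm_flags & VM_GROWSUP) && vma->vm_end == addr + PAGE_SIZE;
}

int is_stack_guard_page(const struct fake_vma *vma, unsigned long addr)
{
	/* "|| constant 0" adds no instructions */
	return guard_growsdown(vma, addr) || guard_growsup(vma, addr);
}

int main(void)
{
	struct fake_vma vma = { 0x1000, 0x2000, VM_GROWSDOWN };

	printf("%d\n", is_stack_guard_page(&vma, 0x1000)); /* prints 1 */
	return 0;
}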