On Sun 26-11-17 17:09:32, Mikael Pettersson wrote: > The `vm.max_map_count' sysctl limit is IMO useless and confusing, so > this patch disables it. > > - Old ELF had a limit of 64K segments, making core dumps from processes > with more mappings than that problematic, but that was fixed in March > 2010 ("elf coredump: add extended numbering support"). > > - There are no internal data structures sized by this limit, making it > entirely artificial. each mapping has its vma structure and that in turn can be tracked by other data structures so this is not entirely true. > - When viewed as a limit on memory consumption, it is ineffective since > the number of mappings does not correspond directly to the amount of > memory consumed, since each mapping is variable-length. > > - Reaching the limit causes various memory management system calls to > fail with ENOMEM, which is a lie. Combined with the unpredictability > of the number of mappings in a process, especially when non-trivial > memory management or heavy file mapping is used, it can be difficult > to reproduce these events and debug them. It's also confusing to get > ENOMEM when you know you have lots of free RAM. > > This limit was apparently introduced in the 2.1.80 kernel (first as a > compile-time constant, later changed to a sysctl), but I haven't been > able to find any description for it in Git or the LKML archives, so I > don't know what the original motivation was. > > I've kept the kernel tunable to not break the API towards user-space, > but it's a no-op now. Also the distinction between split_vma() and > __split_vma() disappears, so they are merged. Could you be more explicit about _why_ we need to remove this tunable? I am not saying I disagree, the removal simplifies the code but I do not really see any justification here. > Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU > kernel to make sure NOMMU compiles and links cleanly. > > Signed-off-by: Mikael Pettersson <mikpelinux@xxxxxxxxx> > --- > Documentation/sysctl/vm.txt | 17 +------------- > Documentation/vm/ksm.txt | 4 ---- > Documentation/vm/remap_file_pages.txt | 4 ---- > fs/binfmt_elf.c | 4 ---- > include/linux/mm.h | 23 ------------------- > kernel/sysctl.c | 3 +++ > mm/madvise.c | 12 ++-------- > mm/mmap.c | 42 ++++++----------------------------- > mm/mremap.c | 7 ------ > mm/nommu.c | 3 --- > mm/util.c | 1 - > 11 files changed, 13 insertions(+), 107 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index b920423f88cb..0fcb511d07e6 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm: > - laptop_mode > - legacy_va_layout > - lowmem_reserve_ratio > -- max_map_count > +- max_map_count (unused, kept for backwards compatibility) > - memory_failure_early_kill > - memory_failure_recovery > - min_free_kbytes > @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%). > > ============================================================== > > -max_map_count: > - > -This file contains the maximum number of memory map areas a process > -may have. Memory map areas are used as a side-effect of calling > -malloc, directly by mmap, mprotect, and madvise, and also when loading > -shared libraries. > - > -While most applications need less than a thousand maps, certain > -programs, particularly malloc debuggers, may consume lots of them, > -e.g., up to one or two maps per allocation. > - > -The default value is 65536. > - > -============================================================= > - > memory_failure_early_kill: > > Control how to kill processes when uncorrected memory error (typically > diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt > index 6686bd267dc9..4a917f88cb11 100644 > --- a/Documentation/vm/ksm.txt > +++ b/Documentation/vm/ksm.txt > @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if the range > cannot contain any pages which KSM could actually merge; even if > MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. > > -If a region of memory must be split into at least one new MADV_MERGEABLE > -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process > -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt). > - > Like other madvise calls, they are intended for use on mapped areas of > the user address space: they will report ENOMEM if the specified range > includes unmapped gaps (though working on the intervening mapped areas), > diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt > index f609142f406a..85985a89f05d 100644 > --- a/Documentation/vm/remap_file_pages.txt > +++ b/Documentation/vm/remap_file_pages.txt > @@ -21,7 +21,3 @@ systems are widely available. > The syscall is deprecated and replaced it with an emulation now. The > emulation creates new VMAs instead of nonlinear mappings. It's going to > work slower for rare users of remap_file_pages() but ABI is preserved. > - > -One side effect of emulation (apart from performance) is that user can hit > -vm.max_map_count limit more easily due to additional VMAs. See comment for > -DEFAULT_MAX_MAP_COUNT for more details on the limit. > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c > index 83732fef510d..8e870b6e4ad9 100644 > --- a/fs/binfmt_elf.c > +++ b/fs/binfmt_elf.c > @@ -2227,10 +2227,6 @@ static int elf_core_dump(struct coredump_params *cprm) > elf = kmalloc(sizeof(*elf), GFP_KERNEL); > if (!elf) > goto out; > - /* > - * The number of segs are recored into ELF header as 16bit value. > - * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here. > - */ > segs = current->mm->map_count; > segs += elf_core_extra_phdrs(); > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ee073146aaa7..cf545264eb8b 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -104,27 +104,6 @@ extern int mmap_rnd_compat_bits __read_mostly; > #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page))) > #endif > > -/* > - * Default maximum number of active map areas, this limits the number of vmas > - * per mm struct. Users can overwrite this number by sysctl but there is a > - * problem. > - * > - * When a program's coredump is generated as ELF format, a section is created > - * per a vma. In ELF, the number of sections is represented in unsigned short. > - * This means the number of sections should be smaller than 65535 at coredump. > - * Because the kernel adds some informative sections to a image of program at > - * generating coredump, we need some margin. The number of extra sections is > - * 1-3 now and depends on arch. We use "5" as safe margin, here. > - * > - * ELF extended numbering allows more than 65535 sections, so 16-bit bound is > - * not a hard limit any more. Although some userspace tools can be surprised by > - * that. > - */ > -#define MAPCOUNT_ELF_CORE_MARGIN (5) > -#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN) > - > -extern int sysctl_max_map_count; > - > extern unsigned long sysctl_user_reserve_kbytes; > extern unsigned long sysctl_admin_reserve_kbytes; > > @@ -2134,8 +2113,6 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *, > unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, > struct mempolicy *, struct vm_userfaultfd_ctx); > extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); > -extern int __split_vma(struct mm_struct *, struct vm_area_struct *, > - unsigned long addr, int new_below); > extern int split_vma(struct mm_struct *, struct vm_area_struct *, > unsigned long addr, int new_below); > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 557d46728577..caced68ff0d0 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -110,6 +110,9 @@ extern int pid_max_min, pid_max_max; > extern int percpu_pagelist_fraction; > extern int latencytop_enabled; > extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max; > +#ifdef CONFIG_MMU > +static int sysctl_max_map_count = 65530; /* obsolete, kept for backwards compatibility */ > +#endif > #ifndef CONFIG_MMU > extern int sysctl_nr_trim_pages; > #endif > diff --git a/mm/madvise.c b/mm/madvise.c > index 375cf32087e4..f63834f59ca7 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -147,11 +147,7 @@ static long madvise_behavior(struct vm_area_struct *vma, > *prev = vma; > > if (start != vma->vm_start) { > - if (unlikely(mm->map_count >= sysctl_max_map_count)) { > - error = -ENOMEM; > - goto out; > - } > - error = __split_vma(mm, vma, start, 1); > + error = split_vma(mm, vma, start, 1); > if (error) { > /* > * madvise() returns EAGAIN if kernel resources, such as > @@ -164,11 +160,7 @@ static long madvise_behavior(struct vm_area_struct *vma, > } > > if (end != vma->vm_end) { > - if (unlikely(mm->map_count >= sysctl_max_map_count)) { > - error = -ENOMEM; > - goto out; > - } > - error = __split_vma(mm, vma, end, 0); > + error = split_vma(mm, vma, end, 0); > if (error) { > /* > * madvise() returns EAGAIN if kernel resources, such as > diff --git a/mm/mmap.c b/mm/mmap.c > index 924839fac0e6..e821d9c4395d 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1354,10 +1354,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr, > if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) > return -EOVERFLOW; > > - /* Too many mappings? */ > - if (mm->map_count > sysctl_max_map_count) > - return -ENOMEM; > - > /* Obtain the address to map to. we verify (or select) it and ensure > * that it represents a valid section of the address space. > */ > @@ -2546,11 +2542,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma, > } > > /* > - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it > - * has already been checked or doesn't make sense to fail. > + * Split a vma into two pieces at address 'addr', a new vma is allocated > + * either for the first part or the tail. > */ > -int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, > - unsigned long addr, int new_below) > +int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, > + unsigned long addr, int new_below) > { > struct vm_area_struct *new; > int err; > @@ -2612,19 +2608,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, > return err; > } > > -/* > - * Split a vma into two pieces at address 'addr', a new vma is allocated > - * either for the first part or the tail. > - */ > -int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, > - unsigned long addr, int new_below) > -{ > - if (mm->map_count >= sysctl_max_map_count) > - return -ENOMEM; > - > - return __split_vma(mm, vma, addr, new_below); > -} > - > /* Munmap is split into 2 main parts -- this part which finds > * what needs doing, and the areas themselves, which do the > * work. This now handles partial unmappings. > @@ -2665,15 +2648,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, > if (start > vma->vm_start) { > int error; > > - /* > - * Make sure that map_count on return from munmap() will > - * not exceed its limit; but let map_count go just above > - * its limit temporarily, to help free resources as expected. > - */ > - if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) > - return -ENOMEM; > - > - error = __split_vma(mm, vma, start, 0); > + error = split_vma(mm, vma, start, 0); > if (error) > return error; > prev = vma; > @@ -2682,7 +2657,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, > /* Does it split the last one? */ > last = find_vma(mm, end); > if (last && end > last->vm_start) { > - int error = __split_vma(mm, last, end, 1); > + int error = split_vma(mm, last, end, 1); > if (error) > return error; > } > @@ -2694,7 +2669,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, > * will remain splitted, but userland will get a > * highly unexpected error anyway. This is no > * different than the case where the first of the two > - * __split_vma fails, but we don't undo the first > + * split_vma fails, but we don't undo the first > * split, despite we could. This is unlikely enough > * failure that it's not worth optimizing it for. > */ > @@ -2915,9 +2890,6 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long > if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT)) > return -ENOMEM; > > - if (mm->map_count > sysctl_max_map_count) > - return -ENOMEM; > - > if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT)) > return -ENOMEM; > > diff --git a/mm/mremap.c b/mm/mremap.c > index 049470aa1e3e..5544dd3e6e10 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -278,13 +278,6 @@ static unsigned long move_vma(struct vm_area_struct *vma, > bool need_rmap_locks; > > /* > - * We'd prefer to avoid failure later on in do_munmap: > - * which may split one vma into three before unmapping. > - */ > - if (mm->map_count >= sysctl_max_map_count - 3) > - return -ENOMEM; > - > - /* > * Advise KSM to break any KSM pages in the area to be moved: > * it would be confusing if they were to turn up at the new > * location, where they happen to coincide with different KSM > diff --git a/mm/nommu.c b/mm/nommu.c > index 17c00d93de2e..0f6d37be4797 100644 > --- a/mm/nommu.c > +++ b/mm/nommu.c > @@ -1487,9 +1487,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, > if (vma->vm_file) > return -ENOMEM; > > - if (mm->map_count >= sysctl_max_map_count) > - return -ENOMEM; > - > region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL); > if (!region) > return -ENOMEM; > diff --git a/mm/util.c b/mm/util.c > index 34e57fae959d..7e757686f186 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -516,7 +516,6 @@ EXPORT_SYMBOL_GPL(__page_mapcount); > int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; > int sysctl_overcommit_ratio __read_mostly = 50; > unsigned long sysctl_overcommit_kbytes __read_mostly; > -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; > unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */ > unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */ > > -- > 2.13.6 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> -- Michal Hocko SUSE Labs