On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> I sysctl for the mapcount can be increased, right? I also assume that
> those vmas will get merged after the post copy is done.

Assuming you enlarge the sysctl to the worst possible case, with a 64bit
address space you can end up with around a billion VMAs if you're
migrating 4T of RAM, you're unlucky and the address space gets
fragmented. The unswappable kernel memory overhead would be relatively
large (easily dozens of gigabytes of RAM in the vm_area_struct slab),
and each find_vma operation would need to walk ~40 steps across that
large vma rbtree. There's a reason the sysctl exists. Not to mention
that all those unnecessary vma mangling operations would have to run
with mmap_sem held for writing.

Not creating a ton of vmas, and instead enabling vma-less pte mangling
with a single large vma while only holding mmap_sem for reading during
all the pte mangling, is one of the primary design motivations for
userfaultfd.

> I understand that part but it sounds awfully one purpose thing to me.
> Are we going to add other MADVISE_RESET_$FOO to clear other flags just
> because we can race in this specific use case?

Those already exist: see for example MADV_NORMAL, which clears
VM_RAND_READ and VM_SEQ_READ after a MADV_SEQUENTIAL or MADV_RANDOM.
Or MADV_DOFORK after MADV_DONTFORK, and MADV_DODUMP after
MADV_DONTDUMP. Etc.

> But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> enable/disable thp. Doesn't that sound little bit too much for a single
> feature to you?

MADV_NOHUGEPAGE doesn't mean clearing the flag set with MADV_HUGEPAGE.
MADV_NOHUGEPAGE disables THP on the region if the global sysfs "enabled"
tune is set to "always". MADV_HUGEPAGE enables THP on the region if the
global sysfs "enabled" tune is set to "madvise". Both MADV_NOHUGEPAGE
and MADV_HUGEPAGE are needed to leverage the three-way "never",
"madvise", "always" setting of the global tune.
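The set/reset pairs above can be observed from user space in the VmFlags line of /proc/self/smaps, where "sr", "dd", "hg" and "nh" are the mnemonics for VM_SEQ_READ, VM_DONTDUMP, VM_HUGEPAGE and VM_NOHUGEPAGE. Below is a minimal Linux-only sketch, assuming Python 3.8+ for mmap.madvise(); the helpers `vm_flags` and `flag_roundtrips` are hypothetical names for illustration, not kernel or library APIs. On a THP-enabled kernel the last step should leave "nh" set rather than no flag at all, i.e. MADV_NOHUGEPAGE does not hand the region back to the global tune's default.

```python
import ctypes
import mmap


def vm_flags(addr):
    """Return the VmFlags mnemonics of the VMA containing addr (Linux only)."""
    inside = False
    with open("/proc/self/smaps") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if "-" in parts[0] and not parts[0].endswith(":"):
                # VMA header line: "start-end perms offset dev inode [path]"
                lo, hi = (int(x, 16) for x in parts[0].split("-"))
                inside = lo <= addr < hi
            elif inside and parts[0] == "VmFlags:":
                return set(parts[1:])
    raise LookupError(f"no VMA contains {addr:#x}")


def flag_roundtrips():
    """Set and reset a few madvise() flags on a private anonymous mapping."""
    m = mmap.mmap(-1, 4 * mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)
    buf = ctypes.c_char.from_buffer(m)   # only used to learn the address
    addr = ctypes.addressof(buf)
    seen = {}
    try:
        m.madvise(mmap.MADV_SEQUENTIAL)  # sets VM_SEQ_READ ("sr")
        seen["sr"] = "sr" in vm_flags(addr)
        m.madvise(mmap.MADV_NORMAL)      # clears VM_SEQ_READ and VM_RAND_READ
        seen["sr_after_normal"] = "sr" in vm_flags(addr)

        m.madvise(mmap.MADV_DONTDUMP)    # sets VM_DONTDUMP ("dd")
        seen["dd"] = "dd" in vm_flags(addr)
        m.madvise(mmap.MADV_DODUMP)      # clears it again
        seen["dd_after_dodump"] = "dd" in vm_flags(addr)

        try:
            m.madvise(mmap.MADV_HUGEPAGE)    # sets VM_HUGEPAGE ("hg")
            m.madvise(mmap.MADV_NOHUGEPAGE)  # swaps it for VM_NOHUGEPAGE ("nh")
            seen["thp"] = vm_flags(addr) & {"hg", "nh"}
        except (AttributeError, OSError):    # kernel or Python built without THP
            seen["thp"] = None
    finally:
        del buf                          # release the buffer export before close()
        m.close()
    return seen


print(flag_roundtrips())
```

The sequential/dump pairs round-trip to "no flag", while the THP pair can only move between "hg" and "nh".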
The "madvise" global tune exists for when you want to save RAM and don't
care much about performance, while still allowing apps like QEMU, where
enabling THP loses no memory, to use THP. There's no way to clear either
of those two flags and bring back the default behavior of the global
sysfs tune, so at the very least it's not redundant.
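Back to the VMA-count argument at the start of this reply, a minimal Linux-only sketch of how per-page protection changes fragment a mapping, written in Python with ctypes calling mprotect(2). The helpers `vmas_in_range` and `fragment` are hypothetical names for illustration. It maps one anonymous region, makes every other interior page read-only, and counts the /proc/self/maps entries covering the region before and after; each change that doesn't line up with an existing VMA boundary splits the VMA, which is the fragmentation that userfaultfd avoids by keeping one large VMA.

```python
import ctypes
import mmap

PAGE = mmap.PAGESIZE
libc = ctypes.CDLL(None, use_errno=True)
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.mprotect.restype = ctypes.c_int


def vmas_in_range(start, end):
    """Count /proc/self/maps entries overlapping [start, end) (Linux only)."""
    count = 0
    with open("/proc/self/maps") as f:
        for line in f:
            lo, hi = (int(x, 16) for x in line.split(None, 1)[0].split("-"))
            if lo < end and hi > start:
                count += 1
    return count


def fragment(npages):
    """Map npages pages, make every odd interior page read-only, count VMAs."""
    m = mmap.mmap(-1, npages * PAGE)          # one anonymous mapping, one VMA
    buf = ctypes.c_char.from_buffer(m)        # exported pointer into it
    addr = ctypes.addressof(buf)
    before = vmas_in_range(addr, addr + npages * PAGE)
    for i in range(1, npages - 1, 2):         # interior odd pages only
        if libc.mprotect(addr + i * PAGE, PAGE, mmap.PROT_READ) != 0:
            raise OSError(ctypes.get_errno(), "mprotect failed")
    after = vmas_in_range(addr, addr + npages * PAGE)
    del buf                                   # release the export so close() works
    m.close()
    return before, after


print(fragment(9))  # (1, 9): alternating protections, one VMA per page
```

Scaling this up: 4 TiB of guest RAM in 4 KiB pages is 2^30 (about 1.07 billion) pages, so the worst case is on the order of a billion VMAs, far above the default vm.max_map_count of 65530. A red-black tree with n nodes has height between log2(n+1) and 2*log2(n+1), roughly 30 to 60 levels for 2^30 nodes, which brackets the ~40-step find_vma estimate.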