On 2/6/22 22:30, Hugh Dickins wrote:
> We have recommended some applications to mlock their userspace, but that
> turns out to be counter-productive: when many processes mlock the same
> file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
> is needed for write, to remove any vma mapping that file from rmap's tree;
> but hogged for read by those with mlocks calling page_mlock() (formerly
> known as try_to_munlock()) on *each* page mapped from the file (the
> purpose being to find out whether another process has the page mlocked,
> so therefore it should not be unmlocked yet).
> 
> Several optimizations have been made in the past: one is to skip
> page_mlock() when mapcount tells that nothing else has this page
> mapped; but that doesn't help at all when others do have it mapped.
> This time around, I initially intended to add a preliminary search
> of the rmap tree for overlapping VM_LOCKED ranges; but that gets
> messy with locking order, when in doubt whether a page is actually
> present; and risks adding even more contention on the i_mmap_rwsem.
> 
> A solution would be much easier, if only there were space in struct page
> for an mlock_count... but actually, most of the time, there is space for
> it - an mlocked page spends most of its life on an unevictable LRU, but
> since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
> been redundant. Let's try to reuse its page->lru.
> 
> But leave that until a later patch: in this patch, clear the ground by
> removing page_mlock(), and all the infrastructure that has gathered
> around it - which mostly hinders understanding, and will make reviewing
> new additions harder. Don't mind those old comments about THPs, they
> date from before 4.5's refcounting rework: splitting is not a risk here.
> 
> Just keep a minimal version of munlock_vma_page(), as reminder of what it
> should attend to (in particular, the odd way PGSTRANDED is counted out of
> PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move
> unchanged __mlock_posix_error_return() out of the way, down to above its
> caller: this series then makes no further change after mlock_fixup().
> 
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>

While I understand the reasons to clear the ground first, I wonder what
the implications are for bisectability - is there a risk of surprising
failures? Maybe we should at least spell out the implications here
explicitly. IIUC, pages that become mlocked will now stay mlocked, which
keeps them counted as Mlocked in meminfo and prevents reclaiming them.
But if e.g. a process that did mlockall() exits, its exclusive pages will
be freed anyway, so it's not a catastrophic kind of leak, right? Still,
it differs from the existing "failure modes", where pages could be left
"stranded" because isolation failed: those would at least go through
TestClearPageMlocked and the counter updates.

> 
>  /*
> @@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>   *
>   * Returns with VM_LOCKED cleared. Callers must be prepared to
>   * deal with this.
> - *
> - * We don't save and restore VM_LOCKED here because pages are
> - * still on lru. In unmap path, pages might be scanned by reclaim
> - * and re-mlocked by page_mlock/try_to_unmap before we unmap and
> - * free them. This will result in freeing mlocked pages.
>   */
>  void munlock_vma_pages_range(struct vm_area_struct *vma,
> 			     unsigned long start, unsigned long end)
>  {
> -	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

Should we at least keep clearing the flags here? I haven't checked
whether there are some VM_BUG_ONs that would trip on VM_LOCKED not being
cleared, but I wouldn't be entirely surprised.
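
To be concrete, something like the below is what I have in mind - just an
untested sketch against this patch's context, using only the existing
names, not a tested change:

void munlock_vma_pages_range(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	/*
	 * Keep clearing VM_LOCKED (and VM_LOCKONFAULT) even while the
	 * per-page munlock walk is gone, so nothing in the interim
	 * state of the series sees a stale VM_LOCKED on the vma.
	 */
	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

	/* Page-level munlock is to be reinstated later in the series. */
}

That would also keep the "Returns with VM_LOCKED cleared" comment above
it accurate for the intermediate patches.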