On Thu, 8 Oct 2020, Mike Kravetz wrote:
> On 10/7/20 8:21 PM, Hugh Dickins wrote:
> >
> > Mike, j'accuse... your 5.7 commit c0d0381ade79 ("hugetlbfs:
> > use i_mmap_rwsem for more pmd sharing synchronization"), in which
> > unmap_and_move_huge_page() now passes the TTU_RMAP_LOCKED flag to
> > try_to_unmap(), because it's already holding mapping->i_mmap_rwsem:
> > but that is not the right lock to secure an anon_vma lookup.
>
> Thanks Hugh!  Your analysis is correct and the code in that commit is
> not correct.  I was so focused on the file mapping case, I overlooked
> (actually introduced) this issue for anon mappings.
>
> Let me verify that this indeed is the root cause.  However, since
> move_pages12 migrated anon hugetlb pages it certainly does look to be
> the case.
>
> > I intended to send a patch, passing TTU_RMAP_LOCKED only in the
> > !PageAnon case (and, see vma_adjust(), anon_vma lock conveniently
> > nests inside i_mmap_rwsem); but then wondered if i_mmap_rwsem was
> > needed in that case or not, so looked deeper into c0d0381ade79.
> >
> > Hmm, not even you liked it!  But the worst of it looks simply
> > unnecessary to me, and I hope can be deleted - but better by you
> > than by me (in particular, you were trying to kill 1) and 2) birds
> > with one stone, and I've always given up on understanding hugetlb's
> > reservations: I suspect that side of it is irrelevant here,
> > but I wouldn't pretend to be sure).
> >
> > How could you ever find a PageAnon page in a vma_shareable() area?
> >
> > It is all rather confusing (vma_shareable() depending on VM_MAYSHARE,
> > whereas is_cow_mapping() on VM_SHARED and VM_MAYWRITE: they have to
> > be studied together with do_mmap()'s
> >         vm_flags |= VM_SHARED | VM_MAYSHARE;
> >         if (!(file->f_mode & FMODE_WRITE))
> >                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> >
> > (And let me add to the confusion by admitting that, prior to 3.15's
> > cda540ace6a1 "mm: get_user_pages(write,force) refuse to COW in
> > shared areas", maybe it was possible to find a PageAnon there.)
> >
> > But my belief (best confirmed by you running your tests with a
> > suitably placed BUG_ON or WARN_ON) is that you'll never find a
> > PageAnon in a vma_shareable() area, so will never need try_to_unmap()
> > to unshare a pagetable in the PageAnon case, so won't need i_mmap_rwsem
> > for PageAnon there, and _get_hugetlb_page_mapping() (your function that
> > deduces an address_space from an anon_vma) can just be deleted.
>
> Yes, it is confusing.  Let me look into this.  I would be really happy
> to delete that ugly function.
>
> > (And in passing, may I ask what hugetlb_page_mapping_lock_write()'s
> > hpage->_mapcount inc and dec are for?  You comment it as a hack,
> > but don't explain what needs that hack, and I don't see it.)
>
> We are trying to lock the mapping (mapping->i_mmap_rwsem).  We know
> mapping is valid, because we obtained it from page_mapping() and it
> will remain valid because we have the page locked.  Page needs to be
> unlocked to unmap.  However, we have to drop page lock in order to
> acquire i_mmap_rwsem.  Once we drop page lock, mapping could become
> invalid.  So, the code artificially incs mapcount so that mapping
> will remain valid when unmapping page.

No, unless you can point me to some other hugetlbfs-does-it-differently
(and I didn't see it there in that commit), raising _mapcount does not
provide any such protection; but does add the possibility of a
"BUG: Bad page cache" and leak from unaccount_page_cache_page().

Earlier in the day I was trying to work out what to recommend instead,
but had to turn aside to something else: I'll try again tomorrow.

It's a problem I've faced before in tmpfs, keeping a hold on the
mapping while page lock is dropped.  Quite awkward: igrab() looks as
if it's the right thing to use, but turns out to give no protection
against umount.  Last time around, I ended up with a stop_eviction
count in the shmem inode, which shmem_evict_inode() waits on if
necessary.  Something like that could be done for hugetlbfs too,
but I'd prefer to do it without adding extra, if there is a way.

>
> As mentioned above, I hope all this can be removed.

If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
then I think that part of hugetlb_page_mapping_lock_write() has to
remain.  I'd much prefer that hugetlbfs did not reverse the usual
nesting, but accept that you had reasons for doing it that way.

Hugh
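
For anyone following the vm_flags argument above, here is a rough
userspace sketch of it, not kernel code.  It assumes the flag values
from include/linux/mm.h, assumes a mapping starts out with all the
VM_MAY* permission bits set, and models only the flag test in
vma_shareable() (the real function also checks that the PUD-sized
range lies inside the vma).  The model_* names are made up for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as in include/linux/mm.h; this is only a userspace model. */
#define VM_SHARED   0x08UL
#define VM_MAYREAD  0x10UL
#define VM_MAYWRITE 0x20UL
#define VM_MAYEXEC  0x40UL
#define VM_MAYSHARE 0x80UL

/*
 * Hypothetical helper: models only the do_mmap() flag handling quoted
 * in the mail.  Assumption: vm_flags starts with every VM_MAY* bit set.
 */
static unsigned long model_mmap_flags(bool map_shared, bool file_writable)
{
	unsigned long vm_flags = VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

	if (map_shared) {
		vm_flags |= VM_SHARED | VM_MAYSHARE;
		if (!file_writable)
			vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
	}
	return vm_flags;
}

/* Flag half of vma_shareable(): the range check is left out here. */
static bool model_vma_shareable(unsigned long vm_flags)
{
	return (vm_flags & VM_MAYSHARE) != 0;
}

/* Same test as is_cow_mapping(): may write, but not shared. */
static bool model_is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Walk all four mmap modes; none should be both shareable and COW. */
static void model_check(void)
{
	for (int shared = 0; shared <= 1; shared++) {
		for (int writable = 0; writable <= 1; writable++) {
			unsigned long f = model_mmap_flags(shared, writable);

			assert(!(model_vma_shareable(f) &&
				 model_is_cow_mapping(f)));
		}
	}
}
```

Calling model_check() from a main() exercises the four combinations.
In this model, setting VM_MAYSHARE always goes together with either
setting VM_SHARED or clearing VM_MAYWRITE, so a shareable area is never
a COW mapping.  That says nothing about other ways an anon page might
have got in (the pre-3.15 get_user_pages case mentioned above), which
is why a WARN_ON in real tests is still the better confirmation.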