On Mon, 3 Aug 2020, Kirill A. Shutemov wrote:
> On Sun, Aug 02, 2020 at 12:16:53PM -0700, Hugh Dickins wrote:
> > Only once have I seen this scenario (and forgot even to notice what
> > forced the eventual crash): a sequence of "BUG: Bad page map" alerts
> > from vm_normal_page(), from zap_pte_range() servicing exit_mmap();
> > pmd:00000000, pte values corresponding to data in physical page 0.
> >
> > The pte mappings being zapped in this case were supposed to be from a
> > huge page of ext4 text (but could as well have been shmem): my belief
> > is that it was racing with collapse_file()'s retract_page_tables(),
> > found *pmd pointing to a page table, locked it, but *pmd had become
> > 0 by the time start_pte was decided.
> >
> > In most cases, that possibility is excluded by holding mmap lock;
> > but exit_mmap() proceeds without mmap lock.  Most of what's run by
> > khugepaged checks khugepaged_test_exit() after acquiring mmap lock:
> > khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate()
> > do so, for example.  But retract_page_tables() did not: fix that
> > (using an mm variable instead of vma->vm_mm repeatedly).
>
> Hm. I'm not sure I follow. vma->vm_mm has to be valid as long as we hold
> i_mmap lock, no? Unlinking a VMA requires it.

Ah, my wording is misleading, yes.

That comment "(using an mm variable instead of vma->vm_mm repeatedly)"
was nothing more than a note that the patch is bigger than it could be,
because I decided to use an mm variable instead of vma->vm_mm repeatedly.
But it looks as if I'm saying there used to be a need for READ_ONCE()
or something, and that by using the mm variable I was fixing the problem.
No, sorry: delete that line now that the point is made; the mm variable
is just a patch detail, it's not important.

The fix (as the subject suggested) is for retract_page_tables() to
check khugepaged_test_exit(), after acquiring mmap lock, before doing
anything to the page table.  Getting the mmap lock serializes with
__mmput(), which briefly takes and drops it in __khugepaged_exit();
then the khugepaged_test_exit() check on mm_users makes sure we don't
touch the page table once exit_mmap() might reach it, since exit_mmap()
will be proceeding without mmap lock, not expecting anyone to be racing
with it.

(I devised that protocol for ksmd, then Andrea adopted it for
khugepaged: back then it was important for these daemons to have a hold
on the mm, without an actual reference to mm_users, because that would
prevent the OOM killer from reaching exit_mmap().  Nowadays with the
OOM reaper, it's probably less crucial to avoid mm_users, but I think
it's still worthwhile.)

Thanks a lot for looking at these patches so quickly,
Hugh
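
P.S. For anyone following along, a rough sketch of the pattern described
above.  khugepaged_test_exit() below matches the mm/khugepaged.c helper;
the surrounding retract_page_tables() fragment and the
collapse_and_free_pmd() call are only illustrative, not the actual patch:

	/* The exit test is simply a check on mm_users: */
	static inline int khugepaged_test_exit(struct mm_struct *mm)
	{
		return atomic_read(&mm->mm_users) == 0;
	}

	/* In retract_page_tables(), for each vma found under i_mmap lock: */
	struct mm_struct *mm = vma->vm_mm;

	if (mmap_write_trylock(mm)) {
		/*
		 * Holding mmap lock serializes with __khugepaged_exit(),
		 * which takes and drops it in __mmput(); the mm_users
		 * check then tells us exit_mmap() cannot reach this page
		 * table while we hold the lock, so it is safe to clear
		 * *pmd and free the page table.
		 */
		if (!khugepaged_test_exit(mm))
			collapse_and_free_pmd(mm, vma, addr, pmd);
		mmap_write_unlock(mm);
	}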