On Tue, Jun 01, 2021 at 02:13:21PM -0700, Hugh Dickins wrote:
> Running certain tests with a DEBUG_VM kernel would crash within hours,
> on the total_mapcount BUG() in split_huge_page_to_list(), while trying
> to free up some memory by punching a hole in a shmem huge page: split's
> try_to_unmap() was unable to find all the mappings of the page (which,
> on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
>
> Crash dumps showed two tail pages of a shmem huge page remained mapped
> by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
> of a long unmapped range; and no page table had yet been allocated for
> the head of the huge page to be mapped into.
>
> Although designed to handle these odd misaligned huge-page-mapped-by-pte
> cases, page_vma_mapped_walk() falls short by returning false prematurely
> when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
> are cases when a huge page may span the boundary, with ptes present in
> the next.

Oh. My bad. I guess it was a pain to debug.

> Restructure page_vma_mapped_walk() as a loop to continue in these cases,
> while keeping its layout much as before. Add a step_forward() helper to
> advance pvmw->address across those boundaries: originally I tried to use
> mm's standard p?d_addr_end() macros, but hit the same crash 512 times
> less often: because of the way redundant levels are folded together,
> but folded differently in different configurations, it was just too
> difficult to use them correctly; and step_forward() is simpler anyway.
>
> Merged various other minor fixes and cleanups into page_vma_mapped_walk()
> as I worked on it: which I find much easier to enumerate here than to
> prise apart into separate commits.

But it makes it harder to review...

> Handle all of the hugetlbfs PageHuge() case once at the start,
> so we don't need to worry about it again further down.
>
> Sometimes local copy of pvmw->page was used, sometimes pvmw->page:
> just use pvmw->page throughout (and continue to use pvmw->address
> throughout, though we could take a local copy instead).
>
> Use pmd_read_atomic() with barrier() instead of READ_ONCE() for pmde:
> some architectures (e.g. i386 with PAE) have a multi-word pmd entry,
> for which READ_ONCE() is not good enough.
>
> Re-evaluate pmde after taking lock, then use it in subsequent tests,
> instead of repeatedly dereferencing pvmw->pmd pointer.
>
> Rearrange the !pmd_present block to follow the same "return not_found,
> return not_found, return true" pattern as the block above it (note:
> returning not_found there is never premature, since the existence or
> prior existence of a huge pmd guarantees good alignment).
>
> Adjust page table boundary test in case address was not page-aligned.
>
> Reset pvmw->pte to NULL after unmapping that page table.
>
> Respect the 80-column line limit.
>
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>

I tried to review it and superficially it looks good, but it has to be
split into a bunch of patches.

>          /* when pud is not present, pte will be NULL */
> -        pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> +        pvmw->pte = huge_pte_offset(mm, pvmw->address,
> +                                    page_size(pvmw->page));

AFAICS, it fits exactly into 80 columns.
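Returning to the step_forward() helper mentioned in the commit message: as
described, it only has to round pvmw->address up to the next boundary of a
given size. A minimal sketch of such a helper, reconstructed from that
description rather than taken from the patch itself (the real code may
differ in detail), could look like:

        /*
         * Sketch: advance pvmw->address to the next size-aligned boundary
         * (size is a power of two such as PAGE_SIZE or PMD_SIZE), and
         * saturate at ULONG_MAX if the addition wraps past the top of the
         * address space.
         */
        static void step_forward(struct page_vma_mapped_walk *pvmw,
                                 unsigned long size)
        {
                pvmw->address = (pvmw->address + size) & ~(size - 1);
                if (!pvmw->address)
                        pvmw->address = ULONG_MAX;
        }

The restructured loop would then call something like step_forward(pvmw,
PMD_SIZE) where it previously returned false on a missing entry, and keep
walking while pvmw->address is still within the page's mapping in the vma.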
>          if (!pvmw->pte)
>                  return false;
>
> -        pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
> +        pvmw->ptl = huge_pte_lockptr(page_hstate(pvmw->page),
> +                                     mm, pvmw->pte);

And this one ends at 79. Hm?

--
 Kirill A. Shutemov
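On the pmd_read_atomic() point in the commit message above: with a
multi-word pmd entry (e.g. i386 with PAE), a plain READ_ONCE() of
*pvmw->pmd can observe a torn value, which is why the description swaps it
for pmd_read_atomic() plus a compiler barrier. The idiom being described is
roughly the following fragment (a sketch of the pattern, not the exact hunk
from the patch):

        pmd_t pmde;

        /*
         * pmd_read_atomic() reads a possibly multi-word pmd entry without
         * tearing; barrier() then keeps the compiler from re-reading
         * *pvmw->pmd and using a different value in the tests that follow,
         * which is the guarantee READ_ONCE() would otherwise provide.
         */
        pmde = pmd_read_atomic(pvmw->pmd);
        barrier();

Re-evaluating pmde after taking the pmd lock, as the message also says,
then lets the later tests work on that stable local copy instead of
dereferencing pvmw->pmd repeatedly.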