On 04/03/2024 21:57, Barry Song wrote:
> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> Hi Barry,
>>
>> On 04/03/2024 10:37, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@xxxxxxxx>
>>>
>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>> PTEs modification such as break-before-make, while iterating PTEs
>>> of a large folio, it will only begin to acquire PTL after it gets
>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>> in try_to_unmap_one().
>>
>> I just want to check my understanding here - I think the problem occurs for
>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>> that I've had a look at the code and have a better understanding, I think that
>> must be the case? And therefore this problem exists independently of my work to
>> support swap-out of mTHP? (From your previous report I was under the impression
>> that it only affected mTHP).
>
> I think this affects all large folios with PTEs entries more than 1. but hugeTLB
> is handled as a whole in try_to_unmap_one and its rmap is removed all
> together, i feel hugeTLB doesn't have this problem.
>
>>
>> Its just that the problem is becoming more pronounced because with mTHP,
>> PTE-mapped large folios are much more common?
>
> right. as now large folios become a more common case, and it is my case
> running in millions of phones.
>
> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim
> all PTEs all together rather than iterating PTEs one by one. This will improve
> performance. for example, a batched
> set_ptes_to_swap_entries()
> {
> }
> then we only need to loop once for a large folio, right now we are looping
> nr_pages times.

You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch
implementation.
HugeTLB ptes are all a fixed size for a given VMA, which makes things a bit
easier too, whereas in the regular mm, they are now a variable size.

David and I introduced folio_pte_batch() to help gather batches of ptes, and it
uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding
swap_pte_batch() which does a similar thing for swap entry batching in v4 of my
swap-out series.

For your set_ptes_to_swap_entries() example, I'm not sure what it would do other
than loop over the PTEs setting an incremented swap entry to each one? How is
that more performant?

Thanks,
Ryan
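
To make the set_ptes_to_swap_entries() question concrete, here is a minimal
user-space sketch of what such a batched helper might look like. Everything in
it is an assumption for illustration only: struct model_pte is a toy stand-in
(not the kernel's pte_t), the swap entry is reduced to a bare offset, and the
signature is hypothetical since the thread only gives the function name. It
shows that the per-PTE loop does not disappear; it just moves from the caller
into the helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a PTE; NOT the kernel's pte_t. */
struct model_pte {
	bool present;	/* true: maps a page, false: holds a swap entry */
	uint64_t val;	/* page frame number, or swap offset if !present */
};

/*
 * Hypothetical batched helper, named after the example in the thread.
 * Replaces nr PTEs with consecutive swap offsets starting at
 * first_swap_off and returns the number of PTEs written.
 */
static size_t set_ptes_to_swap_entries(struct model_pte *ptep, size_t nr,
				       uint64_t first_swap_off)
{
	size_t i;

	assert(ptep != NULL);

	/* Still one store per PTE: the loop has only moved into the helper. */
	for (i = 0; i < nr; i++) {
		ptep[i].present = false;
		ptep[i].val = first_swap_off + i;
	}

	return i;
}
```

With this model the caller makes a single call per large folio, but the helper
still performs nr stores. Any real saving would have to come from work that can
be done once per batch rather than once per PTE, which this sketch does not
attempt to model.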