Barry Song <21cnbao@xxxxxxxxx> writes:

> On Tue, Mar 5, 2024 at 8:30 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Barry Song <21cnbao@xxxxxxxxx> writes:
>>
>> > From: Barry Song <v-songbaohua@xxxxxxxx>
>> >
>> > page_vma_mapped_walk() within try_to_unmap_one() races with other
>> > PTE modifications such as break-before-make while iterating PTEs
>>
>> Sorry, I don't know what "break-before-make" is, can you elaborate?
>> IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear a
>> PTE temporarily, which may cause a race with page_vma_mapped_walk().
>> Is that the issue that you try to fix?
>
> we are writing the pte to zero (break) before writing a new value (make).

OK.  Is "break and make" commonly used terminology in the kernel?  If
not, it's better to explain a little (e.g., ptep_get_and_clear() /
modify / set_pte_at()).

> while this behavior happens under PTL in another thread, the
> page_vma_mapped_walk() of the try_to_unmap_one thread won't take PTL
> till it meets a present PTE.

IIUC, !pte_none() should be enough?

> for example, if other threads are modifying nr_pages PTEs under PTL,
> but we don't hold PTL, we might skip one or two PTEs at the beginning
> of a large folio.
> For a large folio, after try_to_unmap_one(), we may end up with PTE0
> and PTE1 untouched but PTE2 ~ PTE(nr_pages - 1) set to swap entries.
>
> by holding PTL from PTE0 for large folios, we won't see these
> intermediate values.  By the moment we get PTL, other threads are done.

Got it!  Thanks!

--
Best Regards,
Huang, Ying

>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> > of a large folio, it will only begin to acquire PTL after it gets
>> > a valid (present) PTE.  break-before-make intermediately sets PTEs
>> > to pte_none.  Thus, a large folio's PTEs might be partially skipped
>> > in try_to_unmap_one().
>> >
>> > For example, for an anon folio, after try_to_unmap_one(), we may
>> > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>> > So the folio will still be mapped, and the folio fails to be
>> > reclaimed.  What's even more worrying is that its PTEs are no
>> > longer in a unified state.  This might lead to an accidental
>> > folio_split() afterwards.  And since a part of the PTEs are now
>> > swap entries, accessing them will incur a page fault -
>> > do_swap_page.  It creates both anxiety and more expense.  While we
>> > can't avoid userspace's unmap breaking up unified PTEs such as
>> > CONT-PTE for a large folio, we can indeed keep the kernel from
>> > breaking them up due to its own code design.
>> >
>> > This patch holds PTL from PTE0, so the folio will either be
>> > entirely reclaimed or entirely kept.  On the other hand, this
>> > approach doesn't increase PTL contention.  Even w/o the patch,
>> > page_vma_mapped_walk() will always get PTL after it sometimes
>> > skips one or two PTEs, because intermediate break-before-makes
>> > are short, according to testing.  Of course, even w/o this patch,
>> > the vast majority of try_to_unmap_one calls still get PTL from
>> > PTE0.  This patch makes that number 100%.
>> >
>> > The other option is that we can give up in try_to_unmap_one once
>> > we find PTE0 is not the first entry for which we get PTL, calling
>> > page_vma_mapped_walk_done() to end the iteration in this case.
>> > This keeps the PTEs unified while the folio isn't reclaimed.  The
>> > result is quite similar to small folios with one PTE - either
>> > entirely reclaimed or entirely kept.  Reclaiming large folios by
>> > holding PTL from PTE0 seems a better option compared to giving up
>> > after detecting that PTL begins from a non-PTE0 entry.
>> >
>> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
>> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
>> > ---
>> >  mm/vmscan.c | 11 +++++++++++
>> >  1 file changed, 11 insertions(+)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 0b888a2afa58..e4722fbbcd0c 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >
>> >  		if (folio_test_pmd_mappable(folio))
>> >  			flags |= TTU_SPLIT_HUGE_PMD;
>> > +		/*
>> > +		 * If the page table lock is not held from the first PTE of
>> > +		 * a large folio, some PTEs might be skipped because of
>> > +		 * races with break-before-make; for example, PTEs can
>> > +		 * be pte_none intermediately, thus one or more PTEs
>> > +		 * might be skipped in try_to_unmap_one, and we might end
>> > +		 * up with a large folio that is partially mapped and
>> > +		 * partially unmapped after try_to_unmap.
>> > +		 */
>> > +		if (folio_test_large(folio))
>> > +			flags |= TTU_SYNC;
>> >
>> >  		try_to_unmap(folio, flags);
>> >  		if (folio_mapped(folio)) {
>
> Thanks
> Barry
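
For readers who want to see the race in isolation, here is a minimal
userspace sketch, not kernel code: a pthread mutex stands in for the
PTL, a plain long array for the folio's page table entries, and the
names (bbm_thread, unmap, NR_PAGES) are invented for illustration.
Without "sync", the scanner skips transiently-zero entries before
taking the lock, so it can leave entry 0 untouched while turning the
rest into "swap entries"; taking the lock from entry 0 first, which is
what TTU_SYNC makes page_vma_mapped_walk() do, closes that window.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_PAGES 4

    /* pte[] models a large folio's page table entries; 0 plays pte_none. */
    static long pte[NR_PAGES] = { 1, 2, 3, 4 };
    static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Break-before-make under the lock, in the spirit of
     * ptep_get_and_clear() / modify / set_pte_at(): each entry is
     * transiently zero between the "break" and the "make".
     */
    static void *bbm_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&ptl);
            for (int i = 0; i < NR_PAGES; i++) {
                    pte[i] = 0;             /* break: momentarily pte_none */
                    pte[i] = 100 + i;       /* make: install the new entry */
            }
            pthread_mutex_unlock(&ptl);
            return NULL;
    }

    /*
     * Models try_to_unmap_one(): turn present entries into "swap
     * entries" (-1 here).  Without sync it first skips zero entries
     * locklessly, as page_vma_mapped_walk() does without TTU_SYNC, so
     * a transiently-zero entry 0 can be skipped for good.  (Real code
     * would need atomics for the lockless peek; this only sketches
     * the ordering.)
     */
    static void unmap(bool sync)
    {
            int i = 0;

            if (!sync)
                    while (i < NR_PAGES && pte[i] == 0)
                            i++;

            pthread_mutex_lock(&ptl);
            for (; i < NR_PAGES; i++)
                    pte[i] = -1;
            pthread_mutex_unlock(&ptl);
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, bbm_thread, NULL);
            unmap(false);           /* unmap(true) models TTU_SYNC */
            pthread_join(t, NULL);

            /*
             * With unmap(false), an unlucky interleaving prints
             * pte[0] = 100 (still "mapped") while pte[1..3] are -1 --
             * the partially-unmapped large folio described above.
             * With unmap(true) the outcome is always uniform: all -1
             * or all 100+i.
             */
            for (int i = 0; i < NR_PAGES; i++)
                    printf("pte[%d] = %ld\n", i, pte[i]);
            return 0;
    }

Build with "cc -pthread" and run it in a loop to hunt for the mixed
state; the interleaving is rare because the break window is short,
which mirrors the observation above that intermediate
break-before-makes are brief.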