On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.03.24 13:20, Ryan Roberts wrote:
> > Hi Barry,
> >
> > On 04/03/2024 10:37, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>
> >> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >> PTEs modification such as break-before-make, while iterating PTEs
> >> of a large folio, it will only begin to acquire PTL after it gets
> >> a valid(present) PTE. break-before-make intermediately sets PTEs
> >> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >> in try_to_unmap_one().
> >
> > I just want to check my understanding here - I think the problem occurs for
> > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> > that I've had a look at the code and have a better understanding, I think that
> > must be the case? And therefore this problem exists independently of my work to
> > support swap-out of mTHP? (From your previous report I was under the impression
> > that it only affected mTHP).
> >
> > Its just that the problem is becoming more pronounced because with mTHP,
> > PTE-mapped large folios are much more common?
>
> That is my understanding.
>
> >
> >> For example, for an anon folio, after try_to_unmap_one(), we may
> >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >> So folio will be still mapped, the folio fails to be reclaimed.
> >> What’s even more worrying is, its PTEs are no longer in a unified
> >> state. This might lead to accident folio_split() afterwards. And
> >> since a part of PTEs are now swap entries, accessing them will
> >> incur page fault - do_swap_page.
> >> It creates both anxiety and more expense. While we can't avoid
> >> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >> a large folio, we can indeed keep away from kernel's breaking up
> >> them due to its code design.
> >> This patch is holding PTL from PTE0, thus, the folio will either
> >> be entirely reclaimed or entirely kept. On the other hand, this
> >> approach doesn't increase PTL contention. Even w/o the patch,
> >> page_vma_mapped_walk() will always get PTL after it sometimes
> >> skips one or two PTEs because intermediate break-before-makes
> >> are short, according to test. Of course, even w/o this patch,
> >> the vast majority of try_to_unmap_one still can get PTL from
> >> PTE0. This patch makes the number 100%.
> >> The other option is that we can give up in try_to_unmap_one
> >> once we find PTE0 is not the first entry we get PTL, we call
> >> page_vma_mapped_walk_done() to end the iteration at this case.
> >> This will keep the unified PTEs while the folio isn't reclaimed.
> >> The result is quite similar with small folios with one PTE -
> >> either entirely reclaimed or entirely kept.
> >> Reclaiming large folios by holding PTL from PTE0 seems a better
> >> option comparing to giving up after detecting PTL begins from
> >> non-PTE0.
> >>
>
> I'm sure that wall of text can be formatted in a better way :) . Also, I
> think we can drop some of the details,
>
> If you need some inspiration, I can give it a shot.
>
> >> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> >> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > Do we need a Fixes tag?

I am not quite sure which commit should be referenced in a Fixes tag.
I think it's more of an optimization.

>
> What would be the description of the problem we are fixing?
>
> 1) failing to unmap?
>
> That can happen with small folios as well IIUC.
>
> 2) Putting the large folio on the deferred split queue?
>
> That sounds more reasonable.

I don't feel it is reasonable. Keeping the kernel's own code from
accidentally splitting a large folio is the more reasonable approach,
as there is always a price to pay for splitting, unfolding PTEs and
so on. While we can't avoid the splits coming from userspace's
MADV_DONTNEED, munmap and mprotect, we do have a way to ensure the
kernel itself doesn't accidentally break up a large folio.

In OPPO's phones, we ran into some weird bugs due to skipped PTEs in
try_to_unmap_one, and we could hardly fix them from the root cause.
With the various races involved, figuring out their timings was a
real pain :-)

But we did "resolve" those bugs by leaving all PTEs entirely untouched
whenever we found some PTEs had been skipped in try_to_unmap_one [1]:
when we see that we only got the PTL from the 2nd or 3rd PTE rather
than the 1st, we give up on try_to_unmap_one entirely and leave all
PTEs untouched.

        /* we are not starting from head */
        if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
                ret = false;
                atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
                set_pte_at(mm, address, pvmw.pte, pteval);
                page_vma_mapped_walk_done(&pvmw);
                break;
        }

This ensures all PTEs still have a unified state, such as CONT-PTE,
after try_to_unmap fails. But I feel it can have false positives,
because when racing with unmap, the 1st PTE might really have become
pte_none. So explicitly holding the PTL from the 1st PTE seems a
better way.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c#L1730
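To make it concrete why TTU_SYNC helps here, below is a rough sketch of
the first-PTE lookup that page_vma_mapped_walk() performs. It is
paraphrased from my reading of map_pte() in mm/page_vma_mapped.c, not
verbatim - the helper name and the trimming of migration/device-private
handling are mine. try_to_unmap_one() translates TTU_SYNC into PVMW_SYNC
for this walk:

        /* Simplified sketch, not the exact upstream code. */
        static bool map_pte_sketch(struct page_vma_mapped_walk *pvmw,
                                   spinlock_t **ptlp)
        {
                if (pvmw->flags & PVMW_SYNC) {
                        /*
                         * TTU_SYNC/PVMW_SYNC: take the PTL before inspecting
                         * the PTE, so a concurrent break-before-make holding
                         * the same PTL cannot make us skip this entry - we
                         * simply wait for it to finish.
                         */
                        pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm,
                                        pvmw->pmd, pvmw->address, ptlp);
                        return !!pvmw->pte;
                }

                /*
                 * Lockless fast path: look at the PTE without the PTL. If the
                 * PTE is transiently pte_none because another thread is in the
                 * middle of a break-before-make, we return false and the
                 * caller just advances to the next PTE of the folio - this is
                 * how PTE0 (or any other PTE) gets skipped without TTU_SYNC.
                 */
                pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
                                                  pvmw->address, ptlp);
                if (!pvmw->pte)
                        return false;
                if (!pte_present(ptep_get(pvmw->pte)))
                        return false;

                /* Only a present PTE makes us take the PTL and start unmapping. */
                spin_lock(*ptlp);
                return true;
        }

With PVMW_SYNC, the walk blocks on the PTL held by whoever is doing the
break-before-make instead of skipping the transiently-none entries, so
once try_to_unmap_one starts clearing PTEs it really starts from PTE0
and the folio ends up either fully unmapped or left fully mapped.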
>
> >> ---
> >>   mm/vmscan.c | 11 +++++++++++
> >>   1 file changed, 11 insertions(+)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 0b888a2afa58..e4722fbbcd0c 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>
> >>                 if (folio_test_pmd_mappable(folio))
> >>                         flags |= TTU_SPLIT_HUGE_PMD;
> >> +               /*
> >> +                * if page table lock is not held from the first PTE of
> >> +                * a large folio, some PTEs might be skipped because of
> >> +                * races with break-before-make, for example, PTEs can
> >> +                * be pte_none intermediately, thus one or more PTEs
> >> +                * might be skipped in try_to_unmap_one, we might result
> >> +                * in a large folio is partially mapped and partially
> >> +                * unmapped after try_to_unmap
> >> +                */
> >> +               if (folio_test_large(folio))
> >> +                       flags |= TTU_SYNC;
> >
> > This looks sensible to me after thinking about it for a while. But I also have a
> > gut feeling that there might be some more subtleties that are going over my
> > head, since I'm not expert in this area. So will leave others to provide R-b :)
> >
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (nonswap entry?) to indicate that a
> PTE is currently undergoing protection changes. So we'd avoid the
> pte_none() temporarily, if possible.
>
> Without that, TTU_SYNC feels like the right thing to do.
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry