On Thu, Mar 7, 2024 at 7:13 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 07/03/2024 10:54, David Hildenbrand wrote:
> > On 07.03.24 11:54, David Hildenbrand wrote:
> >> On 07.03.24 11:50, Ryan Roberts wrote:
> >>> On 07/03/2024 09:33, Barry Song wrote:
> >>>> On Thu, Mar 7, 2024 at 10:07 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>
> >>>>> On 07/03/2024 08:10, Barry Song wrote:
> >>>>>> On Thu, Mar 7, 2024 at 9:00 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hey Barry,
> >>>>>>>
> >>>>>>> Thanks for taking the time to review!
> >>>>>>>
> >>>>>>> On Thu, Mar 7, 2024 at 3:00 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Mar 7, 2024 at 7:15 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>> [...]
> >>>>>>>>> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> >>>>>>>>> +						 struct folio *folio,
> >>>>>>>>> +						 pte_t *start_pte)
> >>>>>>>>> +{
> >>>>>>>>> +	int nr_pages = folio_nr_pages(folio);
> >>>>>>>>> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> >>>>>>>>> +
> >>>>>>>>> +	for (int i = 0; i < nr_pages; i++)
> >>>>>>>>> +		if (page_mapcount(folio_page(folio, i)) != 1)
> >>>>>>>>> +			return false;
> >>>>>>>>
> >>>>>>>> We have moved to folio_estimated_sharers, though it is not precise, so
> >>>>>>>> we don't do this check with lots of loops depending on each subpage's
> >>>>>>>> mapcount.
> >>>>>>>
> >>>>>>> If we don't check the subpage's mapcount, and there is a cow folio
> >>>>>>> associated with this folio and the cow folio has a smaller size than
> >>>>>>> this folio, should we still mark this folio as lazyfree?
> >>>>>>
> >>>>>> I agree, this is true. However, we've somehow accepted the fact that
> >>>>>> folio_likely_mapped_shared can result in false negatives or false
> >>>>>> positives to balance the overhead. So I really don't know :-)
> >>>>>>
> >>>>>> Maybe David and Vishal can give some comments here.
> >>>>>>
> >>>>>>>
> >>>>>>>> BTW, do we need to rebase our work against David's changes[1]?
> >>>>>>>> [1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@xxxxxxxxxx/
> >>>>>>>
> >>>>>>> Yes, we should rebase our work against David's changes.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> +
> >>>>>>>>> +	return nr_pages == folio_pte_batch(folio, addr, start_pte,
> >>>>>>>>> +					  ptep_get(start_pte), nr_pages,
> >>>>>>>>> +					  flags, NULL);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>  				   unsigned long end, struct mm_walk *walk)
> >>>>>>>>>
> >>>>>>>>> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>  		 */
> >>>>>>>>>  		if (folio_test_large(folio)) {
> >>>>>>>>>  			int err;
> >>>>>>>>> +			unsigned long next_addr, align;
> >>>>>>>>>
> >>>>>>>>> -			if (folio_estimated_sharers(folio) != 1)
> >>>>>>>>> -				break;
> >>>>>>>>> -			if (!folio_trylock(folio))
> >>>>>>>>> -				break;
> >>>>>>>>> +			if (folio_estimated_sharers(folio) != 1 ||
> >>>>>>>>> +			    !folio_trylock(folio))
> >>>>>>>>> +				goto skip_large_folio;
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't think we can skip all the PTEs for nr_pages, as some of them
> >>>>>>>> might be pointing to other folios.
> >>>>>>>>
> >>>>>>>> For example, for a large folio with 16 PTEs, you do MADV_DONTNEED(15-16)
> >>>>>>>> and write the memory of PTE15 and PTE16; you get page faults, thus PTE15
> >>>>>>>> and PTE16 will point to two different small folios. We can only skip
> >>>>>>>> when we are sure that nr_pages == folio_pte_batch().
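
To make that concrete: something along the lines of the guard below is the
minimum needed before the whole range can be skipped. Untested sketch; it
just reuses the folio_pte_batch() call and the split_large_folio label from
the patch above, nothing else new is assumed:

	int nr_pages = folio_nr_pages(folio);
	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;

	/*
	 * Only treat the range as a single folio when every one of the
	 * nr_pages PTEs still belongs to it; after a partial
	 * MADV_DONTNEED plus write refault, some PTEs may already map
	 * other, smaller folios.
	 */
	if (folio_pte_batch(folio, addr, pte, ptep_get(pte),
			    nr_pages, flags, NULL) != nr_pages)
		goto split_large_folio;
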
> >>>>>>>
> >>>>>>> Agreed. Thanks for pointing that out.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> +
> >>>>>>>>> +			align = folio_nr_pages(folio) * PAGE_SIZE;
> >>>>>>>>> +			next_addr = ALIGN_DOWN(addr + align, align);
> >>>>>>>>> +
> >>>>>>>>> +			/*
> >>>>>>>>> +			 * If we mark only the subpages as lazyfree, or
> >>>>>>>>> +			 * cannot mark the entire large folio as lazyfree,
> >>>>>>>>> +			 * then just split it.
> >>>>>>>>> +			 */
> >>>>>>>>> +			if (next_addr > end || next_addr - addr != align ||
> >>>>>>>>> +			    !can_mark_large_folio_lazyfree(addr, folio, pte))
> >>>>>>>>> +				goto split_large_folio;
> >>>>>>>>> +
> >>>>>>>>> +			/*
> >>>>>>>>> +			 * Avoid unnecessary folio splitting if the large
> >>>>>>>>> +			 * folio is entirely within the given range.
> >>>>>>>>> +			 */
> >>>>>>>>> +			folio_clear_dirty(folio);
> >>>>>>>>> +			folio_unlock(folio);
> >>>>>>>>> +			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> >>>>>>>>> +				ptent = ptep_get(pte);
> >>>>>>>>> +				if (pte_young(ptent) || pte_dirty(ptent)) {
> >>>>>>>>> +					ptent = ptep_get_and_clear_full(
> >>>>>>>>> +						mm, addr, pte, tlb->fullmm);
> >>>>>>>>> +					ptent = pte_mkold(ptent);
> >>>>>>>>> +					ptent = pte_mkclean(ptent);
> >>>>>>>>> +					set_pte_at(mm, addr, pte, ptent);
> >>>>>>>>> +					tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>>> +				}
> >>>>>>>>
> >>>>>>>> Can we do this in batches? For a CONT-PTE mapped large folio, you are
> >>>>>>>> unfolding and folding again. It seems quite expensive.
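
(For illustration, a hypothetical batched primitive like the one below is
what I have in mind - untested sketch, and no such generic API exists in
mainline today; the name mkold_clean_ptes() is made up. TLB maintenance is
left to the caller, as in the loop above.)

/*
 * Clear young and dirty on nr contiguous PTEs mapping the same folio.
 * Generic fallback; an arch such as arm64 could override this with a
 * contpte-aware version that touches the CONT-PTE block once instead
 * of unfolding and refolding for every single PTE.
 */
static inline void mkold_clean_ptes(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, unsigned int nr, int full)
{
	pte_t pte;

	for (; nr-- > 0; ptep++, addr += PAGE_SIZE) {
		pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		pte = pte_mkclean(pte_mkold(pte));
		set_pte_at(mm, addr, ptep, pte);
	}
}

With that, the per-PTE loop above would collapse to one call, i.e.
mkold_clean_ptes(mm, addr, pte, nr_pages, tlb->fullmm), plus one batched
TLB flush for the whole range.
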
> >>>>>
> >>>>> I'm not convinced we should be doing this in batches. We want the initial
> >>>>> folio_pte_batch() to be as loose as possible regarding permissions so that
> >>>>> we reduce our chances of splitting folios to the minimum (e.g. ignore SW
> >>>>> bits like soft dirty, etc). I think it might be possible that some PTEs
> >>>>> are RO and others RW too (e.g. due to cow - although with the current cow
> >>>>> impl, probably not. But it's fragile to assume that). Anyway, if we do an
> >>>>> initial batch that ignores all
> >>>>
> >>>> You are correct. I believe this scenario could indeed occur. For instance,
> >>>> if process A forks process B and then unmaps itself, leaving B as the
> >>>> sole process owning the large folio. The current wp_page_reuse() function
> >>>> will reuse PTEs one by one as each subpage is written.
> >>>
> >>> Hmm - I thought it would only reuse if the total mapcount for the folio
> >>> was 1. And since it is a large folio with each page mapped once in proc B,
> >>> I thought every subpage write would cause a copy except the last one? I
> >>> haven't looked at the code for a while. But I had it in my head that this
> >>> is an area we need to improve for mTHP. So sad I am wrong again 😢
> >>
> >> wp_page_reuse() will currently reuse a PTE part of a large folio only if
> >> a single PTE remains mapped (refcount == 0).
> >
> > ^ == 1

Seems this needs improvement. It is a waste that only the last subpage can
reuse the whole large folio. I was doing it in quite a different way: if the
large folio had only one subpage left, I would do the copy and release the
large folio [1]; and if I could reuse the whole large folio with CONT-PTE, I
would reuse the whole large folio [2]. In mainline, we don't have this
CONT-PTE luxury exposed to mm, so I guess we cannot do [2] easily, but [1]
still seems to be a worthwhile optimization.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L3977
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L3812

>
> Ahh yes. That's what I meant. I got the behaviour vaguely right though.
>
> Anyway, regardless, I'm not sure we want to batch here. Or if we do, we want
> a batched function that will only clear access and dirty.
>

Thanks
Barry
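
P.S. For completeness, the reuse condition being discussed is roughly the
following - my paraphrase of mainline do_wp_page()'s anon reuse check,
heavily simplified and untested (the real code also takes the folio lock,
drains the LRU, and handles the swapcache):

	/*
	 * An anon folio is only reused in place when we hold the last
	 * reference. For a large folio that means a write can reuse a
	 * PTE only once a single mapped subpage remains; every earlier
	 * subpage write ends up copying.
	 */
	if (folio_test_ksm(folio) || folio_ref_count(folio) != 1)
		return false;		/* must copy */
	folio_move_anon_rmap(folio, vma);
	return true;			/* safe to reuse for this write */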