Hi David, On 8/4/2023 3:31 PM, David Hildenbrand wrote: > On 04.08.23 02:17, Yin, Fengwei wrote: >> >> >> On 8/4/2023 7:38 AM, Yu Zhao wrote: >>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote: >>>> >>>> >>>> >>>> On 8/4/2023 4:46 AM, Yu Zhao wrote: >>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote: >>>>>> >>>>>> " >>>>>> >>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote: >>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote: >>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote: >>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote: >>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's >>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio. >>>>>>>>>>>> >>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough. >>>>>>>>>>>> >>>>>>>>>>>> Yin Fengwei (2): >>>>>>>>>>>> madvise: don't use mapcount() against large folio for sharing check >>>>>>>>>>>> madvise: don't use mapcount() against large folio for sharing check >>>>>>>>>>>> >>>>>>>>>>>> mm/huge_memory.c | 2 +- >>>>>>>>>>>> mm/madvise.c | 6 +++--- >>>>>>>>>>>> 2 files changed, 4 insertions(+), 4 deletions(-) >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so: >>>>>>>>>>> >>>>>>>>>>> Reviewed-By: Ryan Roberts >>>>>>>>>> Thanks. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> But I have a couple of comments around further improvements; >>>>>>>>>>> >>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise >>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that >>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the >>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we >>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these >>>>>>>>>>> call sites again. >>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done. >>>>>>>>> >>>>>>>>> What you have is definitely an improvement over what was there before. And is >>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this. >>>>>>>>> Just pointing out that we will be able to make it even better later on (if >>>>>>>>> David's stuff goes in). >>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that. >>>>>>> >>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement >>>>>>> later. >>>>>> Let's wait for David's work ready. >>>>> >>>>> Waiting is fine as long as we don't miss the next merge window -- we >>>>> don't want these two bugs to get into another release. Also I think we >>>>> should cc stable, since as David mentioned, they have been causing >>>>> selftest failures. >>>> >>>> Stable was CCed. >>> >>> Need to add the "Cc: stable@xxxxxxxxxxxxxxx" tag: >>> Documentation/process/stable-kernel-rules.rst >> OK. Thanks for clarification. I totally mis-understanded this. :). >> >> I'd like to wait for answer from Andrew whether these patches are suitable >> for stable (I suppose you think so) branch. > > Note that the COW test does not fail -- it skips -- but the behavir changed: > > $ ./cow > # [INFO] detected THP size: 2048 KiB > # [INFO] detected hugetlb page size: 2048 KiB > # [INFO] detected hugetlb page size: 1048576 KiB > # [INFO] huge zeropage is enabled > TAP version 13 > 1..190 > # [INFO] Anonymous memory tests in private mappings > # [RUN] Basic COW after fork() ... with base page > ok 1 No leak from parent into child > # [RUN] Basic COW after fork() ... with swapped out base page > ok 2 No leak from parent into child > # [RUN] Basic COW after fork() ... with THP > ok 3 No leak from parent into child > # [RUN] Basic COW after fork() ... with swapped-out THP > ok 4 No leak from parent into child > # [RUN] Basic COW after fork() ... with PTE-mapped THP > ok 5 No leak from parent into child > # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP > ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled? > # [RUN] Basic COW after fork() ... with single PTE of THP > ok 7 No leak from parent into child > # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP > ok 8 No leak from parent into child > # [RUN] Basic COW after fork() ... with partially mremap()'ed THP > ok 9 No leak from parent into child > # [RUN] Basic COW after fork() ... with partially shared THP > ok 10 No leak from parent into child > ... > > Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure). > > The code that broke it is > > commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82 > Author: Vishal Moola (Oracle) <vishal.moola@xxxxxxxxx> > Date: Wed Dec 21 10:08:46 2022 -0800 > > madvise: convert madvise_cold_or_pageout_pte_range() to use folios > This change removes a number of calls to compound_head(), and saves > 1729 bytes of kernel text. > Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@xxxxxxxxx > Signed-off-by: Vishal Moola (Oracle) <vishal.moola@xxxxxxxxx> > Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx> > Cc: SeongJae Park <sj@xxxxxxxxxx> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > > > Ever since v6.3. > > The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(), > conversion. > > > Probably all that is information worth having in the patch description. Thanks a lot for checking this. I will try this patchset to see whether it can restore the behavior (I suppose so from your broken commit info but want to confirm). Regards Yin, Fengwei