Re: Folio mapcount

Zi Yan <ziy@xxxxxxxxxx> · Fri, 30 Jun 2023 21:17:19 -0400

On 29 Mar 2023, at 10:02, Yin, Fengwei wrote:

> Hi Matthew,
>
> On 2/9/2023 3:54 AM, Matthew Wilcox wrote:
>> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>>
>>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>>
>>>>>> Once we get to the part of the folio journey where we have
>>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>>
>>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>>> definition is the number of (vma, page table) tuples which have a
>>>>>> reference to any pages in this folio.
>>>>>
>>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>>> anything at page level.
>>>>
>>>> A fast path for what?  I don't understand your vision; can you spell it
>>>> out for me?  My current proposal is here:
>>>
>>> A fast code path for only handling folios as a whole. For cases that
>>> subpages are mapped from a folio, traversing through subpages might be
>>> needed and will be slow. A code separation might be cleaner and makes
>>> folio as a whole handling quicker.
>>
>> To be clear, in this proposal, there is no subpage mapcount.  I've got
>> my eye on one struct folio per allocation, so there will be no more
>> tail pages.  The proposal has one mapcount, and that's it.  I'd be
>> open to saying "OK, we need two mapcounts", but not to anything that
>> needs to scale per number of pages in the folio.
>>
>>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>>> should be the responsibility of rmap. We could add a counter to rmap
>>> instead. It seems that you are mixing page table mapping with virtual
>>> address space (VMA) mapping together.
>>
>> rmap tells you how many VMAs cover this folio.  It doesn't tell you
>> how many of those VMAs have actually got any pages from it mapped.
>> It's also rather slower than a simple atomic_read(), so I think
>> you'll have an uphill battle trying to convince people to use rmap
>> for this purpose.
>>
>> I'm not sure what you mean by "add a counter to rmap"?  One count
>> per mapped page in the vma?
>>
>>>>
>>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@xxxxxxxxxxxxxxxxxxxx/
>>>>
>>>> The three questions we need to be able to answer (in my current
>>>> understanding) are laid out here:
>>>>
>>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@xxxxxxxxxxxxxxxxxxxx/
>>>
>>> I think we probably need to clarify the definition of "map" in your
>>> questions. Does it mean mapped by page tables or VMAs? When a page
>>> is mapped into a VMA, it can be mapped by one or more page table entries,
>>> but not the other way around, right? Or is shared page table entry merged
>>> now so that more than one VMAs can use a single page table entry to map
>>> a folio?
>>
>> Mapped by page tables, just like today.  It'd be quite the change to
>> figure out the mapcount of a page newly brought into the page cache;
>> we'd have to do an rmap walk to see how many mapcounts to give it.
>> I don't think this is a great idea.
>>
>> As far as I know, shared page tables are only supported by hugetlbfs,
>> and I prefer to stick cheese in my ears and pretend they don't exist.
>>
>> To be absolutely concrete about this, my proposal is:
>>
>> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
>> that cover it)
>> When we take a page fault on one of the pages in it, its mapcount
>> increases from 0 to 1.
>> When we take another page fault on a page in it, we do a pvmw to
>> determine if any pages from this folio are already mapped by this VMA;
>> we see that there is one and we do not increment the mapcount.
>> We partially munmap() so that we need to unmap one of the pages.
>> We remove it from the page tables and call page_remove_rmap().
>> That does another pvmw and sees there's still a page in this folio
>> mapped by this VMA, and so does not decrement the mapcount.
>> We truncate() the file smaller than the position of the folio, which
>> causes us to unmap the rest of the folio.  The pvmw walk detects no
>> more pages from this folio mapped and we decrement the mapcount.
>>
>> Clear enough?
>
> I thought about this proposal for some time and would like to give it
> a try.
>
> I ran a test comparing getting the mapcount with a pvmw walk vs. a
> folio_mapcount() call, like:
> 1.
>   while (page_vma_mapped_walk(&pvmw)) {
>           mapcount++;
>   }
>
> 2.
>   mapcount = folio_mapcount(folio);
>
> The pvmw walk is 3X slower than the folio_mapcount() call on an Ice Lake
> platform.
>
>
> I also noticed the following things when reading the related code:
> 1. If the entire folio is mapped to the VMA, it's not necessary to do
>    a pvmw walk. We can just increase the mapcount (or decrease it if
>    the folio is unmapped from the VMA).
>
> 2. The folio refcount update needs to be changed to match the mapcount
>    change. Otherwise, the #3 question in
>   https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@xxxxxxxxxxxxxxxxxxxx/
>    can't be answered.
>
> 3. The meaning of lruvec stat of NR_FILE_MAPPED will be changed as
>    we don't track each page mapcount. This info is exposed to user space
>    through meminfo interface.
>
> 4. The new mapcount represents how many VMAs the folio is mapped into.
>    So during split_vma/merge_vma operations, we need to update the
>    mapcount if the split/merge happens in the middle of a folio.
>
>    Consider following case:
>    A large folio with two cow pages in the middle of it.
>    |-----------------VMA---------------------------|
>        |---folio--|cow page1|cow page2|---folio|
>
>    And the split_vma happens between cow page1/page2
>    |----------VMA1----------| |-----------VMA2-----|
>        |---folio--|cow page1| |cow page2|---folio|
>                              | split_vma here
>
>    How do we detect that we should update the folio mapcount in this case?
>    Or am I just worrying about something that cannot happen?

I also studied mapcount and tried to use a single mapcount instead of
the various existing mapcounts. My conclusion is that from the kernel's
perspective, a single mapcount is enough, but we will need per-page
mapcount and entire_mapcount for userspace stats, NR_{ANON,FILE}_MAPPED,
and NR_ANON_THPS.

In the kernel, almost all code only cares about: 1) whether a page/folio
has extra pins, by checking whether refcount is equal to mapcount + extra,
and 2) whether a page/folio is mapped multiple times. A single mapcount
can meet these two needs.
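
As a rough sketch of those two checks against a single per-folio
mapcount (a userspace toy model, not kernel code; struct toy_folio,
toy_has_extra_pins(), toy_mapped_multiple() and the "extra" parameter
are made-up names, and I assume each unit of mapcount holds exactly one
reference):

```c
#include <stdbool.h>

/* Toy userspace model of a folio; not the kernel's struct folio. */
struct toy_folio {
	int refcount;	/* all references, including those held by mappings */
	int mapcount;	/* the single per-folio mapcount */
};

/*
 * Need 1: are there pins beyond the ones we can account for?
 * 'extra' stands for the expected non-mapping references
 * (for example the page cache or the caller's own reference).
 */
static bool toy_has_extra_pins(const struct toy_folio *f, int extra)
{
	return f->refcount > f->mapcount + extra;
}

/* Need 2: is the folio mapped more than once? */
static bool toy_mapped_multiple(const struct toy_folio *f)
{
	return f->mapcount > 1;
}
```

Neither helper looks at anything per page, which is why one counter is
enough for these in-kernel users.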

But for userspace, to maintain the accuracy of NR_{ANON,FILE}_MAPPED
and NR_ANON_THPS, the kernel needs to know when the corresponding mapcount
goes from 0 to 1 (increase the counter) and from 1 to 0 (decrease the
counter). NR_{ANON,FILE}_MAPPED is increased when a page is first mapped,
either by a PTE or by being covered by a PMD, and decreased when a page
loses its last mapping from a PTE or PMD. This means that without per-page
mapcount and entire_mapcount, we cannot get them right. For NR_ANON_THPS,
entire_mapcount is needed. A single mapcount is a mix of per-page mapcount
and entire_mapcount, and the kernel is not able to recover the necessary
information for NR_* from it.
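
To illustrate the 0 <-> 1 transitions, here is a minimal userspace
sketch (not the kernel's rmap code; struct toy_stat_folio, the toy_*
helpers, TOY_NR_PAGES and toy_nr_mapped are all made-up names, with
toy_nr_mapped standing in for NR_{ANON,FILE}_MAPPED):

```c
#include <assert.h>

#define TOY_NR_PAGES 4	/* pages per toy folio; arbitrary */

struct toy_stat_folio {
	int page_mapcount[TOY_NR_PAGES];	/* PTE mappings of each page */
	int entire_mapcount;			/* PMD mappings of the whole folio */
};

static long toy_nr_mapped;	/* stands in for NR_{ANON,FILE}_MAPPED */

static int toy_page_is_mapped(const struct toy_stat_folio *f, int i)
{
	return f->page_mapcount[i] > 0 || f->entire_mapcount > 0;
}

/* Map page i by PTE; bump the stat only on its unmapped -> mapped edge. */
static void toy_map_pte(struct toy_stat_folio *f, int i)
{
	if (!toy_page_is_mapped(f, i))
		toy_nr_mapped++;
	f->page_mapcount[i]++;
}

/* Unmap page i by PTE; drop the stat only when its last mapping goes. */
static void toy_unmap_pte(struct toy_stat_folio *f, int i)
{
	assert(f->page_mapcount[i] > 0);
	f->page_mapcount[i]--;
	if (!toy_page_is_mapped(f, i))
		toy_nr_mapped--;
}

/* Map the whole folio by PMD; every page not yet mapped becomes mapped. */
static void toy_map_pmd(struct toy_stat_folio *f)
{
	for (int i = 0; i < TOY_NR_PAGES; i++)
		if (!toy_page_is_mapped(f, i))
			toy_nr_mapped++;
	f->entire_mapcount++;
}

/* Drop one PMD mapping; pages left with no PTE mapping become unmapped. */
static void toy_unmap_pmd(struct toy_stat_folio *f)
{
	assert(f->entire_mapcount > 0);
	f->entire_mapcount--;
	for (int i = 0; i < TOY_NR_PAGES; i++)
		if (!toy_page_is_mapped(f, i))
			toy_nr_mapped--;
}
```

In this model, one page mapped twice by PTE and two pages mapped once
each would both give a single mapcount of 2, but they contribute 1 and
2 to the stat respectively. That is the information a single mapcount
cannot give back.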

I wonder if userspace can live without these stats or with different
counters. NR_ANON_MAPPED is "AnonPages", NR_FILE_MAPPED is "Mapped" or
"file_mapped", and NR_ANON_THPS is "AnonHugePages" or "anon_thp". Can we
instead just count anonymous pages and file pages regardless of whether
they are mapped? Does userspace really want to know the number of mapped
pages? If that change can be made, we can probably have a single mapcount.

BTW, I am not sure pvmw would work to check per-page or entire mapcounts,
since that means for every rmap removal, pvmw is needed to decide
whether to decrease NR_* counters. That seems to be expensive.
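
A toy model of that cost, assuming the only way to learn whether a page
is still mapped is to look at every VMA that could map it (userspace
sketch only; struct toy_rmap, the toy_* helpers, TOY_NR_VMAS and
toy_walk_steps are made-up names, and the loop merely stands in for an
rmap walk plus pvmw):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_PAGES 4	/* pages per toy folio; arbitrary */
#define TOY_NR_VMAS  3	/* VMAs covering the folio; arbitrary */

/* Which (VMA, page) pairs currently have a PTE. */
struct toy_rmap {
	bool pte_present[TOY_NR_VMAS][TOY_NR_PAGES];
};

static long toy_nr_mapped;	/* stands in for an NR_*_MAPPED counter */
static long toy_walk_steps;	/* crude cost counter for the walks */

/* Stand-in for rmap walk + pvmw: is page i mapped by any VMA? */
static bool toy_page_mapped_anywhere(const struct toy_rmap *r, int i)
{
	for (int v = 0; v < TOY_NR_VMAS; v++) {
		toy_walk_steps++;
		if (r->pte_present[v][i])
			return true;
	}
	return false;
}

/* Without a per-page mapcount, every add needs a walk first. */
static void toy_add_rmap(struct toy_rmap *r, int v, int i)
{
	if (!toy_page_mapped_anywhere(r, i))
		toy_nr_mapped++;
	r->pte_present[v][i] = true;
}

/* Likewise every removal needs a walk to see if it was the last one. */
static void toy_remove_rmap(struct toy_rmap *r, int v, int i)
{
	assert(r->pte_present[v][i]);
	r->pte_present[v][i] = false;
	if (!toy_page_mapped_anywhere(r, i))
		toy_nr_mapped--;
}
```

Each add or removal here costs up to TOY_NR_VMAS steps, where a per-page
mapcount would need a single atomic update.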

Let me know if I missed anything. Thanks.

--
Best Regards,
Yan, Zi