Re: Folio mapcount

Hi Matthew,

On 2/9/2023 3:54 AM, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>
>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>
>>>>> Once we get to the part of the folio journey where we have
>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>
>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>> definition is the number of (vma, page table) tuples which have a
>>>>> reference to any pages in this folio.
>>>>
>>>> How about having two: full_folio_mapcount and partial_folio_mapcount?
>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>> anything at the page level.
>>>
>>> A fast path for what?  I don't understand your vision; can you spell it
>>> out for me?  My current proposal is here:
>>
>> A fast code path for handling folios only as a whole. For cases where
>> subpages of a folio are mapped individually, traversing the subpages
>> might be needed and will be slow. Separating the two code paths might
>> be cleaner and would make whole-folio handling quicker.
> 
> To be clear, in this proposal, there is no subpage mapcount.  I've got
> my eye on one struct folio per allocation, so there will be no more
> tail pages.  The proposal has one mapcount, and that's it.  I'd be
> open to saying "OK, we need two mapcounts", but not to anything that
> needs to scale per number of pages in the folio.
> 
>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>> should be the responsibility of rmap. We could add a counter to rmap
>> instead. It seems that you are mixing page table mappings with virtual
>> address space (VMA) mappings.
> 
> rmap tells you how many VMAs cover this folio.  It doesn't tell you
> how many of those VMAs have actually got any pages from it mapped.
> It's also rather slower than a simple atomic_read(), so I think
> you'll have an uphill battle trying to convince people to use rmap
> for this purpose.
> 
> I'm not sure what you mean by "add a counter to rmap"?  One count
> per mapped page in the vma?
> 
>>>
>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@xxxxxxxxxxxxxxxxxxxx/
>>>
>>> The three questions we need to be able to answer (in my current
>>> understanding) are laid out here:
>>>
>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@xxxxxxxxxxxxxxxxxxxx/
>>
>> I think we probably need to clarify the definition of "map" in your
>> questions. Does it mean mapped by page tables or VMAs? When a page
>> is mapped into a VMA, it can be mapped by one or more page table entries,
>> but not the other way around, right? Or has shared page table support
>> been merged, so that more than one VMA can use a single page table
>> entry to map a folio?
> 
> Mapped by page tables, just like today.  It'd be quite the change to
> figure out the mapcount of a page newly brought into the page cache;
> we'd have to do an rmap walk to see how many mapcounts to give it.
> I don't think this is a great idea.
> 
> As far as I know, shared page tables are only supported by hugetlbfs,
> and I prefer to stick cheese in my ears and pretend they don't exist.
> 
> To be absolutely concrete about this, my proposal is:
> 
> - A folio brought into the page cache has mapcount 0 (whether or not
>   there are any VMAs that cover it).
> - When we take a page fault on one of the pages in it, its mapcount
>   increases from 0 to 1.
> - When we take another page fault on a page in it, we do a pvmw to
>   determine whether any pages from this folio are already mapped by
>   this VMA; we see that there is one, so we do not increment the
>   mapcount.
> - We partially munmap() so that we need to unmap one of the pages.
>   We remove it from the page tables and call page_remove_rmap().
>   That does another pvmw, sees that there is still a page in this
>   folio mapped by this VMA, and does not decrement the mapcount.
> - We truncate() the file to below the position of the folio, which
>   causes us to unmap the rest of the folio. The pvmw walk detects no
>   more pages from this folio mapped, and we decrement the mapcount.
> 
> Clear enough?

I thought about this proposal for some time and would like to give it
a try.
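
To make sure I read the sequence right, the add/remove paths could look
roughly like this. This is only a sketch: _vma_mapcount and the two
helper names are hypothetical stand-ins for the proposed per-folio count
of (vma, page table) tuples, though the pvmw calls are the real API:

  /* Called after a new PTE for this folio is installed in @vma. */
  static void folio_add_vma_mapcount(struct folio *folio,
                  struct vm_area_struct *vma, unsigned long address)
  {
          DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
          int nr = 0;

          /* Count the PTEs in this VMA that map pages of the folio. */
          while (page_vma_mapped_walk(&pvmw))
                  nr++;

          /*
           * Only the PTE we just installed: this VMA maps a page of
           * the folio for the first time, so the tuple count goes up.
           */
          if (nr == 1)
                  atomic_inc(&folio->_vma_mapcount);
  }

  /* Called after a PTE for this folio is cleared in @vma. */
  static void folio_remove_vma_mapcount(struct folio *folio,
                  struct vm_area_struct *vma, unsigned long address)
  {
          DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

          if (page_vma_mapped_walk(&pvmw))
                  /* Some page is still mapped; release the PTL. */
                  page_vma_mapped_walk_done(&pvmw);
          else
                  atomic_dec(&folio->_vma_mapcount);
  }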

I did a test comparing getting the mapcount via a pvmw walk vs. a
folio_mapcount() call:
1.
  DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

  while (page_vma_mapped_walk(&pvmw))
          mapcount++;

2.
  mapcount = folio_mapcount(folio);

The pvmw walk is 3x slower than the folio_mapcount() call on an Ice
Lake platform.


I also noticed the following things when reading the related code:
1. If the entire folio is mapped into the VMA, there is no need to do a
   pvmw walk. We can just increase the mapcount (or decrease it when the
   folio is unmapped from the VMA); see the sketch after this list.

2. The folio refcount updates need to be changed to match the mapcount
   changes. Otherwise, question #3 in
   https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@xxxxxxxxxxxxxxxxxxxx/
   can't be answered.

3. The meaning of the NR_FILE_MAPPED lruvec stat will change, since we
   no longer track a per-page mapcount. This info is exposed to user
   space through the meminfo interface.

4. The new mapcount represents how many VMAs the folio is mapped into.
   So during a split_vma/merge_vma operation, we need to update the
   mapcount if the split/merge happens in the middle of a folio.

   Consider the following case:
   A large folio with two CoW pages in the middle of it.
   |-----------------VMA---------------------------|
       |---folio--|cow page1|cow page2|---folio|

   And split_vma() happens between cow page1 and cow page2:
   |----------VMA1----------| |-----------VMA2-----|
       |---folio--|cow page1| |cow page2|---folio|
                             | split_vma here

   How do we detect that we should update the folio mapcount in this
   case? Or am I worrying about something that can't actually happen?
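
For point 1, the fast path could look something like this (again just a
sketch: folio_add_rmap_range() is a made-up entry point, and
folio_add_vma_mapcount() / _vma_mapcount are the hypothetical names from
the sketch above):

  void folio_add_rmap_range(struct folio *folio,
                  struct vm_area_struct *vma, unsigned long address,
                  unsigned int nr_pages)
  {
          /*
           * The whole folio is mapped in one go (e.g. a PMD-level
           * mapping): this VMA should not already have other pages
           * of the folio mapped, so skip the pvmw walk entirely.
           */
          if (nr_pages == folio_nr_pages(folio)) {
                  atomic_inc(&folio->_vma_mapcount);
                  return;
          }

          /* Partial map: fall back to the pvmw check. */
          folio_add_vma_mapcount(folio, vma, address);
  }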


Regards
Yin, Fengwei



