Re: Prerequisites for Large Anon Folios

On 7/20/2023 5:41 PM, Ryan Roberts wrote:
> Hi All,
> 
> As discussed at Matthew's call yesterday evening, I've put together a list of
> items that need to be done as prerequisites for merging large anonymous folios
> support.
> 
> It would be great to get some review and confirmation as to whether anything is
> missing or incorrect. Most items have an assignee - in that case it would be
> good to check that my understanding that you are working on the item is correct.
> 
> I think most things are independent, with the exception of "shared vs exclusive
> mappings", which I think becomes a dependency for a couple of things (marked in
> depender description); again would be good to confirm.
> 
> Finally, although I'm concentrating on the prerequisites to clear the path for
> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
> item ("large folios in swap cache"), solely because we explicitly discussed it
> last night. My view is that enhancements can come after the initial large anon
> folios merge. Over time, I plan to add other enhancements (e.g. retain large
> folios over COW, etc).
> 
> I'm posting the table as yaml as that seemed easiest for email. You can convert
> to csv with something like this in Python:
> 
>   import yaml
>   import pandas as pd
>   pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
> 
> Thanks,
> Ryan
Should we add the mremap case to the list? E.g., how to handle the case where an
mremap happens in the middle of a large anonymous folio and fails to split it.
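As an aside, the same pyyaml load from Ryan's snippet can also be used to filter
the list down to just the prerequisite items; a minimal sketch (the inline
WORK_ITEMS here is a cut-down, illustrative copy of two entries, not the full
work-items.yml):

```python
import yaml

# Cut-down, illustrative copy of the work-items list from this thread; the
# real work-items.yml entries also carry description, links, location and
# assignee fields.
WORK_ITEMS = """
- item: mlock
  priority: prerequisite
- item: large folios in swap cache
  priority: enhancement
"""

items = yaml.safe_load(WORK_ITEMS)

# Keep only the items gating the initial merge, i.e. priority == prerequisite.
prereqs = [entry['item'] for entry in items if entry['priority'] == 'prerequisite']
print(prereqs)  # ['mlock']
```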


Regards
Yin, Fengwei

> 
> -----
> 
> - item:
>     shared vs exclusive mappings
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     New mechanism to allow us to easily determine precisely whether a given
>     folio is mapped exclusively or shared between multiple processes. Required
>     for (from David H):
> 
>     (1) Detecting shared folios, to not mess with them while they are shared.
>     MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>     replace cases where folio_estimated_sharers() == 1 would currently be the
>     best we can do (and in some cases, page_mapcount() == 1).
> 
>     (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>     fork(), PageAnonExclusive would have been reliable, after fork() it's not.
> 
>     For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>     *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>     "user-triggered page migration" and "khugepaged" not yet captured (would
>     appreciate someone fleshing it out). I previously understood migration to be
>     working for large folios - is "user-triggered page migration" some specific
>     aspect that does not work?
> 
>     For (2), this relates to Large Anon Folio enhancements which I plan to
>     tackle after we get the basic series merged.
> 
>   links:
>     - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
> 
>   location:
>     - shrink_folio_list()
> 
>   assignee:
>     David Hildenbrand <david@xxxxxxxxxx>
> 
> 
> 
> - item:
>     compaction
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     Raised at LSFMM: Compaction skips non-order-0 pages. This is already a
>     problem for page-cache pages today.
> 
>   links:
>     - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@xxxxxxxxxxxxxxxxxxxx/
>     - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@xxxxxxxxxx/
> 
>   location:
>     - compaction_alloc()
> 
>   assignee:
>     Zi Yan <ziy@xxxxxxxxxx>
> 
> 
> 
> - item:
>     mlock
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     Large, pte-mapped folios are ignored when mlock is requested. Code comment
>     for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>     be consistently counted: a pte mapping of the THP head cannot be
>     distinguished by the page alone."
> 
>   location:
>     - mlock_pte_range()
>     - mlock_vma_folio()
> 
>   links:
>     - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@xxxxxxxxx/
> 
>   assignee:
>     Yin, Fengwei <fengwei.yin@xxxxxxxxx>
> 
> 
> 
> - item:
>     madvise
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes a
>     folio is exclusive only if mapcount==1, and otherwise skips the remainder
>     of the operation. But for large, pte-mapped folios, exclusive folios can
>     have a mapcount of up to nr_pages and still be exclusive. Even better:
>     don't split the folio if it fits entirely
>     within the range. Likely depends on "shared vs exclusive mappings".
> 
>   links:
>     - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@xxxxxxxxx/
> 
>   location:
>     - madvise_cold_or_pageout_pte_range()
>     - madvise_free_pte_range()
> 
>   assignee:
>     Yin, Fengwei <fengwei.yin@xxxxxxxxx>
> 
> 
> 
> - item:
>     deferred_split_folio
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     zap_pte_range() will remove each page of a large folio from the rmap, one at
>     a time, causing the rmap code to see the folio as partially mapped and call
>     deferred_split_folio() for it. Then it subsequently becomes fully unmapped
>     and is removed from the queue. This can cause some lock contention. The
>     proposed fix is to modify zap_pte_range() to "batch zap" a whole pte range
>     that corresponds to a folio, to avoid the unnecessary deferred_split_folio()
>     call.
> 
>   links:
>     - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@xxxxxxx/
> 
>   location:
>     - zap_pte_range()
> 
>   assignee:
>     Ryan Roberts <ryan.roberts@xxxxxxx>
> 
> 
> 
> - item:
>     numa balancing
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>     (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>     numabalancing. Let's skip them for now." Likely depends on "shared vs
>     exclusive mappings".
> 
>   links: []
> 
>   location:
>     - do_numa_page()
> 
>   assignee:
>     <none>
> 
> 
> 
> - item:
>     large folios in swap cache
> 
>   priority:
>     enhancement
> 
>   description: >-
>     shrink_folio_list() currently splits large folios to single pages before
>     adding them to the swap cache. It would be preferred to add the large folio
>     as an atomic unit to the swap cache. It is still expected that each page
>     would use a separate swap entry when swapped out. This represents an
>     efficiency improvement. There is a risk that this change will expose code
>     in the swap cache that wrongly assumes any large folio is pmd-mappable.
> 
>   links:
>     - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@xxxxxxxxxxxxxx/
> 
>   location:
>     - shrink_folio_list()
> 
>   assignee:
>     <none>
> 
> -----



