Re: [PATCH 00/45] hugetlb pagewalk unification

On 04.07.24 16:30, Peter Xu wrote:
Hey, David,


Hi!

On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
There are roughly two categories of page table walkers we have:

1) We actually only want to walk present folios (to be precise, page
    ranges of folios). We should look into moving away from the page
    walker API where possible, and have something better that
    directly gives us the folio (page ranges). Any PTE batching would be
    done internally.

2) We want to deal with non-present folios as well (swp entries and all
    kinds of other stuff). We should maybe implement our custom page
    table walker and move away from walk_page_range(). We are not walking
    "pages" after all but everything else included :)

Then, there is a subset of 1) where we only want to walk to a single address
(a single folio). I'm working on that right now to get rid of follow_page()
and some (IIRC 3: KSM and DAMON) walk_page_range() users. Hugetlb will still
remain a bit special, but I'm afraid we cannot hide that completely.

Maybe you are talking about the generic concept of "page table walker", not
walk_page_range() explicitly?

I'd agree if it's about the generic concept. For example, follow_page()
definitely is tailored for getting the page/folio.  But just to mention
Oscar's series is only working on the page_walk API itself.  What I see so
far is most of the walk_page API users aren't described above - most of
them do not fall into category 1) at all, if any. And they either need to
fetch something from the pgtable where having the folio isn't enough, or
modify the pgtable for different reasons.

Right, but having 1) does not imply that we won't have access to the page table entry in an abstracted form; the folio is simply the primary source of information that these users care about. 2) is an extension of 1) that also walks and exposes all (or most) other page table entries in some form, which is certainly harder to get right.
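
To make category 1) concrete, here is a minimal user-space sketch, not kernel code. It assumes a flat array of toy PTEs; every toy_* name is hypothetical and none of them is an existing kernel API. The walker skips non-present entries, batches consecutive entries of the same folio internally, and hands the callback the folio range together with the first entry, so the caller keeps abstracted access to the page table entry:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page table entry; NOT the kernel's pte_t. */
struct toy_pte {
	bool present;	/* entry maps a page of some folio */
	int folio_id;	/* owning folio, only meaningful if present */
};

/* Hypothetical hook: called once per contiguous page range of one folio. */
typedef void (*toy_folio_range_fn)(int folio_id, size_t first_idx,
				   size_t nr_pages,
				   const struct toy_pte *first_pte,
				   void *priv);

/*
 * Walk only present entries and batch consecutive entries that belong to
 * the same folio. Returns the number of callback invocations.
 */
static size_t toy_walk_present_folios(const struct toy_pte *ptes, size_t nr,
				      toy_folio_range_fn fn, void *priv)
{
	size_t i = 0, calls = 0;

	assert(fn != NULL);

	while (i < nr) {
		size_t start;

		if (!ptes[i].present) {
			/* Category 1: non-present entries are not reported. */
			i++;
			continue;
		}

		/* Batching happens here, hidden from the caller. */
		start = i;
		while (i < nr && ptes[i].present &&
		       ptes[i].folio_id == ptes[start].folio_id)
			i++;

		fn(ptes[start].folio_id, start, i - start, &ptes[start], priv);
		calls++;
	}
	return calls;
}

/* Example hook: add up the number of present pages seen. */
static void toy_count_pages(int folio_id, size_t first_idx, size_t nr_pages,
			    const struct toy_pte *first_pte, void *priv)
{
	(void)folio_id;
	(void)first_idx;
	(void)first_pte;
	*(size_t *)priv += nr_pages;
}
```

A caller would pass toy_count_pages and a pointer to a size_t counter. A category 2) walker would additionally have to report the entries that are skipped here.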

Taking a look at some examples:

* madvise_cold_or_pageout_pte_range() only cares about present folios.
* madvise_free_pte_range() only cares about present folios.
* break_ksm_ops() only cares about present folios.
* mlock_walk_ops() only cares about present folios.
* damon_mkold_ops() only cares about present folios.
* damon_young_ops() only cares about present folios.

There are certainly other page_walk API users that are more involved and need to do way more magic, which fall into category 2). In particular, things like swapin_walk_ops(), hmm_walk_ops() and most of fs/proc/task_mmu.c. Likely there are plenty of them.


Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there even is left in using walk_page_range() :)


A generic pgtable walker still looks wanted at some point, but it can be
too involved to be introduced together with this "remove hugetlb_entry"
effort.

My thinking was whether "remove hugetlb_entry" could wait for "remove page_walk", assuming we find a reasonably better way to do it and convert the individual users. Maybe it can't.

I've not given up hope that we can end up with something better and clearer than the current page_walk API :)


To me, that future work is not yet about "get the folio, ignore the
pgtable", but about how to abstract different layers of pgtables, so the
caller may get a generic concept of "one pgtable entry" with the level/size
information attached, and process it at a single place / hook, and perhaps
hopefully even work with a device pgtable, as long as it's a radix tree.

To me 2) is an extension of 1). My thinking is that we can start with 1) without having to care about all details of 2). If we have to make it so generic that we can walk any page table layout out there in this world, I'm not so sure.
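
For reference, the "one pgtable entry with the level/size information attached" idea quoted above could look roughly like the following user-space sketch. It assumes a toy two-level radix tree with 4 entries per table and 4 KiB base pages; all toy_* names are hypothetical and do not correspond to any kernel interface. Every non-table entry, at whichever level it sits, reaches the same single hook:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTRS_PER_TABLE	4
#define TOY_PAGE_SIZE		4096UL

/* Toy levels; a real layout would have more of them. */
enum toy_level { TOY_LEVEL_PTE = 0, TOY_LEVEL_PMD = 1 };
enum toy_kind { TOY_NONE, TOY_LEAF, TOY_TABLE };

struct toy_entry {
	enum toy_kind kind;
	struct toy_entry *table;	/* next level, only if kind == TOY_TABLE */
};

/* Hypothetical single hook: one entry plus its level and covered size. */
typedef void (*toy_entry_fn)(unsigned long addr, enum toy_level level,
			     unsigned long size, const struct toy_entry *e,
			     void *priv);

/* Bytes covered by one entry at the given level. */
static unsigned long toy_level_size(enum toy_level level)
{
	unsigned long size = TOY_PAGE_SIZE;

	for (int i = 0; i < (int)level; i++)
		size *= TOY_PTRS_PER_TABLE;
	return size;
}

/* Recurse into tables; report leaf and empty entries through the hook. */
static void toy_walk(const struct toy_entry *table, enum toy_level level,
		     unsigned long base, toy_entry_fn fn, void *priv)
{
	unsigned long size = toy_level_size(level);

	assert(table != NULL && fn != NULL);

	for (size_t i = 0; i < TOY_PTRS_PER_TABLE; i++) {
		const struct toy_entry *e = &table[i];
		unsigned long addr = base + i * size;

		if (e->kind == TOY_TABLE) {
			/* A table at the lowest level is malformed: ignore it. */
			if (level > TOY_LEVEL_PTE)
				toy_walk(e->table, (enum toy_level)(level - 1),
					 addr, fn, priv);
		} else {
			fn(addr, level, size, e, priv);
		}
	}
}

struct toy_stats {
	unsigned long mapped_bytes;
	unsigned int leaves;
};

/* Example hook: account mapped bytes regardless of the mapping level. */
static void toy_account(unsigned long addr, enum toy_level level,
			unsigned long size, const struct toy_entry *e,
			void *priv)
{
	struct toy_stats *s = priv;

	(void)addr;
	(void)level;
	if (e->kind == TOY_LEAF) {
		s->mapped_bytes += size;
		s->leaves++;
	}
}
```

The hook never needs to know whether a leaf sits at the PMD or PTE level, because the size is passed in. This toy says nothing about locking, splitting or architecture quirks, which is where the real difficulty lies.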

--
Cheers,

David / dhildenb




