On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
> On 04.07.24 16:30, Peter Xu wrote:
> > Hey, David,
> >
>
> Hi!
>
> > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> > > There are roughly two categories of page table walkers we have:
> > >
> > > 1) We actually only want to walk present folios (to be precise, page
> > >    ranges of folios). We should look into moving away from the
> > >    page walker API where possible, and have something better that
> > >    directly gives us the folio (page ranges). Any PTE batching would be
> > >    done internally.
> > >
> > > 2) We want to deal with non-present folios as well (swp entries and all
> > >    kinds of other stuff). We should maybe implement our custom page
> > >    table walker and move away from walk_page_range(). We are not walking
> > >    "pages" after all but everything else included :)
> > >
> > > Then, there is a subset of 1) where we only want to walk to a single address
> > > (a single folio). I'm working on that right now to get rid of follow_page()
> > > and some (IIRC 3: KSM and DAMON) walk_page_range() users. Hugetlb will still
> > > remain a bit special, but I'm afraid we cannot hide that completely.
> >
> > Maybe you are talking about the generic concept of "page table walker", not
> > walk_page_range() explicitly?
> >
> > I'd agree if it's about the generic concept. For example, follow_page()
> > definitely is tailored for getting the page/folio. But just to mention
> > Oscar's series is only working on the page_walk API itself. What I see so
> > far is most of the walk_page API users aren't described above - most of
> > them do not fall into category 1) at all, if any. And they either need to
> > fetch something from the pgtable where having the folio isn't enough, or
> > modify the pgtable for different reasons.
>
> Right, but having 1) does not imply that we won't be having access to the
> page table entry in an abstracted form, the folio is simply the primary
> source of information that these users care about. 2) is an extension of 1),
> but walking+exposing all (or most) other page table entries as well in some
> form, which is certainly harder to get right.
>
> Taking a look at some examples:
>
> * madvise_cold_or_pageout_pte_range() only cares about present folios.
> * madvise_free_pte_range() only cares about present folios.
> * break_ksm_ops() only cares about present folios.
> * mlock_walk_ops() only cares about present folios.
> * damon_mkold_ops() only cares about present folios.
> * damon_young_ops() only cares about present folios.
>
> There are certainly other page_walk API users that are more involved and
> need to do way more magic, which fall into category 2). In particular things
> like swapin_walk_ops(), hmm_walk_ops() and most fs/proc/task_mmu.c. Likely
> there are plenty of them.
>
>
> Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there
> even is left in using walk_page_range() :)

Hmm, I need to confess from a quick look I didn't yet see why the current
page_walk API won't work under p4d there.. it could be that I missed some
details.

>
> >
> > A generic pgtable walker looks still wanted at some point, but it can be
> > too involved to be introduced together with this "remove hugetlb_entry"
> > effort.
>
> My thinking was if "remove hugetlb_entry" cannot wait for "remove
> page_walk", because we found a reasonable way to do it better and convert
> the individual users. Maybe it can't.
>
> I've not given up hope that we can end up with something better and clearer
> than the current page_walk API :)

Oh so you meant you have a plan to rewrite some of the page_walk API users
to use the new API you plan to propose?

It looks fine by me.
I assume anything new will already take hugetlb folios into account, so
it'll "just work" and actually reduce the number of patches here, am I
right?

If it still needs time to land, I think it's also fine that it's done on
top of Oscar's. So it may boil down to the schedule in that case, and we
may also want to know how Oscar sees this.

>
> >
> > To me, that future work is not yet about "get the folio, ignore the
> > pgtable", but about how to abstract different layers of pgtables, so the
> > caller may get a generic concept of "one pgtable entry" with the level/size
> > information attached, and process it at a single place / hook, and perhaps
> > hopefully even work with a device pgtable, as long as it's a radix tree.
>
> To me 2) is an extension of 1). My thinking is that we can start with 1)
> without having to care about all details of 2). If we have to make it so
> generic that we can walk any page table layout out there in this world, I'm
> not so sure.

I still see a hope there, after all the radix pgtable is indeed a common
abstraction and it looks to me a lot of things share that structure. IIUC
one challenge of it is being fast. So.. I don't know. But I'll be more
than happy to see it come if someone can work it out, and it just sounds
very nice too if some chunk of code can be run the same for mm/, kvm/ and
iommu/.

Thanks,

--
Peter Xu