On 05.02.25 10:33, Oscar Salvador wrote:
On Tue, Feb 04, 2025 at 09:40:16PM +0100, David Hildenbrand wrote:
Unfortunately not that easy. You can only have parts of a large folio
mapped. It would be PAGE_SIZE * nr_ptes when batching.
I see, you are right.
I think this is the wrong approach. We should replace that pagewalk API
usage by something better that hides all the batching.
The interface would look similar to folio_walk (no callbacks, handling of
locking), but (a) work on ranges; (b) work also on non-folio entries; and
(c) batch all suitable entries in the range.
Something like a pt_range_walk_start() that returns a "type" (folio range,
migration entries, swap range, ...) + stores other details (range, level,
ptep, ...) in a structure like folio_walk, to then provide mechanisms to
continue (pt_walk_continue()) to walk or abort (pt_walk_done()) it. Similar
to page_vma_mapped_walk(), but not specific to a given page/folio.
Then, we would simply process the output of that. With the hope that, for
hugetlb it will just batch all cont-pte / cont-pmd entries into a single
return value.
That will make the R/O walking as in task_mmu.c easier, hopefully.
Not so much with PTE/PMD modifications, like damon_mkold_ops ... :( But
maybe that just has to be special-cased for hugetlb, somehow ...
Ok, let me see if we are on the same page.
You are basically saying that we should replace the existing pagewalk API with
something similar to what you described above.
Yes, something that adds a better abstraction and walks ranges (e.g.,
folio range, swap range, ..) moving away from the per-pte callbacks etc.
I have to confess that when you first mentioned this back in July when I
posted the RFC, I felt dishearted, because it implies an even bigger
surgery.
Yeah. But it will be worth it in the long run (at least I hope so ;) )
But having felt the mess that dealing with cont-{pmd,pud}s, and the
inability of the existing API to do that in a clean way (without having
to teach each and every function about that if needed), maybe it is the
only way to do this 1) right and 2) clean.
Jup
I thought that maybe we can get away and to the batching somehow before
calling in the callbacks e.g: at walk_{pud,pmd,pte}_range level, but I
am not sure whether 1) that is possible and 2) how ugly it would look.
Exactly.
So, given that this is not a really urgent matter, something that needs
to be fixed asap, maybe the way to go is to create an API that can deal
with all that, abstracting all these details.
Yes, it's been on my todo list for longer (after I completed
folio_walk), but I didn't get to it yet.
I am willing to take a shot on this, if we are clear that it makes sense
to pursue this road.
Yes, feel free to use what I propose as a starting point. If I get to
prototype something in the meantime, I'll send it to you. (staring at my
inbox, I'm not so sure I'll get to it shortly :) )
--
Cheers,
David / dhildenb