On 21/01/25 3:49 pm, David Hildenbrand wrote:
Hmm that's an interesting idea; If I've understood, we would
effectively test
the PMD for collapse as if we were collapsing to PMD-size, but then do
the
actual collapse to the "highest allowed order" (dictated by what's
enabled +
MADV_HUGEPAGE config).
I'm not so sure this is a good way to go; there would be no way to
support VMAs
(or parts of VMAs) that don't span a full PMD.
In Nicos approach to locking, we temporarily have to remove the PTE
table either way. While holding the mmap lock in write mode, the VMAs
cannot go away, so we could scan the whole PTE table to figure it out.
To just figure out "none" vs. "non-none" vs. "swap PTE", we'd probably
don't need the other VMA information. Figuring out "shared" is trickier,
because we have to obtain the folio and would have to walk the other VMAs.
It's a good question if we would have to VMA-write-lock the other
affected VMAs as well in order to temporarily remove the PTE table that
crosses multiple VMAs, or if we'd need something different (collapse PMD
marker) so the page table walkers could handle that case properly --
keep retrying or fallback to the mmap lock.
I missed this reply, could have saved me some trouble :) When collapsing
for VMAs < PMD, we *will* have to write lock the VMAs, write lock the
anon_vma's, and write lock vma->vm_file->f_mapping for file VMAs,
otherwise someone may fault on another VMA mapping the same PTE table. I
was trying to implement this, but cannot find a clean way: we will have
to implement it like mm_take_all_locks(), with a similar bit like
AS_MM_ALL_LOCKS, because, suppose we need to lock all anon_vma's, then
two VMAs may have the same anon_vma, and we cannot get away with the
following check:
lock only if !rwsem_is_locked(&vma->anon_vma->root->rwsem)
since I need to skip the lock only when it is khugepaged which has taken
the lock.
I guess the way to go about this then is the PMD-marker thingy, which I
am not very familiar with.
And I can imagine we might see
memory bloat; imagine you have 2M=madvise, 64K=always,
max_ptes_none=511, and
let's say we have a 2M (aligned portion of a) VMA that does NOT have
MADV_HUGEPAGE set and has a single page populated. It passes the PMD-
size test,
but we opt to collapse to 64K (since 2M=madvise). So now we end up
with 32x 64K
folios, 31 of which are all zeros. We have spent the same amount of
memory as if
2M=always. Perhaps that's a detail that could be solved by ignoring
fully none
64K blocks when collapsing to 64K...
Yes, that's what I had in mind. No need to collapse where there is
nothing at all ...
Personally, I think your "enforce simplicifation of the tunables for mTHP
collapse" idea is the best we have so far.
Right.
But I'll just push against your pushback of the per-VMA cursor idea
briefly. It
strikes me that this could be useful for khugepaged regardless of mTHP
support.
Not a clear pushback, as you say to me this is a different optimization
and I am missing how it could really solve the problem at hand here.
Note that we're already fighting with not growing VMAs (see the VMA
locking changes under review), but maybe we could still squeeze it in
there without requiring a bigger slab.
Today, it starts scanning a VMA, collapses the first PMD it finds that
meets the
requirements, then switches to scanning another VMA. When it
eventually gets
back to scanning the first VMA, it starts from the beginning again.
Wouldn't a
cursor help reduce the amount of scanning it has to do?
Yes, that whole scanning approach sound weird. I would have assumed that
it might nowdays be smarter to just scan the MM sequentially, and not
jump between VMAs.
Assume you only have a handfull of large VMAs (like in a VMM), you'd end
up scanning the same handful of VMAs over and over again.
I think a lot of the khugepaged codebase is just full with historical
baggage that must be cleaned up and re-validated if it still required ...