On 30.12.24 21:22, Barry Song wrote:
On Tue, Dec 31, 2024 at 8:32 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
goto discard;
I agree that this is necessary, but I'm not sure it addresses my
concerns. MADV_FREE'ed mTHPs are still being added to `deferred_split`,
and this does not resolve the issue of them being partially unmapped
though it is definitely better than the existing code, at least folios are
not moved back to swap-backed.
> > On the other hand, users might rely on the `deferred_split` counter to
assess how aggressively userspace is performing address/size unaligned
operations
like MADV_DONTNEED or unmapped behavior. However, our debugging shows
that the majority of `deferred_split` counter increments result from
aligned MADV_FREE operations. This diminishes the counter's usefulness
in reflecting unaligned userspace behavior.
Optimizing that is certainly something to look into, but the bigger
issue you describe arises from bad handling of speculative references.
Just imagine you indeed have a partially-mapped anon folio and the
remaining pages are MADV_FREE'ed. The problem with the speculative
reference would still apply.
If possible, I am still looking for some approach to entirely avoid
adding the folio to deferred_split and partially being unmapped.
Could the concept be something like this?
Very likely it's wrong, because you really have to assure that that
folio range is mapped here.
Proper folio PTE batching should be applied here -- folio_pte_batch() etc.
I agree that using `folio_pte_batch()` to check if all PTEs are mapped and
determining `any_dirty` for setting swap-backed is the right direction. I'm
just curious if `(!list_empty(&folio->_deferred_list))` or
`folio_test_partially_mapped(folio)` could replace it if we're aiming for
a smaller change :-)
Likely we should just do it cleanly using PTE batching without any such
optimizations. All unmapping (not just MADV_FREE) will benefit from PTE
batching :)
That can please the counters in many, but not all cases. Again, maybe
the deferred-split handling should be handled differently, and not
synchronously from rmap code.
I see 3 different work items
1) Fix mis-handling of speculative references
Agreed, the patch you're sending is absolutely necessary. I'd prefer
it lands sooner in some way. Would you like to post it?
I'm out until the 7th. I'll be happy if you could clean it up, test it
and send it out (Suggested-by: is sufficient).
2) Perform proper PTE batching during unmap/migration. Will improve
performance in any case.
Agreed. I remember discussing this with Ryan in an email thread about
a year ago, even for normal (non-MADV_FREE'ed) folios, but it seems
everyone has been busy with other priorities.
Unfortunately, yes.
This seems like a good time to start exploring the idea. We could begin
with MADV_FREE'ed folios and later extend it to normal folios—for
instance, by implementing batched setting of swap entries.
3) Try moving deferred-split handling out of rmap code into reclaim/
access-bit handling.
I'm not quite sure we still need this after having 1 and 2. With those,
we've been able to operate on the mTHP as a whole. Do we still
need to move deferred_split out of rmap?
I have various things in mind, like having mremap'ed parts of the folio
(cannot batch), user space zapping the folio in smaller chunks (cannot
batch), and ... removing folio->_nr_pages_mapped + page->_mapcount,
where we might not always detect "partially mapped" from rmap code and
want to detect that separately either way.
Ideally, we'd just move deferred-split handling out of rmap code, and
trigger that detection+handling from reclaim code. After all, the only
purpose of deferred-split is .. memory reclaim.
Anyhow, 1) and 2) are beneficial independent of 3).
--
Cheers,
David / dhildenb