On 11.07.24 19:57, Linus Torvalds wrote:
On Thu, 11 Jul 2024 at 10:09, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
When I was working on this patchset this year with the syscall, this is
similar somewhat to the initial approach I was taking with setting up a
special mapping. It turned into kind of a mess and I couldn't get it
working. There's a lot of functionality built around anonymous pages
that would need to be duplicated (I think?).
Yeah, I was kind of assuming that. You'd need to handle VM_DROPPABLE
in the fault path specially, the way we currently split up based on
vma_is_anonymous(), eg
        if (vma_is_anonymous(vmf->vma))
                return do_anonymous_page(vmf);
        else
                return do_fault(vmf);
in do_pte_missing() etc.
I don't actually think it would be too hard, but it's a more
"conceptual" change, and it's probably not worth it.
Alright, an hour later of fiddling, and it doesn't actually work (yet?)
-- the selftest fails. A diff follows below.
May I suggest a slightly different approach: do what we did for "pte_mkwrite()".
It needed the vma too, for not too dissimilar reasons: special dirty
bit handling for the shadow stack. See
  bb3aadf7d446 ("x86/mm: Start actually marking _PAGE_SAVED_DIRTY")
  b497e52ddb2a ("x86/mm: Teach pte_mkwrite() about stack memory")
and now we have "pte_mkwrite_novma()" with the old semantics for the
legacy cases that didn't get converted - whether it's because the
architecture doesn't have the issue, or because it's a kernel pte.
And the conversion was actually quite pain-free, because we have
    #ifndef pte_mkwrite
    static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
    {
            return pte_mkwrite_novma(pte);
    }
    #endif
so all any architecture that didn't want this needed to do was to
rename their pte_mkwrite() to pte_mkwrite_novma() and they were done.
In fact, that was done first as basically semantically no-op patches:
  2f0584f3f4bd ("mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()")
  6ecc21bb432d ("mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()")
  161e393c0f63 ("mm: Make pte_mkwrite() take a VMA")
which made this all very pain-free (and was largely a sed script, I think).
-           !pte_dirty(pte) && !PageDirty(page))
+           !pte_dirty(pte) && !PageDirty(page) &&
+           !(vma->vm_flags & VM_DROPPABLE))
So instead of this kind of thing, we'd have
-           !pte_dirty(pte) && !PageDirty(page))
+           !pte_dirty(pte, vma) && !PageDirty(page))
and the advantage here is that you can't miss anybody by mistake. The
compiler will be very unhappy if you don't pass in the vma, and then
any places that would be converted to "pte_dirty_novma()"
We don't actually have all that many users of pte_dirty(), so it
doesn't look too nasty. And if we make the pte_dirty() semantics
depend on the vma, I really think we should do it the same way we did
pte_mkwrite().
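
For illustration only, here is a rough userspace sketch of what that split
might look like if pte_dirty() followed the pte_mkwrite() pattern. The toy
pte_t and vm_area_struct types, the TOY_* bit values and the selfcheck
helper are all made up for this example and are not the kernel's
definitions. The sketch also assumes that a droppable VMA is simply
reported as never dirty; the thread does not settle what the vma-aware
variant should actually do.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace stand-ins; these do NOT match the real kernel types. */
typedef struct { unsigned long val; } pte_t;
struct vm_area_struct { unsigned long vm_flags; };

#define TOY_PAGE_DIRTY   0x40UL	/* hypothetical PTE dirty bit */
#define TOY_VM_DROPPABLE 0x1UL	/* hypothetical stand-in for VM_DROPPABLE */

/* Old semantics: look only at the PTE bits, no VMA involved. */
static inline bool pte_dirty_novma(pte_t pte)
{
	return (pte.val & TOY_PAGE_DIRTY) != 0;
}

/*
 * VMA-aware variant. ASSUMPTION for this sketch only: a droppable
 * mapping is never reported as dirty. Everything else falls back to
 * the old PTE-only check.
 */
static inline bool pte_dirty(pte_t pte, const struct vm_area_struct *vma)
{
	if (vma->vm_flags & TOY_VM_DROPPABLE)
		return false;
	return pte_dirty_novma(pte);
}

/* Small sanity check exercising both variants. */
void toy_pte_dirty_selfcheck(void)
{
	pte_t dirty = { TOY_PAGE_DIRTY };
	pte_t clean = { 0 };
	struct vm_area_struct normal = { 0 };
	struct vm_area_struct droppable = { TOY_VM_DROPPABLE };

	assert(pte_dirty(dirty, &normal));
	assert(!pte_dirty(clean, &normal));
	assert(!pte_dirty(dirty, &droppable));
	assert(pte_dirty_novma(dirty));
}
```

Because the two-argument form has a different signature, a caller that
still writes pte_dirty(pte) fails to compile, so every call site has to
choose explicitly between passing the vma and using pte_dirty_novma().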
We also have these folio_mark_dirty() calls, for example in
unpin_user_pages_dirty_lock(). Hm ... so preventing the folio from
getting dirtied is likely shaky.
I guess we need a way to just reliably identify these folios :/.
--
Cheers,
David / dhildenb