On 25.01.23 19:43, Edgecombe, Rick P wrote:
On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
Roughly speaking: if we abstract it that way and get all of the
"how
to
set it writable now?" out of core-MM, it not only is cleaner and
less
error prone, it might even allow other architectures that
implement
something comparable (e.g., using a dedicated HW bit) to actually
reuse
some of that work. Otherwise most of that "shstk" is really just
x86
specific ...
I guess the only cases we have to special case would be page
pinning
code where pte_write() would indicate that the PTE is writable
(well,
it
is, just not by "ordinary CPU instruction" context directly): but
you
do
that already, so ... :)
Sorry for stumbling over that this late, I only started looking
into
this when you CCed me on that one patch.
Sorry for not calling more attention to it earlier. Appreciate your
comments.
Previously versions of this series had changed some of these
pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
vma.
This way an x86 implementation could use the VM_SHADOW_STACK vma
flag
to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
feedback
was that in some of these code paths "maybe" isn't really an
option, it
*needs* to make it writable. Even though the logic was the same,
the
name of the function made it look wrong.
But another option could be to change pte_mkwrite() to take a vma.
This
would save using another software bit on x86, but instead requires
a
small change to each arch's pte_mkwrite().
I played with that idea shortly as well, but discarded it. I was not
able to convince myself that it wouldn't be required to pass in the
VMA
as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ...
which would end up fairly ugly (or even impossible in thing slike
GUP-fast).
For example, I wonder how we'd be handling stuff like do_numa_page()
cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
either
call might set the PTE writable and maintain dirty bit ...
pte_modify() is handled like this currently:
https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@xxxxxxxxx/
There has been a couple iterations on that. The current solution is to
do the Dirty->SavedDirty fixup if needed after the new prots are added.
Of course pte_modify() can't know whether you are are attempting to
create a shadow stack PTE with the prot you are passing in. But the
callers today explicitly call pte_mkwrite() after filling in the other
bits with pte_modify().
See below on my MAP_PRIVATE vs. MAP_SHARED comment.
Today this patch causes the pte_mkwrite() to be
skipped and another fault may be required in the mprotect() and numa
cases, but if we change pte_mkwrite() to take a VMA we can just make it
shadow stack to start.
It might be worth mentioning, there was a suggestion in the past to try
to have the shadow stack bits come out of vm_get_page_prot(), but MM
code would then try to map the zero page as (shadow stack) writable
when there was a normal (non-shadow stack) read access. So I had to
abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
to make it shadow stack.
Thanks, do you have a pointer?
Having that said, maybe it could work with only a single saved-dirty
bit
and passing in the VMA for pte_mkwrite() only.
pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
bit
to the soft-dirty bit instead, resulting in
"writable=0,dirty=0,saved-dirty=1",
pte_dirty() would return dirty==1||saved-dirty==1.
pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
depending
on the writable bit.
pte_mkclean() would clean both bits.
pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
pte_mkwrite() would act according to the VMA, and in addition, merge
the
saved-dirty bit into the dirty bit.
pte_modify() and mk_pte() .... would require more thought ...
Not sure I'm following what the mk_pte() problem would be. You mean if
Write=0,Dirty=1 is manually added to the prot?
Shouldn't people generally use the pte_mkwrite() helpers unless they
are drawing from a prot that was already created with the helpers or
vm_get_page_prot()?
pte_mkwrite() is mostly only used (except for writenotify ...) for
MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory,
vma->vm_page_prot in a VM_WRITE mapping already contains the write
permissions. pte_mkwrite() is not necessary (again, unless writenotify
is active).
I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you
didn't stumble over that issue yet? Because I don't see how it could
work with MAP_SHARED VMAs.
The other thing I had in mind was that we have to make sure that we're
not accidentally setting "Write=0,Dirty=1" in mk_pte() / pte_modify().
Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect
using pte_modify(), we have to make sure to move the dirty bit to the
saved_dirty bit.
I think they can't manually create prot's from bits
in core mm code, right? And x86 arch code already has to be aware of
shadow stack. It's a bit of an assumption I guess, but I think maybe
not too crazy of one?
I think that's true. Arch code is supposed to deal with that IIRC.
Further, ptep_modify_prot_commit() might have to be adjusted to
properly
flush in all relevant cases IIRC.
Sorry, I'm not following. Can you elaborate? There is an adjustment
made in pte_flags_need_flush().
Note that I did not fully review all bits of this patch set, just
throwing out what was on my mind. If already handled, great.
x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
but
maybe it could additionally warn if the vma is not writable. It
also
seems more aligned with your changes to stop taking hints from PTE
bits
and just look at the VMA? (I'm thinking about the dropping of the
dirty
check in GUP and dropping pte_saved_write())
The soft-shstk bit wouldn't be a hint, it would be logically
changing
the "type" of the PTE such that any other PTE functions can do the
right
thing without having to consume the VMA.
Yea, true.
Thanks for your comments and ideas here, I'll give the:
pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
...solution a try.
Good!
--
Thanks,
David / dhildenb