On 23.01.23 21:47, Edgecombe, Rick P wrote:
On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
On 19.01.23 22:22, Rick Edgecombe wrote:
The x86 Control-flow Enforcement Technology (CET) feature includes
a new
type of memory called shadow stack. This shadow stack memory has
some
unusual properties, which requires some core mm changes to function
properly.
Since shadow stack memory can be changed from userspace, is both
VM_SHADOW_STACK and VM_WRITE. But it should not be made
conventionally
writable (i.e. pte_mkwrite()). So some code that calls
pte_mkwrite() needs
to be adjusted.
One such case is when memory is made writable without an actual
write
fault. This happens in some mprotect operations, and also prot_numa
faults.
In both cases code checks whether it should be made
(conventionally)
writable by calling vma_wants_manual_pte_write_upgrade().
One way to fix this would be have code actually check if memory is
also
VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
since
most memory won't be shadow stack, just have simpler logic and skip
this
optimization by changing vma_wants_manual_pte_write_upgrade() to
not
return true for VM_SHADOW_STACK_MEMORY. This will simply handle all
cases of this type.
Cc: David Hildenbrand <david@xxxxxxxxxx>
Tested-by: Pengfei Xu <pengfei.xu@xxxxxxxxx>
Tested-by: John Allen <john.allen@xxxxxxx>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
---
Instead of having these x86-shadow stack details all over the MM
space,
was the option explored to handle this more in arch specific code?
IIUC, one way to get it working would be
1) Have a SW "shadowstack" PTE flag.
2) Have an "SW-dirty" PTE flag, to store "dirty=1" when "write=0".
I don't think that idea came up. So vma->vm_page_prot would have the SW
shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
Write=0,Dirty=1 part. It seems like it should work.
Right, if we include it in vma->vm_page_prot, we'd immediately let
mk_pte() just handle that.
Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma instead
of the vma->vm_page_prot. Let's see if we can avoid that for now.
pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
based
on the "shadowstack" PTE flag and hide all these details from core-
mm.
When mapping a shadowstack page (new page, migration, swapin, ...),
which can be obtained by looking at the VMA flags, the first thing
you'd
do is set the "shadowstack" PTE flag.
I guess the downside is that it uses an extra software bit. But the
other positive is that it's less error prone, so that someone writing
core-mm code won't introduce a change that makes shadow stack VMAs
Write=1 if they don't know to also check for VM_SHADOW_STACK.
Right. And I think this mimics the what I would have expected HW to
provide: a dedicated HW bit, not somehow mangling this into semantics of
existing bits.
Roughly speaking: if we abstract it that way and get all of the "how to
set it writable now?" out of core-MM, it not only is cleaner and less
error prone, it might even allow other architectures that implement
something comparable (e.g., using a dedicated HW bit) to actually reuse
some of that work. Otherwise most of that "shstk" is really just x86
specific ...
I guess the only cases we have to special case would be page pinning
code where pte_write() would indicate that the PTE is writable (well, it
is, just not by "ordinary CPU instruction" context directly): but you do
that already, so ... :)
Sorry for stumbling over that this late, I only started looking into
this when you CCed me on that one patch.
--
Thanks,
David / dhildenb