Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Wed, 25 Jan 2023 18:43:47 +0000

On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > Roughly speaking: if we abstract it that way and get all of the
> > > "how
> > > to
> > > set it writable now?" out of core-MM, it not only is cleaner and
> > > less
> > > error prone, it might even allow other architectures that
> > > implement
> > > something comparable (e.g., using a dedicated HW bit) to actually
> > > reuse
> > > some of that work. Otherwise most of that "shstk" is really just
> > > x86
> > > specific ...
> > > 
> > > I guess the only cases we have to special case would be page
> > > pinning
> > > code where pte_write() would indicate that the PTE is writable
> > > (well,
> > > it
> > > is, just not by "ordinary CPU instruction" context directly): but
> > > you
> > > do
> > > that already, so ... :)
> > > 
> > > Sorry for stumbling over that this late, I only started looking
> > > into
> > > this when you CCed me on that one patch.
> > 
> > Sorry for not calling more attention to it earlier. Appreciate your
> > comments.
> > 
> > Previously versions of this series had changed some of these
> > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > vma.
> > This way an x86 implementation could use the VM_SHADOW_STACK vma
> > flag
> > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > feedback
> > was that in some of these code paths "maybe" isn't really an
> > option, it
> > *needs* to make it writable. Even though the logic was the same,
> > the
> > name of the function made it look wrong.
> > 
> > But another option could be to change pte_mkwrite() to take a vma.
> > This
> > would save using another software bit on x86, but instead requires
> > a
> > small change to each arch's pte_mkwrite().
> 
> I played with that idea shortly as well, but discarded it. I was not 
> able to convince myself that it wouldn't be required to pass in the
> VMA 
> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... 
> which would end up fairly ugly (or even impossible in thing slike
> GUP-fast).
> 
> For example, I wonder how we'd be handling stuff like do_numa_page() 
> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> either 
> call might set the PTE writable and maintain dirty bit ...

pte_modify() is handled like this currently:

https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@xxxxxxxxx/

There has been a couple iterations on that. The current solution is to
do the Dirty->SavedDirty fixup if needed after the new prots are added.

Of course pte_modify() can't know whether you are are attempting to
create a shadow stack PTE with the prot you are passing in. But the
callers today explicitly call pte_mkwrite() after filling in the other
bits with pte_modify(). Today this patch causes the pte_mkwrite() to be
skipped and another fault may be required in the mprotect() and numa
cases, but if we change pte_mkwrite() to take a VMA we can just make it
shadow stack to start.

It might be worth mentioning, there was a suggestion in the past to try
to have the shadow stack bits come out of vm_get_page_prot(), but MM
code would then try to map the zero page as (shadow stack) writable
when there was a normal (non-shadow stack) read access. So I had to
abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
to make it shadow stack.

> 
> Having that said, maybe it could work with only a single saved-dirty
> bit 
> and passing in the VMA for pte_mkwrite() only.
> 
> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
> bit 
> to the soft-dirty bit instead, resulting in 
> "writable=0,dirty=0,saved-dirty=1",
> 
> pte_dirty() would return dirty==1||saved-dirty==1.
> 
> pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
> depending 
> on the writable bit.
> 
> pte_mkclean() would clean both bits.
> 
> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
> 
> pte_mkwrite() would act according to the VMA, and in addition, merge
> the 
> saved-dirty bit into the dirty bit.
> 
> pte_modify() and mk_pte() .... would require more thought ...

Not sure I'm following what the mk_pte() problem would be. You mean if
Write=0,Dirty=1 is manually added to the prot?

Shouldn't people generally use the pte_mkwrite() helpers unless they
are drawing from a prot that was already created with the helpers or
vm_get_page_prot()? I think they can't manually create prot's from bits
in core mm code, right? And x86 arch code already has to be aware of
shadow stack. It's a bit of an assumption I guess, but I think maybe
not too crazy of one?

> 
> 
> Further, ptep_modify_prot_commit() might have to be adjusted to
> properly 
> flush in all relevant cases IIRC.

Sorry, I'm not following. Can you elaborate? There is an adjustment
made in pte_flags_need_flush().

> 
> > 
> > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
> > but
> > maybe it could additionally warn if the vma is not writable. It
> > also
> > seems more aligned with your changes to stop taking hints from PTE
> > bits
> > and just look at the VMA? (I'm thinking about the dropping of the
> > dirty
> > check in GUP and dropping pte_saved_write())
> 
> The soft-shstk bit wouldn't be a hint, it would be logically
> changing 
> the "type" of the PTE such that any other PTE functions can do the
> right 
> thing without having to consume the VMA.

Yea, true.

Thanks for your comments and ideas here, I'll give the:
pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
...solution a try.