On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote: > > > Roughly speaking: if we abstract it that way and get all of the > > > "how > > > to > > > set it writable now?" out of core-MM, it not only is cleaner and > > > less > > > error prone, it might even allow other architectures that > > > implement > > > something comparable (e.g., using a dedicated HW bit) to actually > > > reuse > > > some of that work. Otherwise most of that "shstk" is really just > > > x86 > > > specific ... > > > > > > I guess the only cases we have to special case would be page > > > pinning > > > code where pte_write() would indicate that the PTE is writable > > > (well, > > > it > > > is, just not by "ordinary CPU instruction" context directly): but > > > you > > > do > > > that already, so ... :) > > > > > > Sorry for stumbling over that this late, I only started looking > > > into > > > this when you CCed me on that one patch. > > > > Sorry for not calling more attention to it earlier. Appreciate your > > comments. > > > > Previously versions of this series had changed some of these > > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a > > vma. > > This way an x86 implementation could use the VM_SHADOW_STACK vma > > flag > > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The > > feedback > > was that in some of these code paths "maybe" isn't really an > > option, it > > *needs* to make it writable. Even though the logic was the same, > > the > > name of the function made it look wrong. > > > > But another option could be to change pte_mkwrite() to take a vma. > > This > > would save using another software bit on x86, but instead requires > > a > > small change to each arch's pte_mkwrite(). > > I played with that idea shortly as well, but discarded it. I was not > able to convince myself that it wouldn't be required to pass in the > VMA > as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... > which would end up fairly ugly (or even impossible in thing slike > GUP-fast). > > For example, I wonder how we'd be handling stuff like do_numa_page() > cleanly correctly, where we use pte_modify() + pte_mkwrite(), and > either > call might set the PTE writable and maintain dirty bit ... pte_modify() is handled like this currently: https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@xxxxxxxxx/ There has been a couple iterations on that. The current solution is to do the Dirty->SavedDirty fixup if needed after the new prots are added. Of course pte_modify() can't know whether you are are attempting to create a shadow stack PTE with the prot you are passing in. But the callers today explicitly call pte_mkwrite() after filling in the other bits with pte_modify(). Today this patch causes the pte_mkwrite() to be skipped and another fault may be required in the mprotect() and numa cases, but if we change pte_mkwrite() to take a VMA we can just make it shadow stack to start. It might be worth mentioning, there was a suggestion in the past to try to have the shadow stack bits come out of vm_get_page_prot(), but MM code would then try to map the zero page as (shadow stack) writable when there was a normal (non-shadow stack) read access. So I had to abandon that approach and rely on explicit calls to pte_mkwrite/shstk() to make it shadow stack. > > Having that said, maybe it could work with only a single saved-dirty > bit > and passing in the VMA for pte_mkwrite() only. > > pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty > bit > to the soft-dirty bit instead, resulting in > "writable=0,dirty=0,saved-dirty=1", > > pte_dirty() would return dirty==1||saved-dirty==1. > > pte_mkdirty() would set either set dirty=1 or saved-dirty=1, > depending > on the writable bit. > > pte_mkclean() would clean both bits. > > pte_write() would detect "writable == 1 || (writable==0 && dirty==1)" > > pte_mkwrite() would act according to the VMA, and in addition, merge > the > saved-dirty bit into the dirty bit. > > pte_modify() and mk_pte() .... would require more thought ... Not sure I'm following what the mk_pte() problem would be. You mean if Write=0,Dirty=1 is manually added to the prot? Shouldn't people generally use the pte_mkwrite() helpers unless they are drawing from a prot that was already created with the helpers or vm_get_page_prot()? I think they can't manually create prot's from bits in core mm code, right? And x86 arch code already has to be aware of shadow stack. It's a bit of an assumption I guess, but I think maybe not too crazy of one? > > > Further, ptep_modify_prot_commit() might have to be adjusted to > properly > flush in all relevant cases IIRC. Sorry, I'm not following. Can you elaborate? There is an adjustment made in pte_flags_need_flush(). > > > > > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), > > but > > maybe it could additionally warn if the vma is not writable. It > > also > > seems more aligned with your changes to stop taking hints from PTE > > bits > > and just look at the VMA? (I'm thinking about the dropping of the > > dirty > > check in GUP and dropping pte_saved_write()) > > The soft-shstk bit wouldn't be a hint, it would be logically > changing > the "type" of the PTE such that any other PTE functions can do the > right > thing without having to consume the VMA. Yea, true. Thanks for your comments and ideas here, I'll give the: pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte) ...solution a try.