On 1/30/22 13:18, Rick Edgecombe wrote: > From: Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx> > > There is essentially no room left in the x86 hardware PTEs on some OSes > (not Linux). That left the hardware architects looking for a way to > represent a new memory type (shadow stack) within the existing bits. > They chose to repurpose a lightly-used state: Write=0, Dirty=1. > > The reason it's lightly used is that Dirty=1 is normally set by hardware > and cannot normally be set by hardware on a Write=0 PTE. Software must > normally be involved to create one of these PTEs, so software can simply > opt to not create them. This is kinda skipping over something important: The reason it's lightly used is that Dirty=1 is normally set _before_ a write. A write with a Write=0 PTE would typically only generate a fault, not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow stacks will no longer exhibit this oddity. > In places where Linux normally creates Write=0, Dirty=1, it can use the > software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other > words, whenever Linux needs to create Write=0, Dirty=1, it instead creates > Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This > clearly separates shadow stack from other data, and results in the > following: Following _what_... What are these? I think they're PTE states. Best to say that. > (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) Could you give an example of this? Would this be a typical anonymous page which was Write=1,Dirty=1, then historically made Write=0,Dirty=1 at fork()? > (b) A R/O page that has been COW'ed: (Write=0, Cow=1) > The user page is in a R/O VMA, and get_user_pages() needs a writable > copy. The page fault handler creates a copy of the page and sets > the new copy's PTE as Write=0 and Cow=1. > (c) A shadow stack PTE: (Write=0, Dirty=1) > (d) A shared shadow stack PTE: (Write=0, Cow=1) > When a shadow stack page is being shared among processes (this happens > at fork()), its PTE is made Dirty=0, so the next shadow stack access > causes a fault, and the page is duplicated and Dirty=1 is set again. > This is the COW equivalent for shadow stack pages, even though it's > copy-on-access rather than copy-on-write. Just like code, it's also nice to format these in a way which allows them to be visually compared, trivially. So, let's expand all the bits and vertically align everything. To break this down a bit, we have two old states: [a] (Write=0, Dirty=0, Cow=1) [b] (Write=0, Dirty=0, Cow=1) And two new ones: [c] (Write=0, Dirty=1, Cow=0) [d] (Write=0, Dirty=0, Cow=1) That makes me wonder what the difference is between [a] and [b] and why they are separate. Is their handling different? How are those two states differentiated? > (e) A page where the processor observed a Write=1 PTE, started a write, set > Dirty=1, but then observed a Write=0 PTE. That's possible today, but > will not happen on processors that support shadow stack. This left me wondering how you are going to detangle the mess where PTEs look like shadow-stack PTEs on non-shadow-stack hardware. Could you cover that here? You can shorten that above bullet to this to help make the space: (e) (Write=0, Dirty=1, Cow=0) PTE created when a processor without shadow stack support set Dirty=1. > Define _PAGE_COW and update pte_*() helpers and apply the same changes to > pmd and pud. > > After this, there are six free bits left in the 64-bit PTE, and no more > free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not > implemented for the 32-bit kernel. Just say: There are six bits left available to software in the 64-bit PTE after consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels because shadow stacks are not enabled there. There's no need to rub it in that 32-bit is out of space. > -static inline int pte_dirty(pte_t pte) > +static inline bool pte_dirty(pte_t pte) > { > - return pte_flags(pte) & _PAGE_DIRTY; > + /* > + * A dirty PTE has Dirty=1 or Cow=1. > + */ I don't really like that comment because "Cow" isn't anywhere to be found. > + return pte_flags(pte) & _PAGE_DIRTY_BITS; > +} > + > +static inline bool pte_shstk(pte_t pte) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return false; > + > + return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY; > } > > static inline int pte_young(pte_t pte) > @@ -133,9 +144,20 @@ static inline int pte_young(pte_t pte) > return pte_flags(pte) & _PAGE_ACCESSED; > } > > -static inline int pmd_dirty(pmd_t pmd) > +static inline bool pmd_dirty(pmd_t pmd) > { > - return pmd_flags(pmd) & _PAGE_DIRTY; > + /* > + * A dirty PMD has Dirty=1 or Cow=1. > + */ > + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; > +} > + > +static inline bool pmd_shstk(pmd_t pmd) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return false; > + > + return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY; > } > > static inline int pmd_young(pmd_t pmd) > @@ -143,9 +165,12 @@ static inline int pmd_young(pmd_t pmd) > return pmd_flags(pmd) & _PAGE_ACCESSED; > } > > -static inline int pud_dirty(pud_t pud) > +static inline bool pud_dirty(pud_t pud) > { > - return pud_flags(pud) & _PAGE_DIRTY; > + /* > + * A dirty PUD has Dirty=1 or Cow=1. > + */ > + return pud_flags(pud) & _PAGE_DIRTY_BITS; > } > > static inline int pud_young(pud_t pud) > @@ -155,13 +180,23 @@ static inline int pud_young(pud_t pud) > > static inline int pte_write(pte_t pte) > { > - return pte_flags(pte) & _PAGE_RW; > + /* > + * Shadow stack pages are always writable - but not by normal > + * instructions, and only by shadow stack operations. Therefore, > + * the W=0,D=1 test with pte_shstk(). > + */ I think that comment is off a bit. It's not really connected to the code. We don't, for instance need to know what the bit combination is inside pte_shstk(). Further, it's a bit mean to talk about "W" in the comment and _PAGE_RW in the code. How about: /* * Shadow stack pages are logically writable, but do not have * _PAGE_RW. Check for them separately from _PAGE_RW itself. */ > + return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte); > } > > #define pmd_write pmd_write > static inline int pmd_write(pmd_t pmd) > { > - return pmd_flags(pmd) & _PAGE_RW; > + /* > + * Shadow stack pages are always writable - but not by normal > + * instructions, and only by shadow stack operations. Therefore, > + * the W=0,D=1 test with pmd_shstk(). > + */ > + return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd); > } Ditto on the comment. Please copy the pte_write() one here too. > > #define pud_write pud_write > @@ -299,6 +334,24 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear) > return native_make_pte(v & ~clear); > } > > +static inline pte_t pte_mkcow(pte_t pte) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return pte; > + > + pte = pte_clear_flags(pte, _PAGE_DIRTY); > + return pte_set_flags(pte, _PAGE_COW); > +} > + > +static inline pte_t pte_clear_cow(pte_t pte) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return pte; > + > + pte = pte_set_flags(pte, _PAGE_DIRTY); > + return pte_clear_flags(pte, _PAGE_COW); > +} I think we need to say *SOMETHING* about the X86_FEATURE_SHSTK and _PAGE_COW connection here. Otherwise they look like two random features that are interacting in an unknown way. Maybe even something this simple: /* * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels. * See the _PAGE_COW definition for more details. */ Also, the manipulation of _PAGE_DIRTY is not clear here. It's obvious why we have to: pte_clear_flags(pte, _PAGE_COW); in a function called pte_clear_cow() but, again, how does _PAGE_DIRTY fit? > #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP > static inline int pte_uffd_wp(pte_t pte) > { > @@ -318,7 +371,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte) > > static inline pte_t pte_mkclean(pte_t pte) > { > - return pte_clear_flags(pte, _PAGE_DIRTY); > + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); > } > > static inline pte_t pte_mkold(pte_t pte) > @@ -328,7 +381,16 @@ static inline pte_t pte_mkold(pte_t pte) > > static inline pte_t pte_wrprotect(pte_t pte) > { > - return pte_clear_flags(pte, _PAGE_RW); > + pte = pte_clear_flags(pte, _PAGE_RW); > + > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PTE (RW=0, Dirty=1). Move the hardware Could you grep this series and try to be consistent about the formatting here? (Not that I've been perfect in this regard either). I think we have at least: Write=X,Dirty=Y W=X,D=Y RW=X,Dirty=Y > + * dirty value to the software bit. > + */ > + if (pte_dirty(pte)) > + pte = pte_mkcow(pte); > + return pte; > } One of my logical checks for this is "does it all go away when this is compiled out". Because of this: +static inline pte_t pte_mkcow(pte_t pte) +{ + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) + return pte; ... the answer is yes! So, this looks good to me. Just thought I'd share a bit of my thought process. > static inline pte_t pte_mkexec(pte_t pte) > @@ -338,7 +400,18 @@ static inline pte_t pte_mkexec(pte_t pte) > > static inline pte_t pte_mkdirty(pte_t pte) > { > - return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); > + pteval_t dirty = _PAGE_DIRTY; > + > + /* Avoid creating (HW)Dirty=1, Write=0 PTEs */ The "(HW)" thing doesn't make a lot of sense any longer. I think we had a set of HWDirty and SWDirty bits, but SWDirty ended up being morphed over to _PAGE_COW. > + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte)) > + dirty = _PAGE_COW; > + > + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY); > +} > + > +static inline pte_t pte_mkwrite_shstk(pte_t pte) > +{ > + return pte_clear_cow(pte); > } This one is a bit of black magic. This is taking a PTE from (presumably) states [c]->[d] from earlier in the changelog. Write=0,Dirty=0,Cow=1 to Write=0,Dirty=1,Cow=0 It's hard to wrap my head around how clearing a software bit (from the naming) will make this PTE writable. There's either something wrong with the naming, or something wrong with my mental model of what "COW clearing" is. > static inline pte_t pte_mkyoung(pte_t pte) > @@ -348,7 +421,12 @@ static inline pte_t pte_mkyoung(pte_t pte) > > static inline pte_t pte_mkwrite(pte_t pte) > { > - return pte_set_flags(pte, _PAGE_RW); > + pte = pte_set_flags(pte, _PAGE_RW); > + > + if (pte_dirty(pte)) > + pte = pte_clear_cow(pte); > + > + return pte; > } Along the same lines as the last few comments, this leaves me wondering why a pte_dirty() can't also be a "COW PTE". ... <snipping the pmd/pud copies> ... > #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index 3781a79b6388..1bfab70ff9ac 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -21,7 +21,8 @@ > #define _PAGE_BIT_SOFTW2 10 /* " */ > #define _PAGE_BIT_SOFTW3 11 /* " */ > #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ > -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ > +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ > +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ > #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ > #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ > #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ > @@ -34,6 +35,15 @@ > #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ > #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 > > +/* > + * Indicates a copy-on-write page. > + */ > +#ifdef CONFIG_X86_SHADOW_STACK > +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ > +#else > +#define _PAGE_BIT_COW 0 > +#endif > + > /* If _PAGE_BIT_PRESENT is clear, we use these: */ > /* - if the user mapped it with PROT_NONE; pte_present gives true */ > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > @@ -115,6 +125,36 @@ > #define _PAGE_DEVMAP (_AT(pteval_t, 0)) > #endif > > +/* > + * The hardware requires shadow stack to be read-only and Dirty. > + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs > + * from shadow stack PTEs: > + * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) > + * (b) A R/O page that has been COW'ed: (Write=0, Cow=1) > + * The user page is in a R/O VMA, and get_user_pages() needs a > + * writable copy. The page fault handler creates a copy of the page > + * and sets the new copy's PTE as Write=0, Cow=1. > + * (c) A shadow stack PTE: (Write=0, Dirty=1) > + * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1) > + * When a shadow stack page is being shared among processes (this > + * happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next > + * shadow stack access causes a fault, and the page is duplicated and > + * _PAGE_DIRTY is set again. This is the COW equivalent for shadow > + * stack pages, even though it's copy-on-access rather than > + * copy-on-write. > + * (e) A page where the processor observed a Write=1 PTE, started a write, > + * set Dirty=1, but then observed a Write=0 PTE (changed by another > + * thread). That's possible today, but will not happen on processors > + * that support shadow stack. This info, again, is great. Let's keep it, but please do reformat it like the changelog version to make the bit states easier to grok.