Re: [PATCH v5 10/39] x86/mm: Introduce _PAGE_COW

On 19.01.23 22:22, Rick Edgecombe wrote:
Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
in places where Linux normally creates Write=0,Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
Further differentiated by VMA flags, these PTE bit combinations would be
set as follows for various types of memory:

(Write=0,Cow=1,Dirty=0):
  - A modified, copy-on-write (COW) page. Previously when a typical
    anonymous writable mapping was made COW via fork(), the kernel would
    mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
    happens in copy_present_pte().
  - A R/O page that has been COW'ed. The user page is in a R/O VMA,
    and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
    handler creates a copy of the page and sets the new copy's PTE as
    Write=0 and Cow=1.
  - A shared shadow stack PTE. When a shadow stack page is being shared
    among processes (this happens at fork()), its PTE is made Dirty=0, so
    the next shadow stack access causes a fault, and the page is
    duplicated and Dirty=1 is set again. This is the COW equivalent for
    shadow stack pages, even though it's copy-on-access rather than
    copy-on-write.

(Write=0,Cow=0,Dirty=1):
  - A shadow stack PTE.
  - A Cow PTE created when a processor without shadow stack support set
    Dirty=1.
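
The combinations listed above can be summarized as a small user-space model. This is only a sketch for illustration, not kernel code: the RW and Dirty positions are the hardware-defined x86 bits, while the position chosen for the Cow bit (58) and all names (classify(), enum pte_kind) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Hardware-defined x86 PTE bits. */
#define PTE_RW		((pteval_t)1 << 1)
#define PTE_DIRTY	((pteval_t)1 << 6)
/* Software bit; position 58 is an assumption for this sketch. */
#define PTE_COW		((pteval_t)1 << 58)

/* Hypothetical classification, named only for this example. */
enum pte_kind {
	PTE_KIND_WRITABLE,	/* Write=1 */
	PTE_KIND_COW,		/* Write=0,Cow=1 */
	PTE_KIND_SHSTK,		/* Write=0,Cow=0,Dirty=1 */
	PTE_KIND_CLEAN_RO,	/* Write=0,Cow=0,Dirty=0 */
};

static enum pte_kind classify(pteval_t v)
{
	if (v & PTE_RW)
		return PTE_KIND_WRITABLE;
	if (v & PTE_COW)
		return PTE_KIND_COW;
	if (v & PTE_DIRTY)
		return PTE_KIND_SHSTK;
	return PTE_KIND_CLEAN_RO;
}

/* Usage: each combination from the list maps to one kind. */
static void classify_examples(void)
{
	assert(classify(PTE_RW | PTE_DIRTY) == PTE_KIND_WRITABLE);
	assert(classify(PTE_COW) == PTE_KIND_COW);
	assert(classify(PTE_DIRTY) == PTE_KIND_SHSTK);
	assert(classify(0) == PTE_KIND_CLEAN_RO);
}
```

The model ignores the VMA flags, which the text above says are needed to tell apart the entries that share one bit combination.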

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
because shadow stacks are not enabled there.

Implement only the infrastructure for _PAGE_COW. Changes to start
creating _PAGE_COW PTEs will follow once other pieces are in place.

Tested-by: Pengfei Xu <pengfei.xu@xxxxxxxxx>
Tested-by: John Allen <john.allen@xxxxxxx>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
---

v5:
  - Fix log, comments and whitespace (Boris)
  - Remove capitalization on shadow stack (Boris)

v4:
  - Teach pte_flags_need_flush() about _PAGE_COW bit
  - Break apart patch for better bisectability

v3:
  - Add comment around _PAGE_TABLE in response to comment
    from (Andrew Cooper)
  - Check for PSE in pmd_shstk (Andrew Cooper)
  - Get to the point quicker in commit log (Andrew Cooper)
  - Clarify and reorder commit log for why the PTE bit examples have
    multiple entries. Apply same changes for comment. (peterz)
  - Fix comment that implied dirty bit for COW was a specific x86 thing
    (peterz)
  - Fix swapping of Write/Dirty (PeterZ)

v2:
  - Update commit log with comments (Dave Hansen)
  - Add comments in code to explain pte modification code better (Dave)
  - Clarify info on the meaning of various Write,Cow,Dirty combinations

  arch/x86/include/asm/pgtable.h       | 78 ++++++++++++++++++++++++++++
  arch/x86/include/asm/pgtable_types.h | 59 +++++++++++++++++++--
  arch/x86/include/asm/tlbflush.h      |  3 +-
  3 files changed, 134 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b39f16c0d507..6d2f612c04b5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -301,6 +301,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
  	return native_make_pte(v & ~clear);
  }
+/*
+ * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the case
+ * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since the
+ * Dirty=1,Write=0 will result in the memory being treated as shadow stack
+ * by the HW. So when creating COW memory, a software bit is used
+ * _PAGE_BIT_COW. The following functions pte_mkcow() and pte_clear_cow()
+ * take a PTE marked conventionally COW (Dirty=1) and transition it to the
+ * shadow stack compatible version of COW (Cow=1).
+ */

TBH, I find that all highly confusing.

Dirty=1,Write=0 does not reliably indicate a COW page. You can have both false negatives and false positives.

False negative: fork() on a clean anon page.

False positive: wrprotect() of a dirty anon page.
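
To make the two cases concrete, here is a minimal user-space sketch, assuming that write-protection simply clears the RW bit. The RW and Dirty positions are the hardware x86 bits; model_wrprotect(), looks_cow() and show_mismatches() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Hardware-defined x86 PTE bits. */
#define PTE_RW		((pteval_t)1 << 1)
#define PTE_DIRTY	((pteval_t)1 << 6)

/* Model of write-protecting a PTE: only the RW bit is cleared. */
static pteval_t model_wrprotect(pteval_t v)
{
	return v & ~PTE_RW;
}

/* The heuristic under discussion: "Dirty=1,Write=0 means COW". */
static bool looks_cow(pteval_t v)
{
	return !(v & PTE_RW) && (v & PTE_DIRTY);
}

static void show_mismatches(void)
{
	/* fork() on a clean anon page: COW-shared, but Dirty=0. */
	pteval_t clean_after_fork = model_wrprotect(PTE_RW);
	assert(!looks_cow(clean_after_fork));	/* false negative */

	/* Write-protecting a dirty anon page that was never shared. */
	pteval_t dirty_after_wrprotect = model_wrprotect(PTE_RW | PTE_DIRTY);
	assert(looks_cow(dirty_after_wrprotect));	/* false positive */
}
```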


I wonder if it really has to be that complicated: what you really want to achieve is to disallow "Dirty=1,Write=0" if it's not a shadow stack page, correct?
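
One way to read that simpler rule is sketched below as a user-space model, not as a proposal for the actual patch: whenever a PTE is write-protected, a set Dirty bit is moved into a software bit, so Dirty=1,Write=0 never appears outside shadow stack. PTE_SW_DIRTY, its bit position (58) and the model_*() helpers are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Hardware-defined x86 PTE bits. */
#define PTE_RW		((pteval_t)1 << 1)
#define PTE_DIRTY	((pteval_t)1 << 6)
/* Hypothetical software bit holding Dirty while the PTE is read-only. */
#define PTE_SW_DIRTY	((pteval_t)1 << 58)

/* Write-protect without ever producing Dirty=1,Write=0. */
static pteval_t model_wrprotect(pteval_t v)
{
	v &= ~PTE_RW;
	if (v & PTE_DIRTY) {
		v &= ~PTE_DIRTY;
		v |= PTE_SW_DIRTY;
	}
	/* Invariant: the shadow stack encoding was not created. */
	assert(!(v & PTE_DIRTY));
	return v;
}

/* Make writable again and restore the hardware Dirty bit. */
static pteval_t model_mkwrite(pteval_t v)
{
	v |= PTE_RW;
	if (v & PTE_SW_DIRTY) {
		v &= ~PTE_SW_DIRTY;
		v |= PTE_DIRTY;
	}
	return v;
}

/* Dirtiness must be reported from either bit. */
static bool model_dirty(pteval_t v)
{
	return (v & (PTE_DIRTY | PTE_SW_DIRTY)) != 0;
}
```

In this model the software bit carries no COW meaning at all; it only preserves dirtiness across write-protection, which sidesteps the false negative and false positive cases mentioned above.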

--
Thanks,

David / dhildenb



