On Wed, May 22, 2024 at 05:34:21PM +0200, David Hildenbrand wrote: > On 22.05.24 17:18, Peter Xu wrote: > > On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote: > > > On 22.05.24 00:36, Peter Xu wrote: > > > > On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote: > > > > > On Wed, May 22, 2024 at 2:37 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > Hmm I still cannot reproduce. Weird. > > > > > > > > > > > > Would it be possible for you to identify which line in debug_vm_pgtable.c > > > > > > triggered that issue? > > > > > > > > > > > > I think it should be some set_pte_at() but I'm not sure, as there aren't a > > > > > > lot and all of them look benign so far. It could be that I missed > > > > > > something important. > > > > > > > > > > I hope it's helps: > > > > > > > > Thanks for offering this, it's just that it doesn't look coherent with what > > > > was reported for some reason. > > > > > > > > > > > > > > > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04 > > > > > debug_vm_pgtable+0x1c04/0x3360: > > > > > native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94 > > > > > (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262 > > > > > (inlined by) ptep_clear at include/linux/pgtable.h:509 > > > > > > > > This is a pte_clear(), and pte_clear() shouldn't even do the set() checks, > > > > and shouldn't stumble over what I added. > > > > > > > > IOW, it doesn't match with the real stack dump previously: > > > > > > > > [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0 > > > > [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10 > > > > [ 5.581544] ? __pfx_check_pgprot+0x10/0x10 > > > > [ 5.581806] set_ptes.constprop.0+0x66/0xd0 > > > > [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10 > > > > [ 5.582333] ? __pfx_pte_val+0x10/0x10 > > > > [ 5.582595] debug_vm_pgtable+0x1c04/0x3360 > > > > > > > > > > Staring at pte_clear_tests(): > > > > > > #ifndef CONFIG_RISCV > > > pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > > > #endif > > > set_pte_at(args->mm, args->vaddr, args->ptep, pte); > > > > > > So we set random PTE bits, probably setting the present, uffd and write bit > > > at the same time. That doesn't make too much sense when we want to perform > > > that such combinations cannot exist. > > > > Here the issue is I don't think it should set W bit anyway, as we init > > page_prot to be RWX but !shared: > > > > args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS); > > > > On x86_64 (Mikhail's system) it should have W bit cleared afaict, meanwhile > > the RANDOM_ORVALUE won't touch bit W due to S390_SKIP_MASK (which contains > > bit W / bit 1, which is another "accident"..). Then even if with that it > > should not trigger.. I think that's also why I cannot reproduce this > > problem locally. > > Why oh why are skip mask applied independently of the architecture. > > While _PAGE_RW should indeed be masked out by RANDOM_ORVALUE. > > But with shadow stacks we consider a PTE writable (see > pte_write()->pte_shstk()) if > (1) X86_FEATURE_SHSTK is enabled > (2) _PAGE_RW is clear > (3) _PAGE_DIRTY is set > > _PAGE_DIRTY is bit 6. > > Likely your CPU does not support shadow stacks. Good point. My host has it, but I tested in the VM which doesn't. I suppose we can wait and double check whether Mikhail should see the issue went away with that patch provided. In this case, instead of keep fiddling with random bits to apply and further work on top of per-arch random bits, I'd hope we can simply drop that random mechanism as I don't think it'll be pxx_none() now. I attached a patch I plan to post. Does it look reasonable? I also copied Anshuman, Gavin and Aneesh. Thanks, ===8<=== >From c10cde00b14d2d305390dd418a8a8855d3e6437f Mon Sep 17 00:00:00 2001 From: Peter Xu <peterx@xxxxxxxxxx> Date: Wed, 22 May 2024 12:04:33 -0400 Subject: [PATCH] drop RANDOM_ORVALUE bits Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> --- mm/debug_vm_pgtable.c | 30 ++++-------------------------- 1 file changed, 4 insertions(+), 26 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index f1c9a2c5abc0..b5d7be05063a 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -40,22 +40,7 @@ * Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics * expectations that are being validated here. All future changes in here * or the documentation need to be in sync. - * - * On s390 platform, the lower 4 bits are used to identify given page table - * entry type. But these bits might affect the ability to clear entries with - * pxx_clear() because of how dynamic page table folding works on s390. So - * while loading up the entries do not change the lower 4 bits. It does not - * have affect any other platform. Also avoid the 62nd bit on ppc64 that is - * used to mark a pte entry. */ -#define S390_SKIP_MASK GENMASK(3, 0) -#if __BITS_PER_LONG == 64 -#define PPC64_SKIP_MASK GENMASK(62, 62) -#else -#define PPC64_SKIP_MASK 0x0 -#endif -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK) -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) #define RANDOM_NZVALUE GENMASK(7, 0) struct pgtable_debug_args { @@ -511,8 +496,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args) return; pr_debug("Validating PUD clear\n"); - pud = __pud(pud_val(pud) | RANDOM_ORVALUE); - WRITE_ONCE(*args->pudp, pud); + WARN_ON(pud_none(pud)); pud_clear(args->pudp); pud = READ_ONCE(*args->pudp); WARN_ON(!pud_none(pud)); @@ -548,8 +532,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args) return; pr_debug("Validating P4D clear\n"); - p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE); - WRITE_ONCE(*args->p4dp, p4d); + WARN_ON(p4d_none(p4d)); p4d_clear(args->p4dp); p4d = READ_ONCE(*args->p4dp); WARN_ON(!p4d_none(p4d)); @@ -582,8 +565,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args) return; pr_debug("Validating PGD clear\n"); - pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE); - WRITE_ONCE(*args->pgdp, pgd); + WARN_ON(pgd_none(pgd)); pgd_clear(args->pgdp); pgd = READ_ONCE(*args->pgdp); WARN_ON(!pgd_none(pgd)); @@ -634,9 +616,6 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args) if (WARN_ON(!args->ptep)) return; -#ifndef CONFIG_RISCV - pte = __pte(pte_val(pte) | RANDOM_ORVALUE); -#endif set_pte_at(args->mm, args->vaddr, args->ptep, pte); flush_dcache_page(page); barrier(); @@ -650,8 +629,7 @@ static void __init pmd_clear_tests(struct pgtable_debug_args *args) pmd_t pmd = READ_ONCE(*args->pmdp); pr_debug("Validating PMD clear\n"); - pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE); - WRITE_ONCE(*args->pmdp, pmd); + WARN_ON(pmd_none(pmd)); pmd_clear(args->pmdp); pmd = READ_ONCE(*args->pmdp); WARN_ON(!pmd_none(pmd)); -- 2.45.0 -- Peter Xu