On Wed, May 22, 2024 at 12:10:30PM -0400, Peter Xu wrote: > On Wed, May 22, 2024 at 05:34:21PM +0200, David Hildenbrand wrote: > > On 22.05.24 17:18, Peter Xu wrote: > > > On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote: > > > > On 22.05.24 00:36, Peter Xu wrote: > > > > > On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote: > > > > > > On Wed, May 22, 2024 at 2:37 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > > Hmm I still cannot reproduce. Weird. > > > > > > > > > > > > > > Would it be possible for you to identify which line in debug_vm_pgtable.c > > > > > > > triggered that issue? > > > > > > > > > > > > > > I think it should be some set_pte_at() but I'm not sure, as there aren't a > > > > > > > lot and all of them look benign so far. It could be that I missed > > > > > > > something important. > > > > > > > > > > > > I hope it's helps: > > > > > > > > > > Thanks for offering this, it's just that it doesn't look coherent with what > > > > > was reported for some reason. > > > > > > > > > > > > > > > > > > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04 > > > > > > debug_vm_pgtable+0x1c04/0x3360: > > > > > > native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94 > > > > > > (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262 > > > > > > (inlined by) ptep_clear at include/linux/pgtable.h:509 > > > > > > > > > > This is a pte_clear(), and pte_clear() shouldn't even do the set() checks, > > > > > and shouldn't stumble over what I added. > > > > > > > > > > IOW, it doesn't match with the real stack dump previously: > > > > > > > > > > [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0 > > > > > [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10 > > > > > [ 5.581544] ? __pfx_check_pgprot+0x10/0x10 > > > > > [ 5.581806] set_ptes.constprop.0+0x66/0xd0 > > > > > [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10 > > > > > [ 5.582333] ? __pfx_pte_val+0x10/0x10 > > > > > [ 5.582595] debug_vm_pgtable+0x1c04/0x3360 > > > > > > > > > > > > > Staring at pte_clear_tests(): > > > > > > > > #ifndef CONFIG_RISCV > > > > pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > > > > #endif > > > > set_pte_at(args->mm, args->vaddr, args->ptep, pte); > > > > > > > > So we set random PTE bits, probably setting the present, uffd and write bit > > > > at the same time. That doesn't make too much sense when we want to perform > > > > that such combinations cannot exist. > > > > > > Here the issue is I don't think it should set W bit anyway, as we init > > > page_prot to be RWX but !shared: > > > > > > args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS); > > > > > > On x86_64 (Mikhail's system) it should have W bit cleared afaict, meanwhile > > > the RANDOM_ORVALUE won't touch bit W due to S390_SKIP_MASK (which contains > > > bit W / bit 1, which is another "accident"..). Then even if with that it > > > should not trigger.. I think that's also why I cannot reproduce this > > > problem locally. > > > > Why oh why are skip mask applied independently of the architecture. > > > > While _PAGE_RW should indeed be masked out by RANDOM_ORVALUE. > > > > But with shadow stacks we consider a PTE writable (see > > pte_write()->pte_shstk()) if > > (1) X86_FEATURE_SHSTK is enabled > > (2) _PAGE_RW is clear > > (3) _PAGE_DIRTY is set > > > > _PAGE_DIRTY is bit 6. > > > > Likely your CPU does not support shadow stacks. > > Good point. My host has it, but I tested in the VM which doesn't. I > suppose we can wait and double check whether Mikhail should see the issue > went away with that patch provided. > > In this case, instead of keep fiddling with random bits to apply and > further work on top of per-arch random bits, I'd hope we can simply drop > that random mechanism as I don't think it'll be pxx_none() now. I attached > a patch I plan to post. Does it look reasonable? > > I also copied Anshuman, Gavin and Aneesh. No I didn't.. this one will.. > > Thanks, > > ===8<=== > From c10cde00b14d2d305390dd418a8a8855d3e6437f Mon Sep 17 00:00:00 2001 > From: Peter Xu <peterx@xxxxxxxxxx> > Date: Wed, 22 May 2024 12:04:33 -0400 > Subject: [PATCH] drop RANDOM_ORVALUE bits > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > --- > mm/debug_vm_pgtable.c | 30 ++++-------------------------- > 1 file changed, 4 insertions(+), 26 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index f1c9a2c5abc0..b5d7be05063a 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -40,22 +40,7 @@ > * Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics > * expectations that are being validated here. All future changes in here > * or the documentation need to be in sync. > - * > - * On s390 platform, the lower 4 bits are used to identify given page table > - * entry type. But these bits might affect the ability to clear entries with > - * pxx_clear() because of how dynamic page table folding works on s390. So > - * while loading up the entries do not change the lower 4 bits. It does not > - * have affect any other platform. Also avoid the 62nd bit on ppc64 that is > - * used to mark a pte entry. > */ > -#define S390_SKIP_MASK GENMASK(3, 0) > -#if __BITS_PER_LONG == 64 > -#define PPC64_SKIP_MASK GENMASK(62, 62) > -#else > -#define PPC64_SKIP_MASK 0x0 > -#endif > -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK) > -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) > #define RANDOM_NZVALUE GENMASK(7, 0) > > struct pgtable_debug_args { > @@ -511,8 +496,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args) > return; > > pr_debug("Validating PUD clear\n"); > - pud = __pud(pud_val(pud) | RANDOM_ORVALUE); > - WRITE_ONCE(*args->pudp, pud); > + WARN_ON(pud_none(pud)); > pud_clear(args->pudp); > pud = READ_ONCE(*args->pudp); > WARN_ON(!pud_none(pud)); > @@ -548,8 +532,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args) > return; > > pr_debug("Validating P4D clear\n"); > - p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE); > - WRITE_ONCE(*args->p4dp, p4d); > + WARN_ON(p4d_none(p4d)); > p4d_clear(args->p4dp); > p4d = READ_ONCE(*args->p4dp); > WARN_ON(!p4d_none(p4d)); > @@ -582,8 +565,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args) > return; > > pr_debug("Validating PGD clear\n"); > - pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE); > - WRITE_ONCE(*args->pgdp, pgd); > + WARN_ON(pgd_none(pgd)); > pgd_clear(args->pgdp); > pgd = READ_ONCE(*args->pgdp); > WARN_ON(!pgd_none(pgd)); > @@ -634,9 +616,6 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args) > if (WARN_ON(!args->ptep)) > return; > > -#ifndef CONFIG_RISCV > - pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > -#endif > set_pte_at(args->mm, args->vaddr, args->ptep, pte); > flush_dcache_page(page); > barrier(); > @@ -650,8 +629,7 @@ static void __init pmd_clear_tests(struct pgtable_debug_args *args) > pmd_t pmd = READ_ONCE(*args->pmdp); > > pr_debug("Validating PMD clear\n"); > - pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE); > - WRITE_ONCE(*args->pmdp, pmd); > + WARN_ON(pmd_none(pmd)); > pmd_clear(args->pmdp); > pmd = READ_ONCE(*args->pmdp); > WARN_ON(!pmd_none(pmd)); > -- > 2.45.0 > > -- > Peter Xu -- Peter Xu