Hi all, [+linux-arch since the end of this may be applicable to other architectures] Commit 24fe1b0efad4fcdd ("arm64: Remove unnecessary ISBs from set_{pte,pmd,pud") removed ISB instructions immediately following updates to the page table, on the grounds that they are not required by the architecture and a DSB alone is sufficient to ensure that subsequent data accesses use the new translation: DDI0487E_a, B2-128: | ... no instruction that appears in program order after the DSB instruction | can alter any state of the system or perform any part of its functionality | until the DSB completes other than: | | * Being fetched from memory and decoded | * Reading the general-purpose, SIMD and floating-point, Special-purpose, or | System registers that are directly or indirectly read without causing | side-effects. However, the same document also states the following: DDI0487E_a, B2-125: | DMB and DSB instructions affect reads and writes to the memory system | generated by Load/Store instructions and data or unified cache maintenance | instructions being executed by the PE. Instruction fetches or accesses | caused by a hardware translation table access are not explicit accesses. which appears to claim that the DSB alone is insufficient. Unfortunately, some CPU designers have followed the second clause above, whereas in Linux we've been relying on the first. This means that our mapping sequence: MOV X0, <valid pte> STR X0, [Xptep] // Store new PTE to page table DSB ISHST LDR X1, [X2] // Translates using the new PTE can actually raise a translation fault on the load instruction because the translation can be performed speculatively before the page table update and then marked as "faulting" by the CPU. For user PTEs, this is ok because we can handle the spurious fault, but for kernel PTEs and intermediate table entries this results in a panic(). We can fix this by reverting 24fe1b0efad4fcdd, but the fun doesn't stop there. If we consider the unmap case, then a similar constraint applies to ordering subsequent memory accesses after the completion of the TLB invalidation, so we also need to add an ISB instruction to __flush_tlb_kernel_pgtable(). For user addresses, the exception return provides the necessary context synchronisation. This then raises an interesting question: if an ISB is required after a TLBI instruction to prevent speculative translation of subsequent instructions, how is this speculation prevented on concurrent CPUs that receive the broadcast TLB invalidation message? Sending and completing a broadcast TLB invalidation message does not imply execution of an ISB on the remote CPU, however it /does/ require that the remote CPU will no longer make use of any old translations because otherwise we wouldn't be able to guarantee that an unmapped page could no longer be modified. In this regard, receiving a TLB invalidation is in some ways stronger than sending one (where you need the ISB). So far, so good, but the final piece of the puzzle isn't quite so rosy. *** Other architecture maintainers -- start here! *** In the case that one CPU maps a page and then sets a flag to tell another CPU: CPU 0 ----- MOV X0, <valid pte> STR X0, [Xptep] // Store new PTE to page table DSB ISHST ISB MOV X1, #1 STR X1, [Xflag] // Set the flag CPU 1 ----- loop: LDAR X0, [Xflag] // Poll flag with Acquire semantics CBZ X0, loop LDR X1, [X2] // Translates using the new PTE then the final load on CPU 1 can raise a translation fault for the same reasons as mentioned at the start of this description. In reality, code such as: CPU 0 CPU 1 ----- ----- spin_lock(&lock); spin_lock(&lock); *ptr = vmalloc(size); if (*ptr) spin_unlock(&lock); foo = **ptr; spin_unlock(&lock); will not trigger the fault because there is an address dependency on CPU1 which prevents the speculative translation. However, more exotic code where the virtual address is known ahead of time, such as: CPU 0 CPU 1 ----- ----- spin_lock(&lock); spin_lock(&lock); set_fixmap(0, paddr, prot); if (mapped) mapped = true; foo = *fix_to_virt(0); spin_unlock(&lock); spin_unlock(&lock); could fault. This can be avoided by any of: * Introducing broadcast TLB maintenance on the map path * Adding a DSB;ISB sequence after checking a flag which indicates that a virtual address is now mapped * Handling the spurious fault Given that we have never observed a problem in the concurrent case under Linux and future revisions of the architecture are being tightened so that translation table walks are effectively ordered in the same way as explicit memory accesses, we no longer treat spurious kernel faults as fatal if the page table indicates that the access was valid. Anyway, this patch series attempts to implement some of this and I plan to queue it for 5.4. Will Cc: Catalin Marinas <catalin.marinas@xxxxxxx> Cc: Mark Rutland <mark.rutland@xxxxxxx> Cc: Marc Zyngier <maz@xxxxxxxxxx> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> --->8 Will Deacon (6): Revert "arm64: Remove unnecessary ISBs from set_{pte,pmd,pud}" arm64: tlb: Ensure we execute an ISB following walk cache invalidation arm64: mm: Add ISB instruction to set_pgd() arm64: sysreg: Add some field definitions for PAR_EL1 arm64: mm: Ignore spurious translation faults taken from the kernel arm64: kvm: Replace hardcoded '1' with SYS_PAR_EL1_F arch/arm64/include/asm/pgtable.h | 13 ++++++++++--- arch/arm64/include/asm/sysreg.h | 3 +++ arch/arm64/include/asm/tlbflush.h | 1 + arch/arm64/kvm/hyp/switch.c | 2 +- arch/arm64/mm/fault.c | 33 +++++++++++++++++++++++++++++++++ 5 files changed, 48 insertions(+), 4 deletions(-) -- 2.11.0