On Tue, Aug 27, 2019 at 02:18:12PM +0100, Will Deacon wrote:
> Hi all,

Hi Will,

For the series:

Reviewed-by: Mark Rutland <mark.rutland@xxxxxxx>

Thanks,
Mark.

> [+linux-arch since the end of this may be applicable to other architectures]
>
> Commit 24fe1b0efad4fcdd ("arm64: Remove unnecessary ISBs from
> set_{pte,pmd,pud}") removed ISB instructions immediately following updates
> to the page table, on the grounds that they are not required by the
> architecture and a DSB alone is sufficient to ensure that subsequent data
> accesses use the new translation:
>
>   DDI0487E_a, B2-128:
>
>   | ... no instruction that appears in program order after the DSB
>   | instruction can alter any state of the system or perform any part of
>   | its functionality until the DSB completes other than:
>   |
>   | * Being fetched from memory and decoded
>   | * Reading the general-purpose, SIMD and floating-point,
>   |   Special-purpose, or System registers that are directly or indirectly
>   |   read without causing side-effects.
>
> However, the same document also states the following:
>
>   DDI0487E_a, B2-125:
>
>   | DMB and DSB instructions affect reads and writes to the memory system
>   | generated by Load/Store instructions and data or unified cache
>   | maintenance instructions being executed by the PE. Instruction fetches
>   | or accesses caused by a hardware translation table access are not
>   | explicit accesses.
>
> which appears to claim that the DSB alone is insufficient. Unfortunately,
> some CPU designers have followed the second clause above, whereas in Linux
> we've been relying on the first. This means that our mapping sequence:
>
>         MOV     X0, <valid pte>
>         STR     X0, [Xptep]     // Store new PTE to page table
>         DSB     ISHST
>         LDR     X1, [X2]        // Translates using the new PTE
>
> can actually raise a translation fault on the load instruction because the
> translation can be performed speculatively before the page table update and
> then marked as "faulting" by the CPU. For user PTEs, this is ok because we
> can handle the spurious fault, but for kernel PTEs and intermediate table
> entries this results in a panic().
>
> We can fix this by reverting 24fe1b0efad4fcdd, but the fun doesn't stop
> there. If we consider the unmap case, then a similar constraint applies to
> ordering subsequent memory accesses after the completion of the TLB
> invalidation, so we also need to add an ISB instruction to
> __flush_tlb_kernel_pgtable(). For user addresses, the exception return
> provides the necessary context synchronisation.
>
> This then raises an interesting question: if an ISB is required after a
> TLBI instruction to prevent speculative translation of subsequent
> instructions, how is this speculation prevented on concurrent CPUs that
> receive the broadcast TLB invalidation message? Sending and completing a
> broadcast TLB invalidation message does not imply execution of an ISB on
> the remote CPU; however, it /does/ require that the remote CPU will no
> longer make use of any old translations, because otherwise we wouldn't be
> able to guarantee that an unmapped page could no longer be modified. In
> this regard, receiving a TLB invalidation is in some ways stronger than
> sending one (where you need the ISB).
>
> So far, so good, but the final piece of the puzzle isn't quite so rosy.
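
As a concrete illustration of the map-path fix for anyone skimming: after
the revert, set_pte() for valid kernel mappings publishes the PTE with a
DSB and then executes an ISB before returning, roughly along the lines
below. This is a simplified sketch (helper names approximate the existing
arm64 ones), not the exact diff; see patch 1 for the real thing.

    /*
     * Sketch of the mapping path after the revert: publish the new PTE so
     * that the table walker can observe it (DSB), then discard any
     * speculatively-formed translations for instructions that follow in
     * program order (ISB). User PTEs don't need the ISB here because the
     * eventual exception return back to userspace provides the context
     * synchronisation.
     */
    static inline void set_pte(pte_t *ptep, pte_t pte)
    {
            WRITE_ONCE(*ptep, pte);

            if (pte_valid_not_user(pte)) {
                    dsb(ishst);
                    isb();
            }
    }
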
> *** Other architecture maintainers -- start here! ***
>
> In the case that one CPU maps a page and then sets a flag to tell another
> CPU:
>
>         CPU 0
>         -----
>
>         MOV     X0, <valid pte>
>         STR     X0, [Xptep]     // Store new PTE to page table
>         DSB     ISHST
>         ISB
>         MOV     X1, #1
>         STR     X1, [Xflag]     // Set the flag
>
>         CPU 1
>         -----
>
> loop:   LDAR    X0, [Xflag]     // Poll flag with Acquire semantics
>         CBZ     X0, loop
>         LDR     X1, [X2]        // Translates using the new PTE
>
> then the final load on CPU 1 can raise a translation fault for the same
> reasons as mentioned at the start of this description. In reality, code
> such as:
>
>         CPU 0                           CPU 1
>         -----                           -----
>         spin_lock(&lock);               spin_lock(&lock);
>         *ptr = vmalloc(size);           if (*ptr)
>         spin_unlock(&lock);                     foo = **ptr;
>                                         spin_unlock(&lock);
>
> will not trigger the fault because there is an address dependency on CPU 1
> which prevents the speculative translation. However, more exotic code
> where the virtual address is known ahead of time, such as:
>
>         CPU 0                           CPU 1
>         -----                           -----
>         spin_lock(&lock);               spin_lock(&lock);
>         set_fixmap(0, paddr, prot);     if (mapped)
>         mapped = true;                          foo = *fix_to_virt(0);
>         spin_unlock(&lock);             spin_unlock(&lock);
>
> could fault. This can be avoided by any of:
>
>   * Introducing broadcast TLB maintenance on the map path
>
>   * Adding a DSB;ISB sequence after checking a flag which indicates
>     that a virtual address is now mapped
>
>   * Handling the spurious fault
>
> Given that we have never observed a problem in the concurrent case under
> Linux, and future revisions of the architecture are being tightened so that
> translation table walks are effectively ordered in the same way as explicit
> memory accesses, we no longer treat spurious kernel faults as fatal if the
> page table indicates that the access was valid.
>
> Anyway, this patch series attempts to implement some of this and I plan
> to queue it for 5.4.
>
> Will
>
> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> Cc: Marc Zyngier <maz@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>
> --->8
>
> Will Deacon (6):
>   Revert "arm64: Remove unnecessary ISBs from set_{pte,pmd,pud}"
>   arm64: tlb: Ensure we execute an ISB following walk cache invalidation
>   arm64: mm: Add ISB instruction to set_pgd()
>   arm64: sysreg: Add some field definitions for PAR_EL1
>   arm64: mm: Ignore spurious translation faults taken from the kernel
>   arm64: kvm: Replace hardcoded '1' with SYS_PAR_EL1_F
>
>  arch/arm64/include/asm/pgtable.h  | 13 ++++++++++---
>  arch/arm64/include/asm/sysreg.h   |  3 +++
>  arch/arm64/include/asm/tlbflush.h |  1 +
>  arch/arm64/kvm/hyp/switch.c       |  2 +-
>  arch/arm64/mm/fault.c             | 33 +++++++++++++++++++++++++++++++++
>  5 files changed, 48 insertions(+), 4 deletions(-)
>
> --
> 2.11.0
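
One note for other architectures looking at the "Ignore spurious translation
faults" patch: the idea boils down to re-running the stage-1 translation for
the faulting address with an AT instruction and ignoring the fault if the
walk now succeeds. A simplified sketch of that check is below (the real
patch also verifies the fault type from the ESR first; SYS_PAR_EL1_F is the
PAR_EL1 field definition added earlier in the series):

    /*
     * Sketch: redo the translation with AT S1E1R and read the result from
     * PAR_EL1. If PAR_EL1.F is clear, the walk now succeeds, so the earlier
     * translation fault was spurious and can safely be ignored.
     */
    static bool is_spurious_el1_translation_fault(unsigned long addr)
    {
            unsigned long flags;
            u64 par;

            local_irq_save(flags);
            asm volatile("at s1e1r, %0" :: "r" (addr));
            isb();
            par = read_sysreg(par_el1);
            local_irq_restore(flags);

            return !(par & SYS_PAR_EL1_F);
    }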