Resending as I had the wrong address for linux-doc and kvmarm.
Apologies for the spam.

Hi All,

This patch series seeks to gather feedback on adding initial support
for level 2 of the Break-Before-Make arm64 architectural feature,
specifically to contpte_convert().

This support reorders a TLB invalidation in contpte_convert(), and
optionally elides said invalidation completely, which leads to a 12%
improvement when executing a microbenchmark designed to force the
pathological path where contpte_convert() gets called. This
represents an 80% reduction in the cost of calling contpte_convert().

However, the elision of the invalidation is still pending review to
ensure it is architecturally valid. Without it, the reordering also
represents a performance improvement due to reducing thread contention,
as there is a smaller time window for racing threads to see an invalid
pagetable entry (especially if they already have a cached entry in
their TLB that they are working off of).

This series is based on v6.13-rc2 (fac04efc5c79).

Break-Before-Make Level 2
=========================

Break-Before-Make (BBM) sequences ensure a consistent view of the
page tables. They avoid TLB multi-hits and ensure atomicity and
ordering guarantees. BBM level 0 simply defines the current use
of page tables. When you want to change certain bits in a pte,
you need to:

- clear the pte
- dsb()
- issue a tlbi for the pte
- dsb()
- repaint the pte
- dsb()

When changing block size, or toggling the contiguous bit, we
currently use this BBM level 0 sequence. With BBM level 2 support,
however, we can relax the BBM sequence and benefit from a performance
improvement. The hardware would then either automatically handle the
TLB invalidations, or would take a TLB Conflict Abort Exception.

This exception can either be a stage 1 or stage 2 exception, depending
on whether stage 1 or stage 2 translations are in use.
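
For reference, the ordering of the BBM level 0 sequence listed above can
be modelled in userspace as below. This is only a sketch and not the
kernel's implementation: model_dsb(), model_tlbi(), trace_step() and
bbm_level0_update() are hypothetical names that record the order of the
steps, and the pte is modelled as a plain uint64_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model: nothing here touches real page tables. */
static char bbm_trace[128];

/* Append a step name to the comma-separated trace. */
static void trace_step(const char *step)
{
	if (bbm_trace[0] != '\0')
		strncat(bbm_trace, ",",
			sizeof(bbm_trace) - strlen(bbm_trace) - 1);
	strncat(bbm_trace, step, sizeof(bbm_trace) - strlen(bbm_trace) - 1);
}

/* Stand-ins for the barrier and the TLB invalidation; they only log. */
static void model_dsb(void)  { trace_step("dsb"); }
static void model_tlbi(void) { trace_step("tlbi"); }

/* BBM level 0: break (clear, invalidate), then make (repaint). */
static void bbm_level0_update(uint64_t *ptep, uint64_t new_pte)
{
	assert(ptep != NULL);

	*ptep = 0;		/* clear the pte */
	trace_step("clear");
	model_dsb();
	model_tlbi();		/* issue a tlbi for the pte */
	model_dsb();
	*ptep = new_pte;	/* repaint the pte */
	trace_step("repaint");
	model_dsb();
}
```

Calling bbm_level0_update(&pte, new_pte) on a modelled pte leaves the
six steps in bbm_trace in the order listed above. The window in which
the pte is invalid spans everything between "clear" and "repaint".
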
When handling a TLB Conflict Abort, the architecture currently
mandates a worst-case invalidation of vmalle1 when stage 2 translation
is not in use, or vmalls12e1 when it is.

Outstanding Questions and Remaining TODOs
=========================================

Patch 4 moves the tlbi so that the window where the pte is invalid is
significantly smaller. This reduces the chances of racing threads
accessing the memory during the window and taking a fault. This is
confirmed to be architecturally sound.

Patch 5 removes the tlbi entirely. This has the benefit of
significantly reducing the cost of contpte_convert(). While testing
has demonstrated that this works as expected on Arm-designed CPUs, we
are still in the process of confirming whether it is architecturally
correct. I am requesting review while that process is ongoing. Patch
5 would be dropped if it turns out to be architecturally unsound.

Another note is that the stage 2 TLB conflict handling is included as
patch 1 of this series. This patch could (and probably should) be sent
separately as it may be useful outside this series, but is included
for reference.

Thanks,
Miko

Mikołaj Lenczewski (5):
  arm64: Add TLB Conflict Abort Exception handler to KVM
  arm64: Add BBM Level 2 cpu feature
  arm64: Add errata and workarounds for systems with broken BBML2
  arm64/mm: Delay tlbi in contpte_convert() under BBML2
  arm64/mm: Elide tlbi in contpte_convert() under BBML2

 Documentation/arch/arm64/silicon-errata.rst |  32 ++++
 arch/arm64/Kconfig                          | 164 ++++++++++++++++++++
 arch/arm64/include/asm/cpufeature.h         |  14 ++
 arch/arm64/include/asm/esr.h                |   8 +
 arch/arm64/kernel/cpufeature.c              |  37 +++++
 arch/arm64/kvm/mmu.c                        |   6 +
 arch/arm64/mm/contpte.c                     |   3 +-
 arch/arm64/mm/fault.c                       |  27 +++-
 arch/arm64/tools/cpucaps                    |   1 +
 9 files changed, 290 insertions(+), 2 deletions(-)

-- 
2.45.2