On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> On Mon, 21 Oct 2024 05:22:16 +0100,
> Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > time, and then reliably resume them when the local CPU exits critical
> > sections that preclude the execution of remote CPUs.
> >
> > A typical example of such critical sections is BBM on kernel PTEs.
> > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> >
> >   This is deemed UNPREDICTABLE by the Arm architecture without a
> >   break-before-make sequence (make the PTE invalid, TLBI, write the
> >   new valid PTE). However, such sequence is not possible since the
> >   vmemmap may be concurrently accessed by the kernel.
> >
> > Supporting BBM on kernel PTEs is one of the approaches that can make
> > HVO theoretically safe on arm64.
>
> Is the safety only theoretical? I would have expected that we'd use an
> approach that is absolutely rock-solid.

We've been trying to construct a repro against the original HVO
(missing BBM), but so far no success. Hopefully a repro does exist,
and then we'd be able to demonstrate the effectiveness of this series,
which is only theoretical at the moment.

> > Note that it is still possible for the paused CPUs to perform
> > speculative translations. Such translations would cause spurious
> > kernel PFs, which should be properly handled by
> > is_spurious_el1_translation_fault().
>
> Speculative translation faults are never reported, that'd be a CPU
> bug.

Right, I meant to say "speculative accesses that cause translations".

> *Spurious* translation faults can be reported if the CPU doesn't
> implement FEAT_ETS2, for example, and that has to do with the ordering
> of memory access wrt page-table walking for the purpose of translations.

Just want to make sure I fully understand: after the local CPU issues
the TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on
remote CPUs when they perform the invalidation, and therefore
speculative accesses are ordered as well on remote CPUs.

Also I'm assuming that IPIs on remote CPUs don't act like a full
barrier, and that whatever they interrupt can still be speculatively
executed even though the IPI handler itself doesn't access the vmemmap
area undergoing BBM.

Is this correct?
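
For concreteness, the critical section I have in mind is shaped roughly
like the sketch below. This is only a sketch: remap_kernel_pte() is a
made-up name, and pte_clear()/set_pte_at()/flush_tlb_kernel_range()
stand in for whatever helpers the actual HVO remap path uses.

static void remap_kernel_pte(unsigned long addr, pte_t *ptep, pte_t new_pte)
{
	/* pause_remote_cpus() asserts both of these. */
	cpus_read_lock();
	preempt_disable();

	pause_remote_cpus();

	/* BBM: invalidate the old entry, flush the TLB, then install the new one. */
	pte_clear(&init_mm, addr, ptep);
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	set_pte_at(&init_mm, addr, ptep, new_pte);

	resume_remote_cpus();

	preempt_enable();
	cpus_read_unlock();
}

The remote CPUs spin in the pseudo-NMI handler for the whole window, so
the only accesses they can make to the range undergoing BBM are
speculative ones, which is what the questions above are about.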
> > Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> > ---
> >  arch/arm64/include/asm/smp.h |  3 ++
> >  arch/arm64/kernel/smp.c      | 92 +++++++++++++++++++++++++++++++++---
> >  2 files changed, 88 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > index 2510eec026f7..cffb0cfed961 100644
> > --- a/arch/arm64/include/asm/smp.h
> > +++ b/arch/arm64/include/asm/smp.h
> > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> >  extern void crash_smp_send_stop(void);
> >  extern bool smp_crash_stop_failed(void);
> >
> > +void pause_remote_cpus(void);
> > +void resume_remote_cpus(void);
> > +
> >  #endif /* ifndef __ASSEMBLY__ */
> >
> >  #endif /* ifndef __ASM_SMP_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 3b3f6b56e733..68829c6de1b1 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> >  static int nr_ipi __ro_after_init = NR_IPI;
> >  static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> >
> > -static bool crash_stop;
> > +enum {
> > +	SEND_STOP = BIT(0),
> > +	CRASH_STOP = BIT(1),
> > +};
> > +
> > +static unsigned long stop_in_progress;
> >
> >  static void ipi_setup(int cpu);
> >
> > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> >  #endif
> >  }
> >
> > +static DEFINE_SPINLOCK(cpu_pause_lock);
> PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> you are dealing with kernel mappings?

Right, it should be a raw spinlock -- the caller disabled preemption,
which, as you said, is required when dealing with the kernel mappings.

> > +static cpumask_t paused_cpus;
> > +static cpumask_t resumed_cpus;
> > +
> > +static void pause_local_cpu(void)
> > +{
> > +	int cpu = smp_processor_id();
> > +
> > +	cpumask_clear_cpu(cpu, &resumed_cpus);
> > +	/*
> > +	 * Paired with pause_remote_cpus() to confirm that this CPU not only
> > +	 * will be paused but also can be reliably resumed.
> > +	 */
> > +	smp_wmb();
> > +	cpumask_set_cpu(cpu, &paused_cpus);
> > +	/* paused_cpus must be set before waiting on resumed_cpus. */
> > +	barrier();
> I'm not sure what this is trying to enforce. Yes, the compiler won't
> reorder the set and the test.

Sorry, I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already
contain a compiler barrier? My understanding is that the compiler is
free to reorder the set and the test on those two independent variables
and make it like this:

	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();

	cpumask_set_cpu(cpu, &paused_cpus);

So the CPU that sent the IPI would keep waiting on paused_cpus being
set, and this CPU would keep waiting on resumed_cpus being set, which
would end up with a deadlock.

> But your comment seems to indicate that you also need to make sure the
> CPU preserves that ordering, and short of a DMB, the test below could
> be reordered.

If this CPU reorders the set and the test like above, it wouldn't be a
problem, because the set would eventually become visible to the other
CPU that sent the IPI.

> > +	while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > +		cpu_relax();
> > +	/* A typical example for sleep and wake-up functions. */
> I'm not sure this is "typical",...

Sorry, this full barrier isn't needed. Apparently I didn't properly
remove it after the previous attempt to use wfe()/sev(), which made
this function the sleeper for resume_remote_cpus() to wake up.
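
For the record, if the CPU-level ordering did turn out to matter, the
way I'd spell it out is with smp_mb__{before,after}_atomic() rather
than barrier() plus the leftover smp_mb(). This is a sketch for
discussion only, not what the posted patch does, and whether the second
barrier is needed at all is exactly the question above:

static void pause_local_cpu(void)
{
	int cpu = smp_processor_id();

	cpumask_clear_cpu(cpu, &resumed_cpus);
	/*
	 * Make the clear of resumed_cpus visible before our bit in
	 * paused_cpus can be observed by the pausing CPU.
	 */
	smp_mb__before_atomic();
	cpumask_set_cpu(cpu, &paused_cpus);
	/*
	 * Order the store to paused_cpus before the load of resumed_cpus
	 * below, for both the compiler and the CPU.
	 */
	smp_mb__after_atomic();

	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();

	/* This CPU is no longer paused. */
	cpumask_clear_cpu(cpu, &paused_cpus);
}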
> > +	smp_mb();
> > +	cpumask_clear_cpu(cpu, &paused_cpus);
> > +}
> > +
> > +void pause_remote_cpus(void)
> > +{
> > +	cpumask_t cpus_to_pause;
> > +
> > +	lockdep_assert_cpus_held();
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > +	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
> This bitmap is manipulated outside of your cpu_pause_lock. What
> guarantees you can't have two CPUs stepping on each other here?

Do you mean cpus_to_pause? If so, that's a local bitmap.

> > +
> > +	spin_lock(&cpu_pause_lock);
> > +
> > +	WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > +
> > +	smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > +
> > +	while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > +		cpu_relax();
> This can be a lot of things to compare, especially as you are
> explicitly mentioning large systems. Why can't this be implemented as
> a counter instead?

Agreed - that'd be sufficient and simpler.

> Overall, this looks like stop_machine() in disguise. Why can't this
> use the existing infrastructure?

This came up during the previous discussion [1]. There are
similarities. The main concern with reusing stop_machine() (or part of
its current implementation) is that it may not meet the performance
requirements. Refactoring might be possible; however, it seems to me
(after checking the code again) that such a refactoring is unlikely to
end up with a cleaner or simpler codebase, especially after I got rid
of the CPU masks, as you suggested:

[1] https://lore.kernel.org/all/ZbKjHHeEdFYY1xR5@xxxxxxx/

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b3f6b56e733..b672af2441a3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -917,6 +922,64 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
 #endif
 }
 
+static DEFINE_RAW_SPINLOCK(cpu_pause_lock);
+static bool __cacheline_aligned_in_smp cpu_paused;
+static atomic_t __cacheline_aligned_in_smp nr_cpus_paused;
+
+static void pause_local_cpu(void)
+{
+	atomic_inc(&nr_cpus_paused);
+
+	while (READ_ONCE(cpu_paused))
+		cpu_relax();
+
+	atomic_dec(&nr_cpus_paused);
+}
+
+void pause_remote_cpus(void)
+{
+	cpumask_t cpus_to_pause;
+	int nr_cpus_to_pause = num_online_cpus() - 1;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	if (!nr_cpus_to_pause)
+		return;
+
+	cpumask_copy(&cpus_to_pause, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+	raw_spin_lock(&cpu_pause_lock);
+
+	WARN_ON_ONCE(cpu_paused);
+	WARN_ON_ONCE(atomic_read(&nr_cpus_paused));
+
+	cpu_paused = true;
+
+	smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
+
+	while (atomic_read(&nr_cpus_paused) != nr_cpus_to_pause)
+		cpu_relax();
+
+	raw_spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+	if (!cpu_paused)
+		return;
+
+	raw_spin_lock(&cpu_pause_lock);
+
+	WRITE_ONCE(cpu_paused, false);
+
+	while (atomic_read(&nr_cpus_paused))
+		cpu_relax();
+
+	raw_spin_unlock(&cpu_pause_lock);
+}
+
 static void arm64_backtrace_ipi(cpumask_t *mask)
 {
 	__ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);