> On Nov 7, 2023, at 9:00 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>
> On Mon, 30 Oct 2023 07:01:48 PDT (-0700), nadav.amit@xxxxxxxxx wrote:
>>
>>> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
>>> +	on_each_cpu_mask(cmask,
>>> +			 __ipi_flush_tlb_range_asid,
>>> +			 &ftd, 1);
>>
>> Unrelated, but having fed
>
> Do you mean `ftd`?
>
> If so I'm not all that convinced that's a problem: sure it's 4x`long`, so we pass it on the stack instead of registers, but otherwise we'd need another `on_each_cpu_mask()` callback to shim stuff through via registers.

I have no idea why you would need to move anything through registers.

>> Actually, it is best not to put it on the stack, if possible to reduce
>> cache traffic.
>
> Sorry if I'm just missing something, but I'm not convinced this is a measurable performance problem.

I am not going to try to convince you (I ran the numbers on x86 a long time ago). There is a cost to bouncing cache lines (because multiple cores access the stack), and to TLB misses on remote cores (which are mostly avoidable if `ftd` is global).

Having said that, the optimizations you added now and intend to add in the next steps are definitely more important for performance.
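
To make the layout point concrete, here is a minimal userspace sketch, not kernel code. It only illustrates how a 4x`long` descriptor can either share (or straddle) cache lines with unrelated data or sit alone on a line of its own. Everything in it is an assumption for illustration: the 64-byte line size, the names `flush_desc`, `flush_desc_aligned`, `cache_lines_spanned()` and `global_desc_lines()`, and the field names. It does not measure anything.

```c
#include <assert.h>   /* static_assert */
#include <stdalign.h> /* alignas */
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; real hardware may differ. */
#define CACHE_LINE 64

/* Hypothetical stand-in for the descriptor: four longs. */
struct flush_desc {
	unsigned long asid;
	unsigned long start;
	unsigned long size;
	unsigned long stride;
};

/* Same payload, aligned (and therefore padded) to one assumed cache line. */
struct flush_desc_aligned {
	alignas(CACHE_LINE) struct flush_desc d;
};

static_assert(sizeof(struct flush_desc_aligned) == CACHE_LINE,
	      "aligned descriptor should occupy exactly one line");

/* A global instance: it never shares its line with a caller's stack data. */
struct flush_desc_aligned global_desc;

/* Number of cache lines touched by len bytes starting at addr (len > 0). */
size_t cache_lines_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first = addr / CACHE_LINE;
	uintptr_t last = (addr + len - 1) / CACHE_LINE;

	return (size_t)(last - first + 1);
}

/* Lines touched when remote readers load the global descriptor's payload. */
size_t global_desc_lines(void)
{
	return cache_lines_spanned((uintptr_t)&global_desc,
				   sizeof(struct flush_desc));
}
```

A descriptor on the stack has no such alignment guarantee: at an offset of 48 bytes into a line, for example, a 32-byte payload spans two lines, and those lines also hold whatever else the caller keeps on its stack. Whether that difference is measurable on a given platform is exactly the open question in this thread.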