> On May 17, 2023, at 3:31 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
>>> The point is that the generic vmalloc code is making assumptions which
>>> are x86 centric and not even necessarily true on x86.
>>>
>>> Whether or not this is beneficial on x86 is a completely separate
>>> debate.
>>
>> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
>> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
>> focusing on the question of whether it is beneficial also to do precise
>> TLB flushing, and the tradeoff there is less clear (especially since the
>> kernel uses 2MB pages).
>
> For the vmalloc() area mappings? Not really. The main penalty of doing a
> global flush is the innocent bystander TLB translations. These are
> likely the regular mappings, not the vmalloc ones.
>
>> My experience with non-IPI based TLB invalidations is more limited.
>> IIUC the usage model is that the TLB shootdowns should be invoked ASAP
>> (perhaps each range can be batched, but there is no sense in batching
>> multiple ranges), and then later you would issue some barrier to ensure
>> that prior TLB shootdown invocations have completed.
>>
>> If that is the (use) case, I am not sure the abstraction you used in
>> your prototype is the best one.
>
> The way arm/arm64 implement that in software is:
>
>      magic_barrier1();
>      flush_range_with_magic_opcodes();
>      magic_barrier2();
>
> And for that use case having the list with individual ranges is not
> really wrong.
>
> Maybe ARM[64] could do this smarter, but that would require rewriting a
> lot of code, I assume.

What you say makes sense - and I actually see that flush_tlb_page_nosync()
needs a memory barrier.

I just encountered recent patches that do the flushing on ARM in an async
manner, as I described. That is the reason I assumed it is more efficient:

https://lore.kernel.org/linux-mm/20230410134352.4519-3-yangyicong@xxxxxxxxxx/

>
>>> There is also a debate required on whether a wholesale "flush on _ALL_
>>> CPUs" is justified when some of those CPUs are completely isolated and
>>> have absolutely no chance of being affected by it. This process-bound
>>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>>> their computation in user space just because…
>>
>> I hope you would excuse my ignorance (I am sure you won't), but aren't
>> the seccomp/BPF VMAP ranges mapped in all processes (considering PTI,
>> of course)? Are you suggesting you want a per-process kernel address
>> space? (which can make sense, I guess)
>
> Right. The BPF muck is mapped in the global kernel space, but e.g. the
> seccomp filters are individual per process. At least that's how I
> understand it, but I might be completely wrong.

After re-reading the seccomp man page: the filters are not entirely
"private" to each process, as they are preserved across fork/exec. Yet,
one can imagine creating non-global kernel mappings, established per
process, that would hold the seccomp filters. That would remove the need
for a system-wide flush when the process dies.
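
To make sure we are talking about the same two flushing models above, here
is a rough sketch. It is not the actual arm64 code: struct flush_range,
dsb_ishst(), dsb_ish() and tlbi_vaddr_is() are made-up stand-ins for the
real dsb(ishst)/dsb(ish) barriers and the per-page broadcast TLBI, and the
loops ignore stride, level and ASID details:

        /*
         * Sketch only - hypothetical helpers, not the real arm64 API.
         *
         * Synchronous model (roughly the magic_barrier pattern quoted
         * above): one barrier before, per-page broadcast TLBIs, one
         * barrier after, for every range.
         */
        struct flush_range {
                unsigned long start;
                unsigned long end;
        };

        static void flush_ranges_sync(const struct flush_range *r, int nr)
        {
                unsigned long addr;
                int i;

                for (i = 0; i < nr; i++) {
                        dsb_ishst();    /* order prior PTE updates vs. TLBI */
                        for (addr = r[i].start; addr < r[i].end; addr += PAGE_SIZE)
                                tlbi_vaddr_is(addr);    /* broadcast invalidate */
                        dsb_ish();      /* wait for the TLBIs to complete */
                }
        }

        /*
         * Async model I described: issue the nosync invalidations as each
         * range becomes known, and pay for a single completion barrier at
         * the very end, once, for all of them.
         */
        static void flush_range_nosync(const struct flush_range *r)
        {
                unsigned long addr;

                dsb_ishst();
                for (addr = r->start; addr < r->end; addr += PAGE_SIZE)
                        tlbi_vaddr_is(addr);    /* no completion barrier here */
        }

        static void flush_wait(void)
        {
                dsb_ish();      /* one barrier covers every TLBI issued so far */
        }

The point of the second variant is that the completion barrier is paid
once per batch rather than once per range, which, as far as I can tell,
is what the patches linked above do for the reclaim path.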
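
And for the last point, a purely illustrative sketch of what I mean by
per-process, non-global mappings for the filters. vmap_private() does not
exist, and I am hand-waving about where such an mm-private region would
live in the kernel address space:

        /*
         * Purely illustrative - vmap_private() is made up.  The idea: map
         * the filter into an mm-private part of the kernel address space
         * with a non-global PTE (no _PAGE_GLOBAL on x86, nG set on arm64),
         * so its TLB entries are tagged with the process's PCID/ASID.
         */
        static void *map_seccomp_filter(struct mm_struct *mm,
                                        struct page **pages, int nr)
        {
                /* PAGE_KERNEL normally carries _PAGE_GLOBAL; drop it */
                pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);

                /* hypothetical: like vmap(), but into an mm-private region */
                return vmap_private(mm, pages, nr, prot);
        }

When the process dies the mapping dies with the mm, so the normal
PCID/ASID invalidation on mm teardown replaces the system-wide flush.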