[ Let's try this again, without the html crud this time.
  Apologies to the people who see this reply twice ]

On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> So this requires that all architectures actually walk all relevant
> CPUs to see if an IPI is needed and send that IPI. On architectures
> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
> and others), fine. But on architectures with a broadcast-to-all-CPUs
> flush (ARM64 IIUC), the extra IPI will be much, much slower than a
> simple load-acquire in a loop.

... hmm. How about a hybrid scheme?

 (a) architectures that already require that IPI anyway for TLB
     invalidation (ie x86, but others too) just make the rule be that
     the TLB flush by exit_mmap() gets rid of any lazy TLB mm
     references. Which they already do.

 (b) architectures like arm64 that do hw-assisted TLB shootdown will
     have an ASID allocator model, and what you do is use that to
     either

     (b') increment/decrement the mm_count at mm ASID
          allocation/freeing time

     (b'') use the existing ASID tracking data to find the CPUs that
           have that ASID

 (c) can you really imagine hardware TLB shootdown without ASID
     allocation? That doesn't seem to make sense. But if it exists,
     maybe that kind of crazy case would do the percpu array walking.

(Honesty in advertising: I don't know the arm64 ASID code - I used to
know the old alpha version I wrote in a previous lifetime - but afaik
any ASID allocator has to be able to track the CPUs that have a
particular ASID in use and be able to invalidate it.)

Hmm. The x86 maintainers are on this thread, but they aren't even the
problem. Adding Catalin and Will to this, I think they should know
if/how this would fit with the arm64 ASID allocator.

Will/Catalin, background here:

  https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@xxxxxxxxxxxxxx/

for my objection to that special "keep non-refcounted magic per-cpu
pointer to lazy tlb mm".

              Linus
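
For concreteness, a minimal sketch of what option (b') could look
like. The arch_asid_alloc()/arch_asid_free() hooks and
hw_pick_free_asid() are hypothetical placeholders, not the actual
arm64 ASID allocator interfaces; only mmgrab()/mmdrop() are the
existing helpers that adjust mm_count.

/*
 * Sketch of (b'): pin the mm via mm_count for as long as it owns a
 * live ASID, instead of refcounting the per-CPU lazy-TLB pointer.
 * arch_asid_alloc()/arch_asid_free() and hw_pick_free_asid() are
 * made-up names for illustration only.
 */
#include <linux/mm_types.h>
#include <linux/sched/mm.h>

static u64 arch_asid_alloc(struct mm_struct *mm)
{
	u64 asid = hw_pick_free_asid();	/* placeholder for the real allocator */

	/*
	 * One reference covers every CPU that may still be using this
	 * mm lazily while the ASID remains live in hardware.
	 */
	mmgrab(mm);
	return asid;
}

static void arch_asid_free(struct mm_struct *mm)
{
	/*
	 * The ASID is being recycled (and invalidated everywhere by the
	 * broadcast TLB flush), so any lazy users of the mm are gone
	 * too; drop the pin taken at allocation time.
	 */
	mmdrop(mm);
}

Under such a scheme the per-cpu lazy pointer never needs its own
reference count: the mm cannot be freed while any ASID that might
still be lazily in use on some CPU remains allocated.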