Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether to do 1 IPI
> with a potentially global flush or with a detailed list of ranges to
> flush.

Correct, and there is obviously a tradeoff too, which has yet to be
determined.

>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.

The key is a subsequent instruction which might depend on the to-be-flushed
TLB entry. That's obvious, but I'm having a hard time constructing that
dependent instruction in this case.

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are no
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed, whether you have 1 remote core or N
> remote cores is not the whole issue (putting aside NUMA). But you will
> get first a snoop to the initiator cache by the responding core, and
> then, after the TLB invalidation is completed, an RFO by the initiator
> once it writes to the cache line again. If the invalidation data is on
> the stack (as you did), this is even more likely to happen shortly after.

That's correct, and there might be smarter ways to handle that list
muck.
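Just to illustrate what I mean by "smarter", a completely untested
sketch (the struct and variable names below are made up and exist
nowhere; only ____cacheline_aligned and DEFINE_PER_CPU are existing
kernel facilities): keep the range list in a per-CPU, cacheline
aligned descriptor instead of on the initiator's stack, so the lines
the responders snoop are not ones the initiator dirties again right
away.

    #include <linux/cache.h>
    #include <linux/percpu.h>

    /* Illustrative only, nothing of this exists in the tree. */
    struct flush_range {
            unsigned long start;
            unsigned long end;
    };

    struct flush_list {
            unsigned int       nr;
            struct flush_range ranges[16];
    } ____cacheline_aligned;

    static DEFINE_PER_CPU(struct flush_list, vmalloc_flush_list);

    /*
     * The initiator fills its own per-CPU list once, sends the single
     * IPI and does not touch the list again until the next flush, so
     * the responders' snoops are not immediately followed by an RFO
     * back to the initiator, which is what happens with a stack based
     * list that gets scribbled over right after the IPI.
     */

Whether that actually buys anything over the stack variant is exactly
the kind of thing which needs to be measured rather than asserted.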
>> So just dismissing this on 10 year old experience is not really
>> helpful, though I'm happy to confirm your points once I have the time
>> and opportunity to actually run real testing over it, unless you beat
>> me to it.
>
> I really don’t know what “dismissing” you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effect for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of scope here), which I saw having a performance
> benefit. That’s all to say that the caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of the
decision making and something to investigate.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86 centric and not even necessarily true on x86.
>>
>> Whether or not this is beneficial on x86 is a completely separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially as the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense in batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.

The way arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case having the list with individual ranges is not
really wrong. Maybe ARM[64] could do this smarter, but that would
require rewriting a lot of code, I assume.

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs" is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process bound
>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>> their computation in user space just because…
>
> I hope you would excuse my ignorance (I am sure you won’t), but aren't
> the seccomp/BPF VMAP ranges mapped in all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make sense, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.

Thanks,

        tglx