>> I suspect AMD wouldn't tell us exactly ;)
>
> Well, ideally they would just tell us the conditions under which CPUs
> respond to the broadcast TLB flush or the expectations around latency.

[Resend, complete this time]

Disclaimer.  I'm not at AMD; I don't know how they implement it; I'm
just a random person on the internet.  But, here are a few things that
might be relevant to know.

AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
the CPU TLB and related structures" and also "When required, hardware
automatically performs TLB invalidations to ensure that all processors
in the system see the updated RMP entry information."

That sentence doesn't use "broadcast" or "remote", but "all processors"
is a pretty clear clue.  Broadcast TLB invalidations are a building
block of all the RMP-manipulation instructions.

Furthermore, to be useful in this context, they need to be ordered with
memory.  Specifically, a new pagewalk mustn't start after an
invalidation, yet observe the stale RMP entry.

x86 CPUs do have reasonable forward-progress guarantees, but in order to
achieve forward progress, they need to e.g. guarantee that one memory
access doesn't displace the TLB entry backing a different memory access
from the same instruction, or you could livelock while trying to
complete a single instruction.

A consequence is that you can't safely invalidate a TLB entry of an
in-progress instruction (although this means only the oldest instruction
in the pipeline, because everything else is speculative and potentially
transient).

INVLPGB invalidations are interrupt-like from the point of view of the
remote core, but are microarchitectural and can be taken irrespective of
the architectural Interrupt and Global Interrupt Flags.  As a
consequence, they'll need to wait until an instruction boundary to be
processed.
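To make the "wait until an instruction boundary" point concrete, here is
a toy model in Python.  It is only a sketch of the reasoning above, not a
description of any real hardware: the function names (ack_time,
sync_wait), the cycle counts, and the assumption that each remote CPU
runs instructions strictly back to back are all made up for
illustration.

```python
from itertools import accumulate


def ack_time(instr_cycles, post_time):
    """Return the first instruction boundary at or after post_time.

    instr_cycles: hypothetical durations (in cycles) of the instructions
    one remote CPU executes back to back, starting at time 0.
    Boundaries fall at time 0 and at each running sum of the durations.
    """
    for boundary in accumulate(instr_cycles, initial=0):
        if boundary >= post_time:
            return boundary
    raise ValueError("post_time is past the end of the modelled stream")


def sync_wait(remote_cpus, post_time):
    """Cycles until every remote CPU has reached a boundary.

    Assumes the invalidation is posted to all CPUs at post_time and that
    each CPU can only act on it at its next instruction boundary.
    """
    return max(ack_time(cpu, post_time) for cpu in remote_cpus) - post_time


# Two modelled remote CPUs: one running short instructions, one that has
# just started a long instruction (500 cycles is an invented figure).
cpu_a = [1, 1, 1, 1, 1, 1]
cpu_b = [2, 500, 1]
wait = sync_wait([cpu_a, cpu_b], post_time=3)
```

In this model the wait is set entirely by whichever remote CPU is in the
middle of the longest instruction when the invalidation is posted; the
CPU running short instructions contributes nothing.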
While not AMD, the Intel RAR whitepaper [2] discusses the handling of
RARs on the remote processor, and they share a number of constraints in
common with INVLPGB.

Overall, I'd expect the INVLPGB instructions to be pretty quick in and
of themselves; interestingly, they're not identified as architecturally
serialising.  The broadcast is probably posted, and will be dealt with
by remote processors on the subsequent instruction boundary.  TLBSYNC is
the barrier to wait until the invalidations have been processed, and
this will block for an unspecified length of time, probably bounded by
the "longest" instruction in progress on a remote CPU.  e.g. I expect it
probably will suck if you have to wait for a WBINVD instruction to
complete on a remote CPU.

That said, architectural IPIs have the same conditions too, except on
top of that you've got to run a whole interrupt handler.  So, with
reasonable confidence, however slow TLBSYNC might be in the worst case,
it's got absolutely nothing on the overhead of doing invalidations the
old fashioned way.

~Andrew

[1] https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
[2] https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf