----- pbonzini@xxxxxxxxxx wrote: > On 14/05/2018 16:32, Liran Alon wrote: > > > > Eventually I implemented (2) just because it was simpler to > implement > > and gave good enough performance for our needs. It could also be > > further optimized by having a per-PCID array of X VAs to > > virtual-invlpg on next switch to that PCID. If array is full, > > fallback to treat cr3-switch as done with "TLB flush" even if bit > 63 > > is set. > > > > The above approach worked pretty-well and this made me wonder how > the > > approach suggested by this patch series compares to my approach. I > > think there is an interesting trade-off between these 2 approaches: > > > My approach is simpler and support PCID & INVPCID generically > without > > limiting performance gain to specific workload of KPTI usage of > PCID. > > For example, guest that use PCIDs for minimizing TLB flushes (Linux > > has such mechanism) would have improved performance. However, the > > main drawback of this approach is that GPTs mapping user-space > > portion of process will have multiple SPTs mirroring them for > > user/kernel PCIDs. Therefore, this approach utilize more SPT pages > > and therefore could result in more MMU flushes. This is in contrast > > to this patch series which can share SPTs mirroring the GPTs mapping > > > user-space by both user PCID and kernel PCID. > > > > Turned out to be quite long but wanted to just give my two cents > > analysis on this series. :) > > I agree that your approach should result in less code. I am not sure > where you stored the PCID bits, but if it's the hash key, then we > might > as well store fewer and force a TLB flush when there is a conflict. > This should limit the amount of memory used by SPTEs. To which PCID bits are you referring? For every SPT I have added a 12-bit field for storing active PCID when SPT was created. Unrelated, to support INVPCID, I added a per-vCPU bitmap of 0x1000 bits (512 bytes) that signals to which PCIDs on next switch to them, we should force a SPT sync & TLB flush even if new CR3 bit 63 is set. Because the bitmap is per-vCPU, it doesn't take memory from SPTs. The reason I said my approach consumes more SPT pages is because SPTs mirroring the same GPT in same paging-mode and same paging-level are duplicated between different PCIDs. Also, I didn't quite got your opinion regarding the trade-off between the different approaches. > > Paolo