Re: [PATCH 00/11] Shadow Paging performance improvements

Liran Alon <liran.alon@xxxxxxxxxx> · Mon, 14 May 2018 12:59:36 -0700 (PDT)

----- pbonzini@xxxxxxxxxx wrote:

> On 14/05/2018 16:32, Liran Alon wrote:
> > 
> > Eventually I implemented (2) just because it was simpler to
> implement
> > and gave good enough performance for our needs. It could also be
> > further optimized by having a per-PCID array of X VAs to
> > virtual-invlpg on next switch to that PCID. If array is full,
> > fallback to treat cr3-switch as done with "TLB flush" even if bit
> 63
> > is set.
> > 
> > The above approach worked pretty-well and this made me wonder how
> the
> > approach suggested by this patch series compares to my approach. I
> > think there is an interesting trade-off between these 2 approaches:
> 
> > My approach is simpler and support PCID & INVPCID generically
> without
> > limiting performance gain to specific workload of KPTI usage of
> PCID.
> > For example, guest that use PCIDs for minimizing TLB flushes (Linux
> > has such mechanism) would have improved performance. However, the
> > main drawback of this approach is that GPTs mapping user-space
> > portion of process will have multiple SPTs mirroring them for
> > user/kernel PCIDs. Therefore, this approach utilize more SPT pages
> > and therefore could result in more MMU flushes. This is in contrast
> > to this patch series which can share SPTs mirroring the GPTs mapping
> 
> > user-space by both user PCID and kernel PCID.
> > 
> > Turned out to be quite long but wanted to just give my two cents
> > analysis on this series. :)
> 
> I agree that your approach should result in less code.  I am not sure
> where you stored the PCID bits, but if it's the hash key, then we
> might
> as well store fewer and force a TLB flush when there is a conflict.
> This should limit the amount of memory used by SPTEs.

To which PCID bits are you referring?

For every SPT I have added a 12-bit field for storing active PCID when SPT was created.
Unrelated, to support INVPCID, I added a per-vCPU bitmap of 0x1000 bits (512 bytes) that signals
to which PCIDs on next switch to them, we should force a SPT sync & TLB flush even if new CR3 bit 63 is set.
Because the bitmap is per-vCPU, it doesn't take memory from SPTs.

The reason I said my approach consumes more SPT pages is because SPTs mirroring the same GPT in same
paging-mode and same paging-level are duplicated between different PCIDs.

Also, I didn't quite got your opinion regarding the trade-off between the different approaches.

> 
> Paolo