Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users

Mathias Krause <minipli@xxxxxxxxxxxxxx> · Fri, 14 Apr 2023 11:29:20 +0200

On 06.04.23 15:22, Mathias Krause wrote:
> On 06.04.23 04:25, Sean Christopherson wrote:
>> On Sat, Mar 25, 2023, Greg KH wrote:
>>> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
>>>> As this is a huge performance fix for us, we'd like to get it integrated
>>>> into current stable kernels as well -- not without having the changes
>>>> get some wider testing, of course, i.e. not before they end up in a
>>>> non-rc version released by Linus. But I already did a backport to 5.4 to
>>>> get a feeling how hard it would be and for the impact it has on older
>>>> kernels.
>>>>
>>>> Using the 'ssdd 10 50000' test I used before, I get promising results
>>>> there as well. Without the patches it takes 9.31s, while with them we're
>>>> down to 4.64s. Taking into account that this is the runtime of a
>>>> workload in a VM that gets cut in half, I hope this qualifies as stable
>>>> material, as it's a huge performance fix.
>>>>
>>>> Greg, what's your opinion on it? Original series here:
>>>> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@xxxxxxxxxxxxxx/
>>>
>>> I'll leave the judgement call up to the KVM maintainers, as they are the
>>> ones that need to ack any KVM patch added to stable trees.
>>
>> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
>> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
>> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
>> mainline.
> 
> I totally agree. Getting the changes to work with older kernels needs
> more work. The MMU role handling was refactored in 5.14 and down to 5.4
> it differs even more, so backports to earlier kernels definitely needs
> more care.
> 
> My plan would be to limit backporting of the whole series to kernels
> down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
> before that only without patch 6. That would leave out the problematic
> change but still give us the benefits of dropping the needless mmu
> unloads for only toggling CR0.WP in the VM. This already helps us a lot!

To back up the "helps us a lot" with some numbers, here are the results
I got from running the 'ssdd 10 50000' micro-benchmark on the backports
I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
stated below; runtime in seconds, lower is better):

                          legacy      TDP   shadow
    Linux v5.4.240         8.87s        -    56.8s
    + patches              5.84s        -    55.4s

    Linux v5.10.177       10.37s    88.7s    69.7s
    + patches              4.88s    4.92s    70.1s

    Linux v5.15.106        9.94s    66.1s    64.9s
    + patches              4.81s    4.79s    64.6s

    Linux v6.1.23          7.65s    8.23s    68.7s
    + patches              3.36s    3.36s    69.1s

    Linux v6.2.10          7.61s    7.98s    68.6s
    + patches              3.37s    3.41s    70.2s

I guess we can largely ignore the shadow MMU numbers, besides noting that
they regress from v5.4 to v5.10 (something to investigate?). The backports
don't help (much) for shadow MMU setups and the fluctuation in the
measurements is likely related to the slab allocations involved.

Another unrelated data point is that TDP MMU is really broken for our
use case on v5.10 and v5.15 -- it's even slower than shadow paging!

OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
on v5.10.
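
For anyone who wants to double-check the ratios, here is a small Python
sketch that recomputes them from the table above. It only covers the
kernels that got the full series and skips the shadow column. The names
RUNTIMES, speedup() and speedups() are made up for this illustration and
are not part of the series or of any benchmark tooling.

```python
# Runtimes in seconds, copied from the table above as (before, after).
# Only kernels that received the full series; shadow MMU is left out.
RUNTIMES = {
    "v5.10.177": {"legacy": (10.37, 4.88), "TDP": (88.7, 4.92)},
    "v5.15.106": {"legacy": (9.94, 4.81), "TDP": (66.1, 4.79)},
    "v6.1.23": {"legacy": (7.65, 3.36), "TDP": (8.23, 3.36)},
    "v6.2.10": {"legacy": (7.61, 3.37), "TDP": (7.98, 3.41)},
}


def speedup(before, after):
    """Return how many times faster the patched run is."""
    return before / after


def speedups(runtimes):
    """Map kernel -> MMU mode -> speed-up factor."""
    return {
        kernel: {mmu: speedup(*pair) for mmu, pair in modes.items()}
        for kernel, modes in runtimes.items()
    }


if __name__ == "__main__":
    for kernel, modes in speedups(RUNTIMES).items():
        cells = ", ".join(f"{mmu}: {x:.1f}x" for mmu, x in modes.items())
        print(f"{kernel:<10} {cells}")
```

The legacy MMU factors all land a bit above 2x, and the v5.10 TDP MMU
case is the outlier at roughly 18x.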

I backported the whole series down to v5.10 but left out the CR0.WP
guest owning patch+fix for v5.4 as the code base is too different to get
all the nuances right, as Sean already hinted. However, even this
limited backport provides a big performance fix for our use case!

Thanks,
Mathias